Simplify document search at scale with intelligent search bot on AWS

Post Syndicated from Rostislav Markov original https://aws.amazon.com/blogs/architecture/simplify-document-search-at-scale-with-intelligent-search-bot-on-aws/

Enterprise document management systems (EDMS) manage the lifecycle and distribution of documents. They often rely on keyword-based search functionality. However, it increasingly becomes hard to discover documents as such repositories grow to tens of thousands of items.

In this blog, we discuss how Amazon Web Services (AWS) built an intelligent search bot on top of the document repository of a global life sciences company. Before this enhancement, the native repository search function relied solely on keywords and document names, leading to a trial-and-error process. Now, scientists can effortlessly query the repository using natural language to locate the right document in a few seconds—even through voice commands while working with lab equipment.

Use case

In life sciences, documentation is critical for regulatory compliance with GxP. Scientists in life sciences use EDMS on a daily basis to retrieve standard operating procedures (SOPs). SOPs contain task-level instructions (for example, how to monitor lab equipment or use utilities such as steam generators and chilled water circulation pumps).

EDMS search capability is often limited to file names and metadata. In our use case, file names were numerical and metadata was typically a single-sentence description plus keywords.

Scientists wanted to query for a particular context and type of task they’re about to perform. However, this data was not extracted from document content and thus not readily available for search purposes. Scientists also wanted to be able to search by using a voice interface (for example, when they operate on lab equipment).

To address this, we designed a conversational bot interface. This bot locates the most relevant SOP based on pre-generated document extracts and returns a hyperlink to the most suitable document, as shown in Figure 1.

Example of document search prompt and chatbot response

Figure 1. Example of document search prompt and chatbot response

Overview of the intelligent search bot solution

Our solution requirements for intelligent search were:

  • Semantic search index and ranking based on full text
  • Search capability through voice and text
  • Out-of-the-box integration with web applications and mobile devices

We choose Amazon Lex to provide the conversational interface using text or speech. Lex bots can be integrated with web and mobile applications using AWS Amplify Interactions. We used Amazon Kendra to create an intelligent search index on top of the data repository, which we hosted on Amazon Simple Storage Service (Amazon S3).

The advantage of using an Amazon Kendra index is its out-of-the-box semantic search and ranking capability based on document content. We use AWS Lambda to take care of Amazon S3 path mappings, and document attribute formatting for replicated documents, in order to make them retrievable by Amazon Kendra.

Intelligent search bot for enterprise document management systems

Figure 2. Intelligent search bot for enterprise document management systems

Benefits of integrating intelligent search bot with your EDMS

The benefits of extending EDMS with intelligent search bot include:

  • Improved usability by adding text- and speech-based channels to match user situations (for example, scientists operating on lab equipment)
  • Native, out-of-the-box integration with third-party systems (for example, Adobe Experience Manager, Alfresco, HubSpot, Marketo, Salesforce)
  • Implementation timeboxed in a two-week agile sprint and requires no data science skills

Extensibility to large language models

The Amazon Kendra Retrieve API allows the extension to a document retrieval chain pattern using a large language model (LLM) from Amazon Bedrock or Amazon SageMaker Jumpstart. With this pattern, Amazon Kendra generated document summaries can be passed to the LLM for correlation, as shown in Figure 3. Consult the LangChain documentation to learn how to configure retrieval chains.

Extending the document search bot to large language model

Figure 3. Extending the document search bot to large language model

In our use case, the effort for such extension went beyond the incremental optimizations and limited migration timeframe. Also, preference for a chain pattern should be given to complex correlations across documents.

This wasn’t the case here, as documents were functionally disjunct (for example, by device type, geographical site, and process task). Therefore, document retrieval with the Amazon Kendra API was sufficient and we deferred the extra effort associated with custom-built LLM prompt layers.

Implementation considerations

We started the EDMS migration to AWS by replicating the data repository to Amazon S3 by using AWS DataSync. We stored every document and corresponding metadata files as separate Amazon S3 objects.

For the Amazon Kendra index mappings to initialize properly:

  • Metadata must be a JSON Amazon S3 object
  • Follow the name convention <document>.<extension>.metadata.json
  • Have reserved or common document attributes correctly formatted

The EDMS system did not adhere to this when generating metadata files so we offloaded the transformation to a Lambda function. The function fixed metadata attributes such as, changing version type (_version) from numeric to string and date (_created_at) from string to ISO8601. It also changed metadata names/Amazon S3 paths by enacting new objects (PutObject API) and deleting the original objects (DeleteObject API).

We configured Lambda invocation on Amazon S3 PutObject operations using Amazon S3 event notifications. We set the sync run schedule for the Amazon Kendra index to run on demand.

Alternatively, you can run it on a predefined schedule or as part of each Lambda invocation (using the update_index boto3 operation). Finally, we monitor for sync run fails associated with the Amazon Kendra index using Amazon CloudWatch.

Conclusion

This blog showed how you can enhance the keyword-based search of your EDMS. We embedded document search queries behind a chatbot to simplify user interaction.

You can build this solution quickly, with no machine learning skills, as part of your EDMS cloud migration. In more advanced use cases, including complete refactoring, consider extending this solution to use a large language model from Amazon Bedrock or Amazon SageMaker Jumpstart.

Related information

8x NVIDIA Grace Hopper Superchips in a Blade HPE Cray EX254n at GTC 2024

Post Syndicated from Cliff Robinson original https://www.servethehome.com/8x-nvidia-grace-hopper-superchips-arm-in-a-blade-hpe-cray-ex254n-at-gtc-2024/

We found the HPE Cray EX254n at NVIDIA GTC 2024. This is an 8x NVIDIA Grace Hopper liquid cooled blade for HPE’s supercomputing platforms

The post 8x NVIDIA Grace Hopper Superchips in a Blade HPE Cray EX254n at GTC 2024 appeared first on ServeTheHome.

Магистрати с имоти и сметки в чужбина В “Списъка на Рашков” са Адалберт Кръстев и бивша на Еврото

Post Syndicated from Екип на Биволъ original https://bivol.bg/rashkov-list-2.html

четвъртък 21 март 2024


Шефът на военната прокуратура Адалберт Кръстев и следовател Оля Антонова, която е живяла на семейни начала с Пепи Еврото, са най-ярките имена в списъка на прокурори и следователи с имоти…

[$] Hardening the kernel against heap-spraying attacks

Post Syndicated from corbet original https://lwn.net/Articles/965837/

While a programming error in the kernel may be subject to direct
exploitation, usually a more roundabout approach is required to take
advantage of a security bug. One popular approach for those wishing to
take advantage of vulnerabilities is heap spraying, and
it has often been employed to compromise the kernel. In the future,
though, heap-spraying attacks may be a bit harder to pull off, thanks to the
“dedicated bucket allocator”
proposed by Kees Cook.

Build an end-to-end serverless streaming pipeline with Apache Kafka on Amazon MSK using Python

Post Syndicated from Masudur Rahaman Sayem original https://aws.amazon.com/blogs/big-data/build-an-end-to-end-serverless-streaming-pipeline-with-apache-kafka-on-amazon-msk-using-python/

The volume of data generated globally continues to surge, from gaming, retail, and finance, to manufacturing, healthcare, and travel. Organizations are looking for more ways to quickly use the constant inflow of data to innovate for their businesses and customers. They have to reliably capture, process, analyze, and load the data into a myriad of data stores, all in real time.

Apache Kafka is a popular choice for these real-time streaming needs. However, it can be challenging to set up a Kafka cluster along with other data processing components that scale automatically depending on your application’s needs. You risk under-provisioning for peak traffic, which can lead to downtime, or over-provisioning for base load, leading to wastage. AWS offers multiple serverless services like Amazon Managed Streaming for Apache Kafka (Amazon MSK), Amazon Data Firehose, Amazon DynamoDB, and AWS Lambda that scale automatically depending on your needs.

In this post, we explain how you can use some of these services, including MSK Serverless, to build a serverless data platform to meet your real-time needs.

Solution overview

Let’s imagine a scenario. You’re responsible for managing thousands of modems for an internet service provider deployed across multiple geographies. You want to monitor the modem connectivity quality that has a significant impact on customer productivity and satisfaction. Your deployment includes different modems that need to be monitored and maintained to ensure minimal downtime. Each device transmits thousands of 1 KB records every second, such as CPU usage, memory usage, alarm, and connection status. You want real-time access to this data so you can monitor performance in real time, and detect and mitigate issues quickly. You also need longer-term access to this data for machine learning (ML) models to run predictive maintenance assessments, find optimization opportunities, and forecast demand.

Your clients that gather the data onsite are written in Python, and they can send all the data as Apache Kafka topics to Amazon MSK. For your application’s low-latency and real-time data access, you can use Lambda and DynamoDB. For longer-term data storage, you can use managed serverless connector service Amazon Data Firehose to send data to your data lake.

The following diagram shows how you can build this end-to-end serverless application.

end-to-end serverless application

Let’s follow the steps in the following sections to implement this architecture.

Create a serverless Kafka cluster on Amazon MSK

We use Amazon MSK to ingest real-time telemetry data from modems. Creating a serverless Kafka cluster is straightforward on Amazon MSK. It only takes a few minutes using the AWS Management Console or AWS SDK. To use the console, refer to Getting started using MSK Serverless clusters. You create a serverless cluster, AWS Identity and Access Management (IAM) role, and client machine.

Create a Kafka topic using Python

When your cluster and client machine are ready, SSH to your client machine and install Kafka Python and the MSK IAM library for Python.

  • Run the following commands to install Kafka Python and the MSK IAM library:
pip install kafka-python

pip install aws-msk-iam-sasl-signer-python
  • Create a new file called createTopic.py.
  • Copy the following code into this file, replacing the bootstrap_servers and region information with the details for your cluster. For instructions on retrieving the bootstrap_servers information for your MSK cluster, see Getting the bootstrap brokers for an Amazon MSK cluster.
from kafka.admin import KafkaAdminClient, NewTopic
from aws_msk_iam_sasl_signer import MSKAuthTokenProvider

# AWS region where MSK cluster is located
region= '<UPDATE_AWS_REGION_NAME_HERE>'

# Class to provide MSK authentication token
class MSKTokenProvider():
    def token(self):
        token, _ = MSKAuthTokenProvider.generate_auth_token(region)
        return token

# Create an instance of MSKTokenProvider class
tp = MSKTokenProvider()

# Initialize KafkaAdminClient with required configurations
admin_client = KafkaAdminClient(
    bootstrap_servers='<UPDATE_BOOTSTRAP_SERVER_STRING_HERE>',
    security_protocol='SASL_SSL',
    sasl_mechanism='OAUTHBEARER',
    sasl_oauth_token_provider=tp,
    client_id='client1',
)

# create topic
topic_name="mytopic"
topic_list =[NewTopic(name=topic_name, num_partitions=1, replication_factor=2)]
existing_topics = admin_client.list_topics()
if(topic_name not in existing_topics):
    admin_client.create_topics(topic_list)
    print("Topic has been created")
else:
    print("topic already exists!. List of topics are:" + str(existing_topics))
  • Run the createTopic.py script to create a new Kafka topic called mytopic on your serverless cluster:
python createTopic.py

Produce records using Python

Let’s generate some sample modem telemetry data.

  • Create a new file called kafkaDataGen.py.
  • Copy the following code into this file, updating the BROKERS and region information with the details for your cluster:
from kafka import KafkaProducer
from aws_msk_iam_sasl_signer import MSKAuthTokenProvider
import json
import random
from datetime import datetime
topicname='mytopic'

BROKERS = '<UPDATE_BOOTSTRAP_SERVER_STRING_HERE>'
region= '<UPDATE_AWS_REGION_NAME_HERE>'
class MSKTokenProvider():
    def token(self):
        token, _ = MSKAuthTokenProvider.generate_auth_token(region)
        return token

tp = MSKTokenProvider()

producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
    retry_backoff_ms=500,
    request_timeout_ms=20000,
    security_protocol='SASL_SSL',
    sasl_mechanism='OAUTHBEARER',
    sasl_oauth_token_provider=tp,)

# Method to get a random model name
def getModel():
    products=["Ultra WiFi Modem", "Ultra WiFi Booster", "EVG2000", "Sagemcom 5366 TN", "ASUS AX5400"]
    randomnum = random.randint(0, 4)
    return (products[randomnum])

# Method to get a random interface status
def getInterfaceStatus():
    status=["connected", "connected", "connected", "connected", "connected", "connected", "connected", "connected", "connected", "connected", "connected", "connected", "down", "down"]
    randomnum = random.randint(0, 13)
    return (status[randomnum])

# Method to get a random CPU usage
def getCPU():
    i = random.randint(50, 100)
    return (str(i))

# Method to get a random memory usage
def getMemory():
    i = random.randint(1000, 1500)
    return (str(i))
    
# Method to generate sample data
def generateData():
    
    model=getModel()
    deviceid='dvc' + str(random.randint(1000, 10000))
    interface='eth4.1'
    interfacestatus=getInterfaceStatus()
    cpuusage=getCPU()
    memoryusage=getMemory()
    now = datetime.now()
    event_time = now.strftime("%Y-%m-%d %H:%M:%S")
    
    modem_data={}
    modem_data["model"]=model
    modem_data["deviceid"]=deviceid
    modem_data["interface"]=interface
    modem_data["interfacestatus"]=interfacestatus
    modem_data["cpuusage"]=cpuusage
    modem_data["memoryusage"]=memoryusage
    modem_data["event_time"]=event_time
    return modem_data

# Continuously generate and send data
while True:
    data =generateData()
    print(data)
    try:
        future = producer.send(topicname, value=data)
        producer.flush()
        record_metadata = future.get(timeout=10)
        
    except Exception as e:
        print(e.with_traceback())
  • Run the kafkaDataGen.py to continuously generate random data and publish it to the specified Kafka topic:
python kafkaDataGen.py

Store events in Amazon S3

Now you store all the raw event data in an Amazon Simple Storage Service (Amazon S3) data lake for analytics. You can use the same data to train ML models. The integration with Amazon Data Firehose allows Amazon MSK to seamlessly load data from your Apache Kafka clusters into an S3 data lake. Complete the following steps to continuously stream data from Kafka to Amazon S3, eliminating the need to build or manage your own connector applications:

  • On the Amazon S3 console, create a new bucket. You can also use an existing bucket.
  • Create a new folder in your S3 bucket called streamingDataLake.
  • On the Amazon MSK console, choose your MSK Serverless cluster.
  • On the Actions menu, choose Edit cluster policy.

cluster policy

  • Select Include Firehose service principal and choose Save changes.

firehose service principal

  • On the S3 delivery tab, choose Create delivery stream.

delivery stream

  • For Source, choose Amazon MSK.
  • For Destination, choose Amazon S3.

source and destination

  • For Amazon MSK cluster connectivity, select Private bootstrap brokers.
  • For Topic, enter a topic name (for this post, mytopic).

source settings

  • For S3 bucket, choose Browse and choose your S3 bucket.
  • Enter streamingDataLake as your S3 bucket prefix.
  • Enter streamingDataLakeErr as your S3 bucket error output prefix.

destination settings

  • Choose Create delivery stream.

create delivery stream

You can verify that the data was written to your S3 bucket. You should see that the streamingDataLake directory was created and the files are stored in partitions.

amazon s3

Store events in DynamoDB

For the last step, you store the most recent modem data in DynamoDB. This allows the client application to access the modem status and interact with the modem remotely from anywhere, with low latency and high availability. Lambda seamlessly works with Amazon MSK. Lambda internally polls for new messages from the event source and then synchronously invokes the target Lambda function. Lambda reads the messages in batches and provides these to your function as an event payload.

Lets first create a table in DynamoDB. Refer to DynamoDB API permissions: Actions, resources, and conditions reference to verify that your client machine has the necessary permissions.

  • Create a new file called createTable.py.
  • Copy the following code into the file, updating the region information:
import boto3
region='<UPDATE_AWS_REGION_NAME_HERE>'
dynamodb = boto3.client('dynamodb', region_name=region)
table_name = 'device_status'
key_schema = [
    {
        'AttributeName': 'deviceid',
        'KeyType': 'HASH'
    }
]
attribute_definitions = [
    {
        'AttributeName': 'deviceid',
        'AttributeType': 'S'
    }
]
# Create the table with on-demand capacity mode
dynamodb.create_table(
    TableName=table_name,
    KeySchema=key_schema,
    AttributeDefinitions=attribute_definitions,
    BillingMode='PAY_PER_REQUEST'
)
print(f"Table '{table_name}' created with on-demand capacity mode.")
  • Run the createTable.py script to create a table called device_status in DynamoDB:
python createTable.py

Now let’s configure the Lambda function.

  • On the Lambda console, choose Functions in the navigation pane.
  • Choose Create function.
  • Select Author from scratch.
  • For Function name¸ enter a name (for example, my-notification-kafka).
  • For Runtime, choose Python 3.11.
  • For Permissions, select Use an existing role and choose a role with permissions to read from your cluster.
  • Create the function.

On the Lambda function configuration page, you can now configure sources, destinations, and your application code.

  • Choose Add trigger.
  • For Trigger configuration, enter MSK to configure Amazon MSK as a trigger for the Lambda source function.
  • For MSK cluster, enter myCluster.
  • Deselect Activate trigger, because you haven’t configured your Lambda function yet.
  • For Batch size, enter 100.
  • For Starting position, choose Latest.
  • For Topic name¸ enter a name (for example, mytopic).
  • Choose Add.
  • On the Lambda function details page, on the Code tab, enter the following code:
import base64
import boto3
import json
import os
import random

def convertjson(payload):
    try:
        aa=json.loads(payload)
        return aa
    except:
        return 'err'

def lambda_handler(event, context):
    base64records = event['records']['mytopic-0']
    
    raw_records = [base64.b64decode(x["value"]).decode('utf-8') for x in base64records]
    
    for record in raw_records:
        item = json.loads(record)
        deviceid=item['deviceid']
        interface=item['interface']
        interfacestatus=item['interfacestatus']
        cpuusage=item['cpuusage']
        memoryusage=item['memoryusage']
        event_time=item['event_time']
        
        dynamodb = boto3.client('dynamodb')
        table_name = 'device_status'
        item = {
            'deviceid': {'S': deviceid},  
            'interface': {'S': interface},               
            'interface': {'S': interface},
            'interfacestatus': {'S': interfacestatus},
            'cpuusage': {'S': cpuusage},          
            'memoryusage': {'S': memoryusage},
            'event_time': {'S': event_time},
        }
        
        # Write the item to the DynamoDB table
        response = dynamodb.put_item(
            TableName=table_name,
            Item=item
        )
        
        print(f"Item written to DynamoDB")
  • Deploy the Lambda function.
  • On the Configuration tab, choose Edit to edit the trigger.

edit trigger

  • Select the trigger, then choose Save.
  • On the DynamoDB console, choose Explore items in the navigation pane.
  • Select the table device_status.

You will see Lambda is writing events generated in the Kafka topic to DynamoDB.

ddb table

Summary

Streaming data pipelines are critical for building real-time applications. However, setting up and managing the infrastructure can be daunting. In this post, we walked through how to build a serverless streaming pipeline on AWS using Amazon MSK, Lambda, DynamoDB, Amazon Data Firehose, and other services. The key benefits are no servers to manage, automatic scalability of the infrastructure, and a pay-as-you-go model using fully managed services.

Ready to build your own real-time pipeline? Get started today with a free AWS account. With the power of serverless, you can focus on your application logic while AWS handles the undifferentiated heavy lifting. Let’s build something awesome on AWS!


About the Authors

Masudur Rahaman Sayem is a Streaming Data Architect at AWS. He works with AWS customers globally to design and build data streaming architectures to solve real-world business problems. He specializes in optimizing solutions that use streaming data services and NoSQL. Sayem is very passionate about distributed computing.

Michael Oguike is a Product Manager for Amazon MSK. He is passionate about using data to uncover insights that drive action. He enjoys helping customers from a wide range of industries improve their businesses using data streaming. Michael also loves learning about behavioral science and psychology from books and podcasts.

Unlock insights on Amazon RDS for MySQL data with zero-ETL integration to Amazon Redshift

Post Syndicated from Milind Oke original https://aws.amazon.com/blogs/big-data/unlock-insights-on-amazon-rds-for-mysql-data-with-zero-etl-integration-to-amazon-redshift/

Amazon Relational Database Service (Amazon RDS) for MySQL zero-ETL integration with Amazon Redshift was announced in preview at AWS re:Invent 2023 for Amazon RDS for MySQL version 8.0.28 or higher. In this post, we provide step-by-step guidance on how to get started with near real-time operational analytics using this feature. This post is a continuation of the zero-ETL series that started with Getting started guide for near-real time operational analytics using Amazon Aurora zero-ETL integration with Amazon Redshift.

Challenges

Customers across industries today are looking to use data to their competitive advantage and increase revenue and customer engagement by implementing near real time analytics use cases like personalization strategies, fraud detection, inventory monitoring, and many more. There are two broad approaches to analyzing operational data for these use cases:

  • Analyze the data in-place in the operational database (such as read replicas, federated query, and analytics accelerators)
  • Move the data to a data store optimized for running use case-specific queries such as a data warehouse

The zero-ETL integration is focused on simplifying the latter approach.

The extract, transform, and load (ETL) process has been a common pattern for moving data from an operational database to an analytics data warehouse. ELT is where the extracted data is loaded as is into the target first and then transformed. ETL and ELT pipelines can be expensive to build and complex to manage. With multiple touchpoints, intermittent errors in ETL and ELT pipelines can lead to long delays, leaving data warehouse applications with stale or missing data, further leading to missed business opportunities.

Alternatively, solutions that analyze data in-place may work great for accelerating queries on a single database, but such solutions aren’t able to aggregate data from multiple operational databases for customers that need to run unified analytics.

Zero-ETL

Unlike the traditional systems where data is siloed in one database and the user has to make a trade-off between unified analysis and performance, data engineers can now replicate data from multiple RDS for MySQL databases into a single Redshift data warehouse to derive holistic insights across many applications or partitions. Updates in transactional databases are automatically and continuously propagated to Amazon Redshift so data engineers have the most recent information in near real time. There is no infrastructure to manage and the integration can automatically scale up and down based on the data volume.

At AWS, we have been making steady progress towards bringing our zero-ETL vision to life. The following sources are currently supported for zero-ETL integrations:

When you create a zero-ETL integration for Amazon Redshift, you continue to pay for underlying source database and target Redshift database usage. Refer to Zero-ETL integration costs (Preview) for further details.

With zero-ETL integration with Amazon Redshift, the integration replicates data from the source database into the target data warehouse. The data becomes available in Amazon Redshift within seconds, allowing you to use the analytics features of Amazon Redshift and capabilities like data sharing, workload optimization autonomics, concurrency scaling, machine learning, and many more. You can continue with your transaction processing on Amazon RDS or Amazon Aurora while simultaneously using Amazon Redshift for analytics workloads such as reporting and dashboards.

The following diagram illustrates this architecture.

AWS architecture diagram showcasing example zero-ETL architecture

Solution overview

Let’s consider TICKIT, a fictional website where users buy and sell tickets online for sporting events, shows, and concerts. The transactional data from this website is loaded into an Amazon RDS for MySQL 8.0.28 (or higher version) database. The company’s business analysts want to generate metrics to identify ticket movement over time, success rates for sellers, and the best-selling events, venues, and seasons. They would like to get these metrics in near real time using a zero-ETL integration.

The integration is set up between Amazon RDS for MySQL (source) and Amazon Redshift (destination). The transactional data from the source gets refreshed in near real time on the destination, which processes analytical queries.

You can use either the serverless option or an encrypted RA3 cluster for Amazon Redshift. For this post, we use a provisioned RDS database and a Redshift provisioned data warehouse.

The following diagram illustrates the high-level architecture.

High-level zero-ETL architecture for TICKIT data use case

The following are the steps needed to set up zero-ETL integration. These steps can be done automatically by the zero-ETL wizard, but you will require a restart if the wizard changes the setting for Amazon RDS or Amazon Redshift. You could do these steps manually, if not already configured, and perform the restarts at your convenience. For the complete getting started guides, refer to Working with Amazon RDS zero-ETL integrations with Amazon Redshift (preview) and Working with zero-ETL integrations.

  1. Configure the RDS for MySQL source with a custom DB parameter group.
  2. Configure the Redshift cluster to enable case-sensitive identifiers.
  3. Configure the required permissions.
  4. Create the zero-ETL integration.
  5. Create a database from the integration in Amazon Redshift.

Configure the RDS for MySQL source with a customized DB parameter group

To create an RDS for MySQL database, complete the following steps:

  1. On the Amazon RDS console, create a DB parameter group called zero-etl-custom-pg.

Zero-ETL integration works by using binary logs (binlogs) generated by MySQL database. To enable binlogs on Amazon RDS for MySQL, a specific set of parameters must be enabled.

  1. Set the following binlog cluster parameter settings:
    • binlog_format = ROW
    • binlog_row_image = FULL
    • binlog_checksum = NONE

In addition, make sure that the binlog_row_value_options parameter is not set to PARTIAL_JSON. By default, this parameter is not set.

  1. Choose Databases in the navigation pane, then choose Create database.
  2. For Engine Version, choose MySQL 8.0.28 (or higher).

Selected MySQL Community edition Engine version 8.0.36

  1. For Templates, select Production.
  2. For Availability and durability, select either Multi-AZ DB instance or Single DB instance (Multi-AZ DB clusters are not supported, as of this writing).
  3. For DB instance identifier, enter zero-etl-source-rms.

Selected Production template, Multi-AZ DB instance and DB instance identifier zero-etl-source-rms

  1. Under Instance configuration, select Memory optimized classes and choose the instance db.r6g.large, which should be sufficient for TICKIT use case.

Selected db.r6g.large for DB instance class under Instance configuration

  1. Under Additional configuration, for DB cluster parameter group, choose the parameter group you created earlier (zero-etl-custom-pg).

Selected DB parameter group zero-etl-custom-pg under Additional configuration

  1. Choose Create database.

In a couple of minutes, it should spin up an RDS for MySQL database as the source for zero-ETL integration.

RDS instance status showing as Available

Configure the Redshift destination

After you create your source DB cluster, you must create and configure a target data warehouse in Amazon Redshift. The data warehouse must meet the following requirements:

  • Using an RA3 node type (ra3.16xlarge, ra3.4xlarge, or ra3.xlplus) or Amazon Redshift Serverless
  • Encrypted (if using a provisioned cluster)

For our use case, create a Redshift cluster by completing the following steps:

  1. On the Amazon Redshift console, choose Configurations and then choose Workload management.
  2. In the parameter group section, choose Create.
  3. Create a new parameter group named zero-etl-rms.
  4. Choose Edit parameters and change the value of enable_case_sensitive_identifier to True.
  5. Choose Save.

You can also use the AWS Command Line Interface (AWS CLI) command update-workgroup for Redshift Serverless:

aws redshift-serverless update-workgroup --workgroup-name <your-workgroup-name> --config-parameters parameterKey=enable_case_sensitive_identifier,parameterValue=true

Cluster parameter group setup

  1. Choose Provisioned clusters dashboard.

At the top of you console window, you will see a Try new Amazon Redshift features in preview banner.

  1. Choose Create preview cluster.

Create preview cluster

  1. For Preview track, chose preview_2023.
  2. For Node type, choose one of the supported node types (for this post, we use ra3.xlplus).

Selected ra3.xlplus node type for preview cluster

  1. Under Additional configurations, expand Database configurations.
  2. For Parameter groups, choose zero-etl-rms.
  3. For Encryption, select Use AWS Key Management Service.

Database configuration showing parameter groups and encryption

  1. Choose Create cluster.

The cluster should become Available in a few minutes.

Cluster status showing as Available

  1. Navigate to the namespace zero-etl-target-rs-ns and choose the Resource policy tab.
  2. Choose Add authorized principals.
  3. Enter either the Amazon Resource Name (ARN) of the AWS user or role, or the AWS account ID (IAM principals) that are allowed to create integrations.

An account ID is stored as an ARN with root user.

Add authorized principals on the Clusters resource policy tab

  1. In the Authorized integration sources section, choose Add authorized integration source to add the ARN of the RDS for MySQL DB instance that’s the data source for the zero-ETL integration.

You can find this value by going to the Amazon RDS console and navigating to the Configuration tab of the zero-etl-source-rms DB instance.

Add authorized integration source to the Configuration tab of the zero-etl-source-rms DB instance

Your resource policy should resemble the following screenshot.

Completed resource policy setup

Configure required permissions

To create a zero-ETL integration, your user or role must have an attached identity-based policy with the appropriate AWS Identity and Access Management (IAM) permissions. An AWS account owner can configure required permissions for users or roles who may create zero-ETL integrations. The sample policy allows the associated principal to perform the following actions:

  • Create zero-ETL integrations for the source RDS for MySQL DB instance.
  • View and delete all zero-ETL integrations.
  • Create inbound integrations into the target data warehouse. This permission is not required if the same account owns the Redshift data warehouse and this account is an authorized principal for that data warehouse. Also note that Amazon Redshift has a different ARN format for provisioned and serverless clusters:
    • Provisioned arn:aws:redshift:{region}:{account-id}:namespace:namespace-uuid
    • Serverlessarn:aws:redshift-serverless:{region}:{account-id}:namespace/namespace-uuid

Complete the following steps to configure the permissions:

  1. On the IAM console, choose Policies in the navigation pane.
  2. Choose Create policy.
  3. Create a new policy called rds-integrations using the following JSON (replace region and account-id with your actual values):
{
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "rds:CreateIntegration"
        ],
        "Resource": [
            "arn:aws:rds:{region}:{account-id}:db:source-instancename",
            "arn:aws:rds:{region}:{account-id}:integration:*"
        ]
    },
    {
        "Effect": "Allow",
        "Action": [
            "rds:DescribeIntegration"
        ],
        "Resource": ["*"]
    },
    {
        "Effect": "Allow",
        "Action": [
            "rds:DeleteIntegration"
        ],
        "Resource": [
            "arn:aws:rds:{region}:{account-id}:integration:*"
        ]
    },
    {
        "Effect": "Allow",
        "Action": [
            "redshift:CreateInboundIntegration"
        ],
        "Resource": [
            "arn:aws:redshift:{region}:{account-id}:cluster:namespace-uuid"
        ]
    }]
}
  1. Attach the policy you created to your IAM user or role permissions.

Create the zero-ETL integration

To create the zero-ETL integration, complete the following steps:

  1. On the Amazon RDS console, choose Zero-ETL integrations in the navigation pane.
  2. Choose Create zero-ETL integration.

Create zero-ETL integration on the Amazon RDS console

  1. For Integration identifier, enter a name, for example zero-etl-demo.

Enter the Integration identifier

  1. For Source database, choose Browse RDS databases and choose the source cluster zero-etl-source-rms.
  2. Choose Next.

Browse RDS databases for zero-ETL source

  1. Under Target, for Amazon Redshift data warehouse, choose Browse Redshift data warehouses and choose the Redshift data warehouse (zero-etl-target-rs).
  2. Choose Next.

Browse Redshift data warehouses for zero-ETL integration

  1. Add tags and encryption, if applicable.
  2. Choose Next.
  3. Verify the integration name, source, target, and other settings.
  4. Choose Create zero-ETL integration.

Create zero-ETL integration step 4

You can choose the integration to view the details and monitor its progress. It took about 30 minutes for the status to change from Creating to Active.

Zero-ETL integration details

The time will vary depending on the size of your dataset in the source.

Create a database from the integration in Amazon Redshift

To create your database from the zero-ETL integration, complete the following steps:

  1. On the Amazon Redshift console, choose Clusters in the navigation pane.
  2. Open the zero-etl-target-rs cluster.
  3. Choose Query data to open the query editor v2.

Query data via the Query Editor v2

  1. Connect to the Redshift data warehouse by choosing Save.

Connect to the Redshift data warehouse

  1. Obtain the integration_id from the svv_integration system table:

select integration_id from svv_integration; -- copy this result, use in the next sql

Query for integration identifier

  1. Use the integration_id from the previous step to create a new database from the integration:

CREATE DATABASE zetl_source FROM INTEGRATION '<result from above>';

Create database from integration

The integration is now complete, and an entire snapshot of the source will reflect as is in the destination. Ongoing changes will be synced in near real time.

Analyze the near real time transactional data

Now we can run analytics on TICKIT’s operational data.

Populate the source TICKIT data

To populate the source data, complete the following steps:

  1. Copy the CSV input data files into a local directory. The following is an example command:

aws s3 cp 's3://redshift-blogs/zero-etl-integration/data/tickit' . --recursive

  1. Connect to your RDS for MySQL cluster and create a database or schema for the TICKIT data model, verify that the tables in that schema have a primary key, and initiate the load process:

mysql -h <rds_db_instance_endpoint> -u admin -p password --local-infile=1

Connect to your RDS for MySQL cluster and create a database or schema for the TICKIT data model

  1. Use the following CREATE TABLE commands.
  2. Load the data from local files using the LOAD DATA command.

The following is an example. Note that the input CSV file is broken into several files. This command must be run for every file if you would like to load all data. For demo purposes, a partial data load should work as well.

Create users table for demo

Analyze the source TICKIT data in the destination

On the Amazon Redshift console, open the query editor v2 using the database you created as part of the integration setup. Use the following code to validate the seed or CDC activity:

SELECT * FROM SYS_INTEGRATION_ACTIVITY ORDER BY last_commit_timestamp DESC;

Query to validate the seed or CDC activity

You can now apply your business logic for transformations directly on the data that has been replicated to the data warehouse. You can also use performance optimization techniques like creating a Redshift materialized view that joins the replicated tables and other local tables to improve query performance for your analytical queries.

Monitoring

You can query the following system views and tables in Amazon Redshift to get information about your zero-ETL integrations with Amazon Redshift:

To view the integration-related metrics published to Amazon CloudWatch, open the Amazon Redshift console. Choose Zero-ETL integrations in the navigation pane and choose the integration to display activity metrics.

Zero-ETL integration activity metrics

Available metrics on the Amazon Redshift console are integration metrics and table statistics, with table statistics providing details of each table replicated from Amazon RDS for MySQL to Amazon Redshift.

Integration metrics and table statistics

Integration metrics contain table replication success and failure counts and lag details.

Integration metrics showing table replication success and failure counts and lag details. Integration metrics showing table replication success and failure counts and lag details. Integration metrics showing table replication success and failure counts and lag details.

Manual resyncs

The zero-ETL integration will automatically initiate a resync if a table sync state shows as failed or resync required. But in case the auto resync fails, you can initiate a resync at table-level granularity:

ALTER DATABASE zetl_source INTEGRATION REFRESH TABLES tbl1, tbl2;

A table can enter a failed state for multiple reasons:

  • The primary key was removed from the table. In such cases, you need to re-add the primary key and perform the previously mentioned ALTER command.
  • An invalid value is encountered during replication or a new column is added to the table with an unsupported data type. In such cases, you need to remove the column with the unsupported data type and perform the previously mentioned ALTER command.
  • An internal error, in rare cases, can cause table failure. The ALTER command should fix it.

Clean up

When you delete a zero-ETL integration, your transactional data isn’t deleted from the source RDS or the target Redshift databases, but Amazon RDS doesn’t send any new changes to Amazon Redshift.

To delete a zero-ETL integration, complete the following steps:

  1. On the Amazon RDS console, choose Zero-ETL integrations in the navigation pane.
  2. Select the zero-ETL integration that you want to delete and choose Delete.
  3. To confirm the deletion, choose Delete.

delete a zero-ETL integration

Conclusion

In this post, we showed you how to set up a zero-ETL integration from Amazon RDS for MySQL to Amazon Redshift. This minimizes the need to maintain complex data pipelines and enables near real time analytics on transactional and operational data.

To learn more about Amazon RDS zero-ETL integration with Amazon Redshift, refer to Working with Amazon RDS zero-ETL integrations with Amazon Redshift (preview).


 About the Authors

Milind Oke is a senior Redshift specialist solutions architect who has worked at Amazon Web Services for three years. He is an AWS-certified SA Associate, Security Specialty and Analytics Specialty certification holder, based out of Queens, New York.

Aditya Samant is a relational database industry veteran with over 2 decades of experience working with commercial and open-source databases. He currently works at Amazon Web Services as a Principal Database Specialist Solutions Architect. In his role, he spends time working with customers designing scalable, secure and robust cloud native architectures. Aditya works closely with the service teams and collaborates on designing and delivery of the new features for Amazon’s managed databases.

Security updates for Thursday

Post Syndicated from jake original https://lwn.net/Articles/966246/

Security updates have been issued by Debian (pdns-recursor and php-dompdf-svg-lib), Fedora (grub2, libreswan, rubygem-yard, and thunderbird), Mageia (libtiff and python-scipy), Red Hat (golang, nodejs, and nodejs:16), Slackware (python3), and Ubuntu (linux, linux-azure, linux-azure-5.15, linux-azure-fde,
linux-azure-fde-5.15, linux-gcp, linux-gcp-5.15, linux-gke, linux-gkeop,
linux-gkeop-5.15, linux-hwe-5.15, linux-ibm, linux-ibm-5.15, linux-kvm,
linux-lowlatency, linux-lowlatency-hwe-5.15, linux-nvidia, linux, linux-azure, linux-gcp, linux-gcp-6.5, linux-hwe-6.5,
linux-lowlatency, linux-lowlatency-hwe-6.5, linux-oem-6.5, linux-oracle,
linux-oracle-6.5, linux-raspi, linux-starfive, linux-starfive-6.5, linux-aws, linux-aws-5.15, linux-aws, linux-aws-5.4, linux-gcp-5.4, linux-raspi, linux-raspi-5.4,
linux-xilinx-zynqmp, linux-gcp, linux-gcp-4.15, linux-kvm, linux-laptop, linux-oem-6.1, and linux-raspi).

The Experience AI Challenge: Find out all you need to know

Post Syndicated from Liz Eaton original https://www.raspberrypi.org/blog/the-experience-ai-challenge-find-out-all-you-need-to-know/

We’re really excited to see that Experience AI Challenge mentors are starting to submit AI projects created by young people. There’s still time for you to get involved in the Challenge: the submission deadline is 24 May 2024. 

The Experience AI Challenge banner.

If you want to find out more about the Challenge, join our live webinar on Wednesday 3 April at 15:30 BST on our YouTube channel.

During the webinar, you’ll have the chance to:

  • Ask your questions live. Get any Challenge-related queries answered by us in real time. Whether you need clarification on any part of the Challenge or just want advice on your young people’s project(s), this is your chance to ask.
  • Get introduced to the submission process. Understand the steps of submitting projects to the Challenge. We’ll walk you through the requirements and offer tips for making your young people’s submission stand out.
  • Learn more about our project feedback. Find out how we will deliver our personalised feedback on submitted projects (UK only).
  • Find out how we will recognise your creators’ achievements. Learn more about our showcase event taking place in July, and the certificates and posters we’re creating for you and your young people to celebrate submitting your projects.

Subscribe to our YouTube channel and press the ‘Notify me’ button to receive a notification when we go live. 

Why take part? 

The Experience AI Challenge, created by the Raspberry Pi Foundation in collaboration with Google DeepMind, guides young people under the age of 18, and their mentors, through the exciting process of creating their own unique artificial intelligence (AI) project. Participation is completely free.

Central to the Challenge is the concept of project-based learning, a hands-on approach that gets learners working together, thinking critically, and engaging deeply with the materials. 

A teacher and three students in a classroom. The teacher is pointing at a computer screen.

In the Challenge, young people are encouraged to seek out real-world problems and create possible AI-based solutions. By taking part, they become problem solvers, thinkers, and innovators. 

And to every young person based in the UK who creates a project for the Challenge, we will provide personalised feedback and a certificate of achievement, in recognition of their hard work and creativity. Any projects considered as outstanding by our experts will be selected as favourites and its creators will be invited to a showcase event in the summer. 

Resources ready for your classroom or club

You don’t need to be an AI expert to bring this Challenge to life in your classroom or coding club. Whether you’re introducing AI for the first time or looking to deepen your young people’s knowledge, the Challenge’s step-by-step resource pack covers all you and your young people need, from the basics of AI, to training a machine learning model, to creating a project in Scratch.  

In the resource pack, you will find:

  • The mentor guide contains all you need to set up and run the Challenge with your young people 
  • The creator guide supports young people throughout the Challenge and contains talking points to help with planning and designing projects 
  • The blueprint workbook helps creators keep track of their inspiration, ideas, and plans during the Challenge 

The pack offers a safety net of scaffolding, support, and troubleshooting advice. 

Find out more about the Experience AI Challenge

By bringing the Experience AI Challenge to young people, you’re inspiring the next generation of innovators, thinkers, and creators. The Challenge encourages young people to look beyond the code, to the impact of their creations, and to the possibilities of the future.

You can find out more about the Experience AI Challenge, and download the resource pack, from the Experience AI website.

The post The Experience AI Challenge: Find out all you need to know appeared first on Raspberry Pi Foundation.

Backblaze Now Available via Carahsoft’s NASPO Contract

Post Syndicated from Mary Ellen Cavanagh original https://backblazeprod.wpenginepowered.com/blog/backblaze-now-available-via-carahsofts-naspo-contract/

A decorative image showing three logos: Backblaze, carahsoft, and NASPO ValuePoint.

If you’re an IT professional for a state, local government, or educational institution or you’re a reseller serving those entities, you know firsthand how complex and time consuming procurement can truly be. Onboarding a cloud storage provider that meets both your needs and your organization’s procurement requirements is a challenge. And the need for affordable and secure data storage has never been greater—an incredible 79% of educational institutions reported being hit with ransomware in the past year. 

Today, choosing Backblaze as your preferred cloud storage provider just got a lot easier. Backblaze is now available to purchase via Carahsoft’s National Association of State Procurement Officials (NASPO) ValuePoint contract. 

The contract addition enables Carahsoft, The Trusted Government IT Solutions Provider®, and Backblaze to provide cloud storage solutions to participating states, local governments, and educational institutions. The contract also comes on the heels of Backblaze’s inclusion on Carahsoft’s NYOGS- and OMNIA-approved vendor lists.

What Is the NASPO ValuePoint Contract?

NASPO ValuePoint is a cooperative purchasing program that facilitates public procurement solicitations and agreements using a lead-state model, which means one state or organization takes the lead on soliciting proposals on behalf of others and working with a sourcing team to evaluate responses and choose a vendor. By leveraging the leadership and expertise of all states and the collective purchasing power of their public entities, NASPO ValuePoint delivers the highest valued, reliable, and competitively sourced contracts, offering public entities outstanding pricing.

Benefits to Customers

As a state, local government, or educational institution, you get a number of benefits by purchasing Backblaze through Carahsoft’s ValuePoint contract, including:

  • Simplified Procurement: You don’t have to go through the hassle of setting up your own contracts or negotiating prices. You can just use the NASPO ValuePoint contract, and that hard work is already taken care of.
  • Cost Savings: Because the ValuePoint contract covers lots of states and organizations, services are purchased in bulk, which usually means cheaper prices.
  • Time Savings: You save time researching suppliers or going through a long bidding process. You can just choose from the options already approved under the contract.
  • Quality Assurance: The contract usually has strict standards for its offerings, ensuring that you as a customer get access to quality products and services.

Benefits for Resellers

Resellers who are not currently listed on the NASPO ValuePoint Contract can still reap the benefits as well (and if you’re already listed, even better). By purchasing Backblaze through Carahsoft, you will gain:

  • Access to a Larger Market: Resellers can sell Backblaze to multiple states or organizations without having to negotiate separate contracts each time. This means more potential cloud customers.
  • Streamlined Sales Process: You don’t have to spend as much time and effort trying to win individual contracts. Now that Backblaze is on the NASPO ValuePoint Contract, you’re pre-approved to sell B2 Cloud Storage and Computer Backup to all participating entities.
  • Increased Credibility: With Backblaze being part of a trusted contract like NASPO ValuePoint, you can enhance your reputation and credibility in the cloud market, potentially attracting more customers.
  • Stable Revenue Stream: Having a contract with multiple states or organizations provides a more stable and predictable revenue stream for you as a reseller, as you have access to a broader customer base.

Making It Easier to Provision Backblaze

As a public sector agency, you face some of the greatest challenges when it comes to affordably protecting and using your data given ransomware attacks and budget constraints. Carahsoft’s NASPO program cuts through this complexity with cooperative purchasing, resulting in more favorable terms and conditions and competitive pricing. 

Previously, it was hard for many state, local, educational and government institutions to benefit from the affordability and reliability that Backblaze provides. Now, Backblaze’s addition to Carahsoft’s NASPO contract streamlines procurement of B2 Cloud Storage and Backblaze Computer Backup—speeding up your acquisition timeline.

The availability of the Backblaze portfolio to NASPO members strengthens our partnership by aligning with Government procurement processes and expanding Backblaze’s reach in the Public Sector market. It is critical we support the Government as they work to modernize their cloud data storage systems to meet the demands of an increasingly digital era. By collaborating with Backblaze and our reseller partners, we can continue to expand and improve agency access to the affordable, cutting-edge solutions they need to achieve mission success.

—John Rentz, MultiCloud Team Lead, Carahsoft

How to Purchase Backblaze via Carahsoft

Backblaze’s offerings are available through Carahsoft’s NASPO ValuePoint Master Agreement #AR2472 and OMNIA Partners Contract #R191902. To purchase, reach out to your preferred reseller or contact the Backblaze team. For more information about the NASPO ValuePoint Master Agreement, contact the Carahsoft team at [email protected].

More About Carahsoft

Carahsoft Technology Corp. is The Trusted Government IT Solutions Provider®, supporting public sector organizations across federal, state and local government agencies and education and healthcare markets. As the Master Government Aggregator® for our vendor partners, Carahsoft delivers solutions for multicloud, cybersecurity, DevSecOps, big data, artificial intelligence, open source, customer experience and engagement, and more.

The post Backblaze Now Available via Carahsoft’s NASPO Contract appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Public AI as an Alternative to Corporate AI

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2024/03/public-ai-as-an-alternative-to-corporate-ai.html

This mini-essay was my contribution to a round table on Power and Governance in the Age of AI.  It’s nothing I haven’t said here before, but for anyone who hasn’t read my longer essays on the topic, it’s a shorter introduction.

 

The increasingly centralized control of AI is an ominous sign. When tech billionaires and corporations steer AI, we get AI that tends to reflect the interests of tech billionaires and corporations, instead of the public. Given how transformative this technology will be for the world, this is a problem.

To benefit society as a whole we need an AI public option—not to replace corporate AI but to serve as a counterbalance—as well as stronger democratic institutions to govern all of AI. Like public roads and the federal postal system, a public AI option could guarantee universal access to this transformative technology and set an implicit standard that private services must surpass to compete.

Widely available public models and computing infrastructure would yield numerous benefits to the United States and to broader society. They would provide a mechanism for public input and oversight on the critical ethical questions facing AI development, such as whether and how to incorporate copyrighted works in model training, how to distribute access to private users when demand could outstrip cloud computing capacity, and how to license access for sensitive applications ranging from policing to medical use. This would serve as an open platform for innovation, on top of which researchers and small businesses—as well as mega-corporations—could build applications and experiment. Administered by a transparent and accountable agency, a public AI would offer greater guarantees about the availability, equitability, and sustainability of AI technology for all of society than would exclusively private AI development.

Federally funded foundation AI models would be provided as a public service, similar to a health care public option. They would not eliminate opportunities for private foundation models, but they could offer a baseline of price, quality, and ethical development practices that corporate players would have to match or exceed to compete.

The key piece of the ecosystem the government would dictate when creating an AI public option would be the design decisions involved in training and deploying AI foundation models. This is the area where transparency, political oversight, and public participation can, in principle, guarantee more democratically-aligned outcomes than an unregulated private market.

The need for such competent and faithful administration is not unique to AI, and it is not a problem we can look to AI to solve. Serious policymakers from both sides of the aisle should recognize the imperative for public-interested leaders to wrest control of the future of AI from unaccountable corporate titans. We do not need to reinvent our democracy for AI, but we do need to renovate and reinvigorate it to offer an effective alternative to corporate control that could erode our democracy.

Redis is no longer free software

Post Syndicated from corbet original https://lwn.net/Articles/966133/

The Redis in-memory database system has had
its license changed
to either the Redis Source Available
License
or the Server Side
Public License
(covered here in 2018);
neither license qualifies as free software.

Under the new license, cloud service providers hosting Redis
offerings will no longer be permitted to use the source code of
Redis free of charge. For example, cloud service providers will be
able to deliver Redis 7.4 only after agreeing to licensing terms
with Redis, the maintainers of the Redis code.

Distributors like Fedora are already looking
at
removing Redis as a consequence. (Thanks to Emmanuel Seyman).