Tag Archives: Kinesis Data Streams

Exploring real-time streaming for generative AI Applications

2024-03-25 Ali Alemi

Post Syndicated from Ali Alemi original https://aws.amazon.com/blogs/big-data/exploring-real-time-streaming-for-generative-ai-applications/

Foundation models (FMs) are large machine learning (ML) models trained on a broad spectrum of unlabeled and generalized datasets. FMs, as the name suggests, provide the foundation to build more specialized downstream applications, and are unique in their adaptability. They can perform a wide range of different tasks, such as natural language processing, classifying images, forecasting trends, analyzing sentiment, and answering questions. This scale and general-purpose adaptability are what makes FMs different from traditional ML models. FMs are multimodal; they work with different data types such as text, video, audio, and images. Large language models (LLMs) are a type of FM and are pre-trained on vast amounts of text data and typically have application uses such as text generation, intelligent chatbots, or summarization.

Streaming data facilitates the constant flow of diverse and up-to-date information, enhancing the models’ ability to adapt and generate more accurate, contextually relevant outputs. This dynamic integration of streaming data enables generative AI applications to respond promptly to changing conditions, improving their adaptability and overall performance in various tasks.

To better understand this, imagine a chatbot that helps travelers book their travel. In this scenario, the chatbot needs real-time access to airline inventory, flight status, hotel inventory, latest price changes, and more. This data usually comes from third parties, and developers need to find a way to ingest this data and process the data changes as they happen.

Batch processing is not the best fit in this scenario. When data changes rapidly, processing it in a batch may result in stale data being used by the chatbot, providing inaccurate information to the customer, which impacts the overall customer experience. Stream processing, however, can enable the chatbot to access real-time data and adapt to changes in availability and price, providing the best guidance to the customer and enhancing the customer experience.

Another example is an AI-driven observability and monitoring solution where FMs monitor real-time internal metrics of a system and produces alerts. When the model finds an anomaly or abnormal metric value, it should immediately produce an alert and notify the operator. However, the value of such important data diminishes significantly over time. These notifications should ideally be received within seconds or even while it’s happening. If operators receive these notifications minutes or hours after they happened, such an insight is not actionable and has potentially lost its value. You can find similar use cases in other industries such as retail, car manufacturing, energy, and the financial industry.

In this post, we discuss why data streaming is a crucial component of generative AI applications due to its real-time nature. We discuss the value of AWS data streaming services such as Amazon Managed Streaming for Apache Kafka (Amazon MSK), Amazon Kinesis Data Streams, Amazon Managed Service for Apache Flink, and Amazon Kinesis Data Firehose in building generative AI applications.

In-context learning

LLMs are trained with point-in-time data and have no inherent ability to access fresh data at inference time. As new data appears, you will have to continuously fine-tune or further train the model. This is not only an expensive operation, but also very limiting in practice because the rate of new data generation far supersedes the speed of fine-tuning. Additionally, LLMs lack contextual understanding and rely solely on their training data, and are therefore prone to hallucinations. This means they can generate a fluent, coherent, and syntactically sound but factually incorrect response. They are also devoid of relevance, personalization, and context.

LLMs, however, have the capacity to learn from the data they receive from the context to more accurately respond without modifying the model weights. This is called in-context learning, and can be used to produce personalized answers or provide an accurate response in the context of organization policies.

For example, in a chatbot, data events could pertain to an inventory of flights and hotels or price changes that are constantly ingested to a streaming storage engine. Furthermore, data events are filtered, enriched, and transformed to a consumable format using a stream processor. The result is made available to the application by querying the latest snapshot. The snapshot constantly updates through stream processing; therefore, the up-to-date data is provided in the context of a user prompt to the model. This allows the model to adapt to the latest changes in price and availability. The following diagram illustrates a basic in-context learning workflow.

A commonly used in-context learning approach is to use a technique called Retrieval Augmented Generation (RAG). In RAG, you provide the relevant information such as most relevant policy and customer records along with the user question to the prompt. This way, the LLM generates an answer to the user question using additional information provided as context. To learn more about RAG, refer to Question answering using Retrieval Augmented Generation with foundation models in Amazon SageMaker JumpStart.

A RAG-based generative AI application can only produce generic responses based on its training data and the relevant documents in the knowledge base. This solution falls short when a near-real-time personalized response is expected from the application. For example, a travel chatbot is expected to consider the user’s current bookings, available hotel and flight inventory, and more. Moreover, the relevant customer personal data (commonly known as the unified customer profile) is usually subject to change. If a batch process is employed to update the generative AI’s user profile database, the customer may receive dissatisfying responses based on old data.

In this post, we discuss the application of stream processing to enhance a RAG solution used for building question answering agents with context from real-time access to unified customer profiles and organizational knowledge base.

Near-real-time customer profile updates

Customer records are typically distributed across data stores within an organization. For your generative AI application to provide a relevant, accurate, and up-to-date customer profile, it is vital to build streaming data pipelines that can perform identity resolution and profile aggregation across the distributed data stores. Streaming jobs constantly ingest new data to synchronize across systems and can perform enrichment, transformations, joins, and aggregations across windows of time more efficiently. Change data capture (CDC) events contain information about the source record, updates, and metadata such as time, source, classification (insert, update, or delete), and the initiator of the change.

The following diagram illustrates an example workflow for CDC streaming ingestion and processing for unified customer profiles.

In this section, we discuss the main components of a CDC streaming pattern required to support RAG-based generative AI applications.

CDC streaming ingestion

A CDC replicator is a process that collects data changes from a source system (usually by reading transaction logs or binlogs) and writes CDC events with the exact same order they occurred in a streaming data stream or topic. This involves a log-based capture with tools such as AWS Database Migration Service (AWS DMS) or open source connectors such as Debezium for Apache Kafka connect. Apache Kafka Connect is part of the Apache Kafka environment, allowing data to be ingested from various sources and delivered to variety of destinations. You can run your Apache Kafka connector on Amazon MSK Connect within minutes without worrying about configuration, setup, and operating an Apache Kafka cluster. You only need to upload your connector’s compiled code to Amazon Simple Storage Service (Amazon S3) and set up your connector with your workload’s specific configuration.

There are also other methods for capturing data changes. For example, Amazon DynamoDB provides a feature for streaming CDC data to Amazon DynamoDB Streams or Kinesis Data Streams. Amazon S3 provides a trigger to invoke an AWS Lambda function when a new document is stored.

Streaming storage

Streaming storage functions as an intermediate buffer to store CDC events before they get processed. Streaming storage provides reliable storage for streaming data. By design, it is highly available and resilient to hardware or node failures and maintains the order of the events as they are written. Streaming storage can store data events either permanently or for a set period of time. This allows stream processors to read from part of the stream if there is a failure or a need for re-processing. Kinesis Data Streams is a serverless streaming data service that makes it straightforward to capture, process, and store data streams at scale. Amazon MSK is a fully managed, highly available, and secure service provided by AWS for running Apache Kafka.

Stream processing

Stream processing systems should be designed for parallelism to handle high data throughput. They should partition the input stream between multiple tasks running on multiple compute nodes. Tasks should be able to send the result of one operation to the next one over the network, making it possible for processing data in parallel while performing operations such as joins, filtering, enrichment, and aggregations. Stream processing applications should be able to process events with regards to the event time for use cases where events could arrive late or correct computation relies on the time events occur rather than the system time. For more information, refer to Notions of Time: Event Time and Processing Time.

Stream processes continuously produce results in the form of data events that need to be output to a target system. A target system could be any system that can integrate directly with the process or via streaming storage as in intermediary. Depending on the framework you choose for stream processing, you will have different options for target systems depending on available sink connectors. If you decide to write the results to an intermediary streaming storage, you can build a separate process that reads events and applies changes to the target system, such as running an Apache Kafka sink connector. Regardless of which option you choose, CDC data needs extra handling due to its nature. Because CDC events carry information about updates or deletes, it’s important that they merge in the target system in the right order. If changes are applied in the wrong order, the target system will be out of sync with its source.

Apache Flink is a powerful stream processing framework known for its low latency and high throughput capabilities. It supports event time processing, exactly-once processing semantics, and high fault tolerance. Additionally, it provides native support for CDC data via a special structure called dynamic tables. Dynamic tables mimic the source database tables and provide a columnar representation of the streaming data. The data in dynamic tables changes with every event that is processed. New records can be appended, updated, or deleted at any time. Dynamic tables abstract away the extra logic you need to implement for each record operation (insert, update, delete) separately. For more information, refer to Dynamic Tables.

With Amazon Managed Service for Apache Flink, you can run Apache Flink jobs and integrate with other AWS services. There are no servers and clusters to manage, and there is no compute and storage infrastructure to set up.

AWS Glue is a fully managed extract, transform, and load (ETL) service, which means AWS handles the infrastructure provisioning, scaling, and maintenance for you. Although it’s primarily known for its ETL capabilities, AWS Glue can also be used for Spark streaming applications. AWS Glue can interact with streaming data services such as Kinesis Data Streams and Amazon MSK for processing and transforming CDC data. AWS Glue can also seamlessly integrate with other AWS services such as Lambda, AWS Step Functions, and DynamoDB, providing you with a comprehensive ecosystem for building and managing data processing pipelines.

Unified customer profile

Overcoming the unification of the customer profile across a variety of source systems requires the development of robust data pipelines. You need data pipelines that can bring and synchronize all records into one data store. This data store provides your organization with the holistic customer records view that is needed for operational efficiency of RAG-based generative AI applications. For building such a data store, an unstructured data store would be best.

An identity graph is a useful structure for creating a unified customer profile because it consolidates and integrates customer data from various sources, ensures data accuracy and deduplication, offers real-time updates, connects cross-systems insights, enables personalization, enhances customer experience, and supports regulatory compliance. This unified customer profile empowers the generative AI application to understand and engage with customers effectively, and adhere to data privacy regulations, ultimately enhancing customer experiences and driving business growth. You can build your identity graph solution using Amazon Neptune, a fast, reliable, fully managed graph database service.

AWS provides a few other managed and serverless NoSQL storage service offerings for unstructured key-value objects. Amazon DocumentDB (with MongoDB compatibility) is a fast, scalable, highly available, and fully managed enterprise document database service that supports native JSON workloads. DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability.

Near-real-time organizational knowledge base updates

Similar to customer records, internal knowledge repositories such as company policies and organizational documents are siloed across storage systems. This is typically unstructured data and is updated in a non-incremental fashion. The use of unstructured data for AI applications is effective using vector embeddings, which is a technique of representing high dimensional data such as text files, images, and audio files as multi-dimensional numeric.

AWS provides several vector engine services, such as Amazon OpenSearch Serverless, Amazon Kendra, and Amazon Aurora PostgreSQL-Compatible Edition with the pgvector extension for storing vector embeddings. Generative AI applications can enhance the user experience by transforming the user prompt into a vector and use it to query the vector engine to retrieve contextually relevant information. Both the prompt and the vector data retrieved are then passed to the LLM to receive a more precise and personalized response.

The following diagram illustrates an example stream-processing workflow for vector embeddings.

Knowledge base contents need to be converted to vector embeddings before being written to the vector data store. Amazon Bedrock or Amazon SageMaker can help you access the model of your choice and expose a private endpoint for this conversion. Furthermore, you can use libraries such as LangChain to integrate with these endpoints. Building a batch process can help you convert your knowledge base content to vector data and store it in a vector database initially. However, you need to rely on an interval to reprocess the documents to synchronize your vector database with changes in your knowledge base content. With a large number of documents, this process can be inefficient. Between these intervals, your generative AI application users will receive answers according to the old content, or will receive an inaccurate answer because the new content is not vectorized yet.

Stream processing is an ideal solution for these challenges. It produces events as per existing documents initially and further monitors the source system and creates a document change event as soon as they occur. These events can be stored in streaming storage and wait to be processed by a streaming job. A streaming job reads these events, loads the content of the document, and transforms the contents to an array of related tokens of words. Each token further transforms into vector data via an API call to an embedding FM. Results are sent for storage to the vector storage via a sink operator.

If you’re using Amazon S3 for storing your documents, you can build an event-source architecture based on S3 object change triggers for Lambda. A Lambda function can create an event in the desired format and write that to your streaming storage.

You can also use Apache Flink to run as a streaming job. Apache Flink provides the native FileSystem source connector, which can discover existing files and read their contents initially. After that, it can continuously monitor your file system for new files and capture their content. The connector supports reading a set of files from distributed file systems such as Amazon S3 or HDFS with a format of plain text, Avro, CSV, Parquet, and more, and produces a streaming record. As a fully managed service, Managed Service for Apache Flink removes the operational overhead of deploying and maintaining Flink jobs, allowing you to focus on building and scaling your streaming applications. With seamless integration into the AWS streaming services such as Amazon MSK or Kinesis Data Streams, it provides features like automatic scaling, security, and resiliency, providing reliable and efficient Flink applications for handling real-time streaming data.

Based on your DevOps preference, you can choose between Kinesis Data Streams or Amazon MSK for storing the streaming records. Kinesis Data Streams simplifies the complexities of building and managing custom streaming data applications, allowing you to focus on deriving insights from your data rather than infrastructure maintenance. Customers using Apache Kafka often opt for Amazon MSK due to its straightforwardness, scalability, and dependability in overseeing Apache Kafka clusters within the AWS environment. As a fully managed service, Amazon MSK takes on the operational complexities associated with deploying and maintaining Apache Kafka clusters, enabling you to concentrate on constructing and expanding your streaming applications.

Because a RESTful API integration suits the nature of this process, you need a framework that supports a stateful enrichment pattern via RESTful API calls to track for failures and retry for the failed request. Apache Flink again is a framework that can do stateful operations in at-memory speed. To understand the best ways to make API calls via Apache Flink, refer to Common streaming data enrichment patterns in Amazon Kinesis Data Analytics for Apache Flink.

Apache Flink provides native sink connectors for writing data to vector datastores such as Amazon Aurora for PostgreSQL with pgvector or Amazon OpenSearch Service with VectorDB. Alternatively, you can stage the Flink job’s output (vectorized data) in an MSK topic or a Kinesis data stream. OpenSearch Service provides support for native ingestion from Kinesis data streams or MSK topics. For more information, refer to Introducing Amazon MSK as a source for Amazon OpenSearch Ingestion and Loading streaming data from Amazon Kinesis Data Streams.

Feedback analytics and fine-tuning

It’s important for data operation managers and AI/ML developers to get insight about the performance of the generative AI application and the FMs in use. To achieve that, you need to build data pipelines that calculate important key performance indicator (KPI) data based on the user feedback and variety of application logs and metrics. This information is useful for stakeholders to gain real-time insight about the performance of the FM, the application, and overall user satisfaction about the quality of support they receive from your application. You also need to collect and store the conversation history for further fine-tuning your FMs to improve their ability in performing domain-specific tasks.

This use case fits very well in the streaming analytics domain. Your application should store each conversation in streaming storage. Your application can prompt users about their rating of each answer’s accuracy and their overall satisfaction. This data can be in a format of a binary choice or a free form text. This data can be stored in a Kinesis data stream or MSK topic, and get processed to generate KPIs in real time. You can put FMs to work for users’ sentiment analysis. FMs can analyze each answer and assign a category of user satisfaction.

Apache Flink’s architecture allows for complex data aggregation over windows of time. It also provides support for SQL querying over stream of data events. Therefore, by using Apache Flink, you can quickly analyze raw user inputs and generate KPIs in real time by writing familiar SQL queries. For more information, refer to Table API & SQL.

With Amazon Managed Service for Apache Flink Studio, you can build and run Apache Flink stream processing applications using standard SQL, Python, and Scala in an interactive notebook. Studio notebooks are powered by Apache Zeppelin and use Apache Flink as the stream processing engine. Studio notebooks seamlessly combine these technologies to make advanced analytics on data streams accessible to developers of all skill sets. With support for user-defined functions (UDFs), Apache Flink allows for building custom operators to integrate with external resources such as FMs for performing complex tasks such as sentiment analysis. You can use UDFs to compute various metrics or enrich user feedback raw data with additional insights such as user sentiment. To learn more about this pattern, refer to Proactively addressing customer concern in real-time with GenAI, Flink, Apache Kafka, and Kinesis.

With Managed Service for Apache Flink Studio, you can deploy your Studio notebook as a streaming job with one click. You can use native sink connectors provided by Apache Flink to send the output to your storage of choice or stage it in a Kinesis data stream or MSK topic. Amazon Redshift and OpenSearch Service are both ideal for storing analytical data. Both engines provide native ingestion support from Kinesis Data Streams and Amazon MSK via a separate streaming pipeline to a data lake or data warehouse for analysis.

Amazon Redshift uses SQL to analyze structured and semi-structured data across data warehouses and data lakes, using AWS-designed hardware and machine learning to deliver the best price-performance at scale. OpenSearch Service offers visualization capabilities powered by OpenSearch Dashboards and Kibana (1.5 to 7.10 versions).

You can use the outcome of such analysis combined with user prompt data for fine-tuning the FM when is needed. SageMaker is the most straightforward way to fine-tune your FMs. Using Amazon S3 with SageMaker provides a powerful and seamless integration for fine-tuning your models. Amazon S3 serves as a scalable and durable object storage solution, enabling straightforward storage and retrieval of large datasets, training data, and model artifacts. SageMaker is a fully managed ML service that simplifies the entire ML lifecycle. By using Amazon S3 as the storage backend for SageMaker, you can benefit from the scalability, reliability, and cost-effectiveness of Amazon S3, while seamlessly integrating it with SageMaker training and deployment capabilities. This combination enables efficient data management, facilitates collaborative model development, and makes sure that ML workflows are streamlined and scalable, ultimately enhancing the overall agility and performance of the ML process. For more information, refer to Fine-tune Falcon 7B and other LLMs on Amazon SageMaker with @remote decorator.

With a file system sink connector, Apache Flink jobs can deliver data to Amazon S3 in open format (such as JSON, Avro, Parquet, and more) files as data objects. If you prefer to manage your data lake using a transactional data lake framework (such as Apache Hudi, Apache Iceberg, or Delta Lake), all of these frameworks provide a custom connector for Apache Flink. For more details, refer to Create a low-latency source-to-data lake pipeline using Amazon MSK Connect, Apache Flink, and Apache Hudi.

Summary

For a generative AI application based on a RAG model, you need to consider building two data storage systems, and you need to build data operations that keep them up to date with all the source systems. Traditional batch jobs are not sufficient to process the size and diversity of the data you need to integrate with your generative AI application. Delays in processing the changes in source systems result in an inaccurate response and reduce the efficiency of your generative AI application. Data streaming enables you to ingest data from a variety of databases across various systems. It also allows you to transform, enrich, join, and aggregate data across many sources efficiently in near-real time. Data streaming provides a simplified data architecture to collect and transform users’ real-time reactions or comments on the application responses, helping you deliver and store the results in a data lake for model fine-tuning. Data streaming also helps you optimize data pipelines by processing only the change events, allowing you to respond to data changes more quickly and efficiently.

Learn more about AWS data streaming services and get started building your own data streaming solution.

About the Authors

Ali Alemi is a Streaming Specialist Solutions Architect at AWS. Ali advises AWS customers with architectural best practices and helps them design real-time analytics data systems which are reliable, secure, efficient, and cost-effective. He works backward from customer’s use cases and designs data solutions to solve their business problems. Prior to joining AWS, Ali supported several public sector customers and AWS consulting partners in their application modernization journey and migration to the Cloud.

Imtiaz (Taz) Sayed is the World-Wide Tech Leader for Analytics at AWS. He enjoys engaging with the community on all things data and analytics. He can be reached via LinkedIn.

Invoke AWS Lambda functions from cross-account Amazon Kinesis Data Streams

2024-03-20 Amar Surjit

Post Syndicated from Amar Surjit original https://aws.amazon.com/blogs/big-data/invoke-aws-lambda-functions-from-cross-account-amazon-kinesis-data-streams/

A multi-account architecture on AWS is essential for enhancing security, compliance, and resource management by isolating workloads, enabling granular cost allocation, and facilitating collaboration across distinct environments. It also mitigates risks, improves scalability, and allows for advanced networking configurations.

In a streaming architecture, you may have event producers, stream storage, and event consumers in a single account or spread across different accounts depending on your business and IT requirements. For example, your company may want to centralize its clickstream data or log data from multiple different producers across different accounts. Data consumers from marketing, product engineering, or analytics require access to the same streaming data across accounts, which requires the ability to deliver a multi-account streaming architecture.

To build a multi-account streaming architecture, you can use Amazon Kinesis Data Streams as the stream storage and AWS Lambda as the event consumer. Amazon Kinesis Data Streams enables real-time processing of streaming data at scale. When integrated with Lambda, it allows for serverless data processing, enabling you to analyze and react to data streams in real time without managing infrastructure. This integration supports various use cases, including real-time analytics, log processing, Internet of Things (IoT) data ingestion, and more, making it valuable for businesses requiring timely insights from their streaming data. In this post, we demonstrate how you can process data ingested into a stream in one account with a Lambda function in another account.

The recent launch of Kinesis Data Streams support for resource-based policies enables invoking a Lambda from another account. With a resource-based policy, you can specify AWS accounts, AWS Identity and Access Management (IAM) users, or IAM roles and the exact Kinesis Data Streams actions for which you want to grant access. After access is granted, you can configure a Lambda function in another account to start processing the data stream belonging to your account. This reduces cost and simplifies the data processing pipeline, because you no longer have to copy streaming data using Lambda functions in both accounts. Sharing access to your data streams or registered consumers does not incur additional charges to your account. Cross-account usage of Kinesis Data Streams resources will continue to be billed to the resource owners.

In this post, we use Kinesis Data Streams with enhanced fan-out feature, empowering consumers with dedicated read throughput tailored to their applications. By default, Kinesis Data Streams offers shared read throughput of 2 MB/sec per shard across consumers, but with enhanced fan-out, each consumer can enjoy dedicated throughput of 2 MB/sec per shard. This flexibility allows you to seamlessly adapt Kinesis Data Streams to your specific requirements, choosing between enhanced fan-out for dedicated throughput or shared throughput according to your needs.

Solution overview

For our solution, we deploy Kinesis Data Streams in Account 1 and Lambda as the consumer in Account 2 to receive data from the data stream. The following diagram illustrates the high-level architecture.

Amazon KDS-Lambda cross acct solution architecture

The setup requires the following key elements:

Kinesis data stream in Account 1 and Lambda function in Account 2
Kinesis Data Streams resource policies in Account 1, allowing a cross-account Lambda execution role to perform operations on the Kinesis data stream
A Lambda execution role in Account 2 and an enhanced fan-out consumer resource policy in Account 1, allowing the cross-account Lambda execution role to perform operations on the Kinesis data stream

For the setup, you use three AWS CloudFormation templates to create the key resources:

CloudFormation template 1 creates the following key resources in Account 1:
- Kinesis data stream
- Kinesis data stream enhanced fan-out consumer
CloudFormation template 2 creates the following key resources in Account 2:
- Consumer Lambda function
- Consumer Lambda function execution role
CloudFormation template 3 creates the following resource in Account 2:
- Consumer Lambda function event source mapping

The solution supports single-Region deployment, and the CloudFormation templates must be deployed in the same Region across different AWS accounts. In this solution, we use Kinesis Data Streams enhanced fan-out, which is a best practice for deploying architectures requiring large throughput across multiple consumers. Complete the steps in the following sections to deploy this solution.

Prerequisites

You should have two AWS accounts and the required permissions to run a CloudFormation template to create the services mentioned in the solution architecture. You also need the AWS Command Line Interface (AWS CLI) installed, version 2.15 and above.

Launch CloudFormation template 1

Complete the following steps to launch the first CloudFormation template:

Sign in to the AWS Management Console as Account 1 and select the appropriate AWS Region.
Download and launch CloudFormation template 1 where you want to deploy your Kinesis data stream.
For LambdaConsumerAccountId, enter your Lambda consumer account ID and click submit. The CloudFormation template deployment will take a few minutes to complete.
When the stack is complete, on the AWS CloudFormation console, navigate to the stack Outputs tab and copy the values of following parameters:
- KinesisStreamArn
- KinesisStreamEFOConsumerArn
- KMSKeyArn

You will need these values in later steps.

Launch CloudFormation template 2

Complete the following steps to launch the second CloudFormation template:

Sign in to the console as Account 2 and select the appropriate Region.
Download and launch CloudFormation template 2 where you want to host the Lambda consumer.
Provide the following input parameters captured from the previous step:
- KinesisStreamArn
- KinesisStreamEFOConsumerArn
- KMSKeyArn

The CloudFormation template creates the following key resources:

Lambda consumer
Lambda execution role

The Lambda function’s execution role is an IAM role that grants the function permission to access AWS services and resources. Here, you create a Lambda execution role that has the required Kinesis Data Streams and Lambda invocation permissions.

The CloudFormation template deployment will take a few minutes to complete.

When the stack is complete, on the AWS CloudFormation console, navigate to the stack Outputs tab and copy the values of following parameters:
- KinesisStreamCreateResourcePolicyCommand
- KinesisStreamEFOConsumerCreateResourcePolicyCommand

Run the following AWS CLI commands in Account 1 using AWS CloudShell. We recommend using CloudShell because it will have the latest version of the AWS CLI and avoid any kind of failures.

KinesisStreamCreateResourcePolicyCommand – This creates the resource policy in Account 1 for Kinesis Data Stream. The following is a sample resource policy:

{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "StreamEFOReadStatementID",
"Effect": "Allow",
"Principal": {
"AWS": [
"arn:aws:iam::<AWS Lambda - Consumer account id>:role/kds-cross-account-stream-consumer-lambda-execution-role"
]
},
"Action": [
"kinesis:DescribeStreamSummary",
"kinesis:ListShards",
"kinesis:DescribeStream",
"kinesis:GetRecords",
"kinesis:GetShardIterator"
],
"Resource": "arn:aws:kinesis:<region id>:<Account 1 - Amazon KDS account id>:stream/kds-cross-account-stream"
}
]
}

KinesisStreamEFOConsumerCreateResourcePolicyCommand – This creates the resource policy for the enhanced fan-out consumer for the Kinesis data stream in Account 1. The following is a sample resource policy:

{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "ConsumerEFOReadStatementID",
"Effect": "Allow",
"Principal": {
"AWS": [
" arn:aws:iam::<AWS Lambda - Consumer account id>:role/kds-cross-account-stream-consumer-lambda-execution-role"
]
},
"Action": [
"kinesis:DescribeStreamConsumer",
"kinesis:SubscribeToShard"
],
"Resource": "arn:aws:kinesis:<region id>:<Account 1 - Amazon KDS account id>:stream/kds-cross-account-stream/consumer/kds-cross-account-stream-efo-consumer:1706616477"
}
]
}

You can also access this policy on the Kinesis Data Streams console, under Enhanced fan-out, Consumer name, and Consumer sharing resource-based policy.

Launch CloudFormation template 3

Now that you have created resource policies in Account 1 for the Kinesis data stream and its enhanced fan-out consumer, you can create Lambda event source mapping for the consumer Lambda function in Account 2. Complete the following steps:

Sign in to the console as Account 2 and select the appropriate Region.
Download and launch CloudFormation template 3 to update the stack you created using CloudFormation template 2.

The CloudFormation template creates the Lambda event source mapping.

Validate the solution

At this point, the deployment is complete. A Kinesis data stream is available to consume the messages and a Lambda function receives these messages in the destination account. To send sample messages to the data stream in Account 1, run the following AWS CLI command using CloudShell:

aws kinesis put-record --stream-name kds-cross-account-stream --data sampledatarecord --partition-key samplepartitionkey3 --region <region id>

The Lambda function in Account 2 is able to receive the messages, and you should be able to verify the same using Amazon CloudWatch logs:

On the CloudWatch console, choose Log groups in the navigation pane.
Locate the log group /aws/lambda/kds-cross-account-stream-efo-consumer.

Choose Search log group to view the relevant log messages. The following is an example message:

"Records": [
{
"kinesis": {
"kinesisSchemaVersion": "1.0",
"partitionKey": "samplepartitionkey3",
"sequenceNumber": "49648798411111169765201534322676841348246990356337393698",
"data": "sampledatarecord",
"approximateArrivalTimestamp": 1706623274.658
},

Clean up

It’s always a good practice to clean up all the resources you created as part of this post to avoid any additional cost.

To clean up your resources, delete the respective CloudFormation stacks from Accounts 1 and 2, and stop the producer from pushing events to the Kinesis data stream. This makes sure that you are not charged unnecessarily.

Summary

In this post, we demonstrated how to configure a cross-account Lambda integration with Kinesis Data Streams using AWS resource-based policies. This enables processing of data ingested into a stream within one AWS account through a Lambda function located in another account. To support customers who use a Kinesis data stream in their central account and have multiple consumers reading data from it, we have used the Kinesis Data Streams enhanced fan-out feature.

To get started, open the Kinesis Data Streams console or use the new API PutResourcePolicy to attach a resource policy to your data stream or consumer.

About the authors

Pratik Patel is Sr. Technical Account Manager and streaming analytics specialist. He works with AWS customers and provides ongoing support and technical guidance to help plan and build solutions using best practices and proactively keep customers’ AWS environments operationally healthy.

Amar is a Senior Solutions Architect at Amazon AWS in the UK. He works across power, utilities, manufacturing and automotive customers on strategic implementations, specializing in using AWS Streaming and advanced data analytics solutions, to drive optimal business outcomes.

Architectural patterns for real-time analytics using Amazon Kinesis Data Streams, part 1

2024-01-08 Raghavarao Sodabathina

Post Syndicated from Raghavarao Sodabathina original https://aws.amazon.com/blogs/big-data/architectural-patterns-for-real-time-analytics-using-amazon-kinesis-data-streams-part-1/

We’re living in the age of real-time data and insights, driven by low-latency data streaming applications. Today, everyone expects a personalized experience in any application, and organizations are constantly innovating to increase their speed of business operation and decision making. The volume of time-sensitive data produced is increasing rapidly, with different formats of data being introduced across new businesses and customer use cases. Therefore, it is critical for organizations to embrace a low-latency, scalable, and reliable data streaming infrastructure to deliver real-time business applications and better customer experiences.

This is the first post to a blog series that offers common architectural patterns in building real-time data streaming infrastructures using Kinesis Data Streams for a wide range of use cases. It aims to provide a framework to create low-latency streaming applications on the AWS Cloud using Amazon Kinesis Data Streams and AWS purpose-built data analytics services.

In this post, we will review the common architectural patterns of two use cases: Time Series Data Analysis and Event Driven Microservices. In the subsequent post in our series, we will explore the architectural patterns in building streaming pipelines for real-time BI dashboards, contact center agent, ledger data, personalized real-time recommendation, log analytics, IoT data, Change Data Capture, and real-time marketing data. All these architecture patterns are integrated with Amazon Kinesis Data Streams.

Real-time streaming with Kinesis Data Streams

Amazon Kinesis Data Streams is a cloud-native, serverless streaming data service that makes it easy to capture, process, and store real-time data at any scale. With Kinesis Data Streams, you can collect and process hundreds of gigabytes of data per second from hundreds of thousands of sources, allowing you to easily write applications that process information in real-time. The collected data is available in milliseconds to allow real-time analytics use cases, such as real-time dashboards, real-time anomaly detection, and dynamic pricing. By default, the data within the Kinesis Data Stream is stored for 24 hours with an option to increase the data retention to 365 days. If customers want to process the same data in real-time with multiple applications, then they can use the Enhanced Fan-Out (EFO) feature. Prior to this feature, every application consuming data from the stream shared the 2MB/second/shard output. By configuring stream consumers to use enhanced fan-out, each data consumer receives dedicated 2MB/second pipe of read throughput per shard to further reduce the latency in data retrieval.

For high availability and durability, Kinesis Data Streams achieves high durability by synchronously replicating the streamed data across three Availability Zones in an AWS Region and gives you the option to retain data for up to 365 days. For security, Kinesis Data Streams provide server-side encryption so you can meet strict data management requirements by encrypting your data at rest and Amazon Virtual Private Cloud (VPC) interface endpoints to keep traffic between your Amazon VPC and Kinesis Data Streams private.

Kinesis Data Streams has native integrations with other AWS services such as AWS Glue and Amazon EventBridge to build real-time streaming applications on AWS. Refer to Amazon Kinesis Data Streams integrations for additional details.

Modern data streaming architecture with Kinesis Data Streams

A modern streaming data architecture with Kinesis Data Streams can be designed as a stack of five logical layers; each layer is composed of multiple purpose-built components that address specific requirements, as illustrated in the following diagram:

The architecture consists of the following key components:

Streaming sources – Your source of streaming data includes data sources like clickstream data, sensors, social media, Internet of Things (IoT) devices, log files generated by using your web and mobile applications, and mobile devices that generate semi-structured and unstructured data as continuous streams at high velocity.
Stream ingestion – The stream ingestion layer is responsible for ingesting data into the stream storage layer. It provides the ability to collect data from tens of thousands of data sources and ingest in real time. You can use the Kinesis SDK for ingesting streaming data through APIs, the Kinesis Producer Library for building high-performance and long-running streaming producers, or a Kinesis agent for collecting a set of files and ingesting them into Kinesis Data Streams. In addition, you can use many pre-build integrations such as AWS Database Migration Service (AWS DMS), Amazon DynamoDB, and AWS IoT Core to ingest data in a no-code fashion. You can also ingest data from third-party platforms such as Apache Spark and Apache Kafka Connect
Stream storage – Kinesis Data Streams offer two modes to support the data throughput: On-Demand and Provisioned. On-Demand mode, now the default choice, can elastically scale to absorb variable throughputs, so that customers do not need to worry about capacity management and pay by data throughput. The On-Demand mode automatically scales up 2x the stream capacity over its historic maximum data ingestion to provide sufficient capacity for unexpected spikes in data ingestion. Alternatively, customers who want granular control over stream resources can use the Provisioned mode and proactively scale up and down the number of Shards to meet their throughput requirements. Additionally, Kinesis Data Streams can store streaming data up to 24 hours by default, but can extend to 7 days or 365 days depending upon use cases. Multiple applications can consume the same stream.
Stream processing – The stream processing layer is responsible for transforming data into a consumable state through data validation, cleanup, normalization, transformation, and enrichment. The streaming records are read in the order they are produced, allowing for real-time analytics, building event-driven applications or streaming ETL (extract, transform, and load). You can use Amazon Managed Service for Apache Flink for complex stream data processing, AWS Lambda for stateless stream data processing, and AWS Glue & Amazon EMR for near-real-time compute. You can also build customized consumer applications with Kinesis Consumer Library, which will take care of many complex tasks associated with distributed computing.
Destination – The destination layer is like a purpose-built destination depending on your use case. You can stream data directly to Amazon Redshift for data warehousing and Amazon EventBridge for building event-driven applications. You can also use Amazon Kinesis Data Firehose for streaming integration where you can light stream processing with AWS Lambda, and then deliver processed streaming into destinations like Amazon S3 data lake, OpenSearch Service for operational analytics, a Redshift data warehouse, No-SQL databases like Amazon DynamoDB, and relational databases like Amazon RDS to consume real-time streams into business applications. The destination can be an event-driven application for real-time dashboards, automatic decisions based on processed streaming data, real-time altering, and more.

Real-time analytics architecture for time series

Time series data is a sequence of data points recorded over a time interval for measuring events that change over time. Examples are stock prices over time, webpage clickstreams, and device logs over time. Customers can use time series data to monitor changes over time, so that they can detect anomalies, identify patterns, and analyze how certain variables are influenced over time. Time series data is typically generated from multiple sources in high volumes, and it needs to be cost-effectively collected in near real time.

Typically, there are three primary goals that customers want to achieve in processing time-series data:

Gain insights real-time into system performance and detect anomalies
Understand end-user behavior to track trends and query/build visualizations from these insights
Have a durable storage solution to ingest and store both archival and frequently accessed data.

With Kinesis Data Streams, customers can continuously capture terabytes of time series data from thousands of sources for cleaning, enrichment, storage, analysis, and visualization.

The following architecture pattern illustrates how real time analytics can be achieved for Time Series data with Kinesis Data Streams:

Build a serverless streaming data pipeline for time series data

The workflow steps are as follows:

Data Ingestion & Storage – Kinesis Data Streams can continuously capture and store terabytes of data from thousands of sources.
Stream Processing – An application created with Amazon Managed Service for Apache Flink can read the records from the data stream to detect and clean any errors in the time series data and enrich the data with specific metadata to optimize operational analytics. Using a data stream in the middle provides the advantage of using the time series data in other processes and solutions at the same time. A Lambda function is then invoked with these events, and can perform time series calculations in memory.
Destinations – After cleaning and enrichment, the processed time series data can be streamed to Amazon Timestream database for real-time dashboarding and analysis, or stored in databases such as DynamoDB for end-user query. The raw data can be streamed to Amazon S3 for archiving.
Visualization & Gain insights – Customers can query, visualize, and create alerts using Amazon Managed Service for Grafana. Grafana supports data sources that are storage backends for time series data. To access your data from Timestream, you need to install the Timestream plugin for Grafana. End-users can query data from the DynamoDB table with Amazon API Gateway acting as a proxy.

Refer to Near Real-Time Processing with Amazon Kinesis, Amazon Timestream, and Grafana showcasing a serverless streaming pipeline to process and store device telemetry IoT data into a time series optimized data store such as Amazon Timestream.

Enriching & replaying data in real time for event-sourcing microservices

Microservices are an architectural and organizational approach to software development where software is composed of small independent services that communicate over well-defined APIs. When building event-driven microservices, customers want to achieve 1. high scalability to handle the volume of incoming events and 2. reliability of event processing and maintain system functionality in the face of failures.

Customers utilize microservice architecture patterns to accelerate innovation and time-to-market for new features, because it makes applications easier to scale and faster to develop. However, it is challenging to enrich and replay the data in a network call to another microservice because it can impact the reliability of the application and make it difficult to debug and trace errors. To solve this problem, event-sourcing is an effective design pattern that centralizes historic records of all state changes for enrichment and replay, and decouples read from write workloads. Customers can use Kinesis Data Streams as the centralized event store for event-sourcing microservices, because KDS can 1/ handle gigabytes of data throughput per second per stream and stream the data in milliseconds, to meet the requirement on high scalability and near real-time latency, 2/ integrate with Flink and S3 for data enrichment and achieving while being completely decoupled from the microservices, and 3/ allow retry and asynchronous read in a later time, because KDS retains the data record for a default of 24 hours, and optionally up to 365 days.

The following architectural pattern is a generic illustration of how Kinesis Data Streams can be used for Event-Sourcing Microservices:

The steps in the workflow are as follows:

Data Ingestion and Storage – You can aggregate the input from your microservices to your Kinesis Data Streams for storage.
Stream processing – Apache Flink Stateful Functions simplifies building distributed stateful event-driven applications. It can receive the events from an input Kinesis data stream and route the resulting stream to an output data stream. You can create a stateful functions cluster with Apache Flink based on your application business logic.
State snapshot in Amazon S3 – You can store the state snapshot in Amazon S3 for tracking.
Output streams – The output streams can be consumed through Lambda remote functions through HTTP/gRPC protocol through API Gateway.
Lambda remote functions – Lambda functions can act as microservices for various application and business logic to serve business applications and mobile apps.

To learn how other customers built their event-based microservices with Kinesis Data Streams, refer to the following:

Key considerations and best practices

The following are considerations and best practices to keep in mind:

Data discovery should be your first step in building modern data streaming applications. You must define the business value and then identify your streaming data sources and user personas to achieve the desired business outcomes.
Choose your streaming data ingestion tool based on your steaming data source. For example, you can use the Kinesis SDK for ingesting streaming data through APIs, the Kinesis Producer Library for building high-performance and long-running streaming producers, a Kinesis agent for collecting a set of files and ingesting them into Kinesis Data Streams, AWS DMS for CDC streaming use cases, and AWS IoT Core for ingesting IoT device data into Kinesis Data Streams. You can ingest streaming data directly into Amazon Redshift to build low-latency streaming applications. You can also use third-party libraries like Apache Spark and Apache Kafka to ingest streaming data into Kinesis Data Streams.
You need to choose your streaming data processing services based on your specific use case and business requirements. For example, you can use Amazon Kinesis Managed Service for Apache Flink for advanced streaming use cases with multiple streaming destinations and complex stateful stream processing or if you want to monitor business metrics in real time (such as every hour). Lambda is good for event-based and stateless processing. You can use Amazon EMR for streaming data processing to use your favorite open source big data frameworks. AWS Glue is good for near-real-time streaming data processing for use cases such as streaming ETL.
Kinesis Data Streams on-demand mode charges by usage and automatically scales up resource capacity, so it’s good for spiky streaming workloads and hands-free maintenance. Provisioned mode charges by capacity and requires proactive capacity management, so it’s good for predictable streaming workloads.
You can use the Kinesis Shared Calculator to calculate the number of shards needed for provisioned mode. You don’t need to be concerned about shards with on-demand mode.
When granting permissions, you decide who is getting what permissions to which Kinesis Data Streams resources. You enable specific actions that you want to allow on those resources. Therefore, you should grant only the permissions that are required to perform a task. You can also encrypt the data at rest by using a KMS customer managed key (CMK).
You can update the retention period via the Kinesis Data Streams console or by using the IncreaseStreamRetentionPeriod and the DecreaseStreamRetentionPeriod operations based on your specific use cases.
Kinesis Data Streams supports resharding. The recommended API for this function is UpdateShardCount, which allows you to modify the number of shards in your stream to adapt to changes in the rate of data flow through the stream. The resharding APIs (Split and Merge) are typically used to handle hot shards.

Conclusion

This post demonstrated various architectural patterns for building low-latency streaming applications with Kinesis Data Streams. You can build your own low-latency steaming applications with Kinesis Data Streams using the information in this post.

For detailed architectural patterns, refer to the following resources:

If you want to build a data vision and strategy, check out the AWS Data-Driven Everything (D2E) program.

About the Authors

Raghavarao Sodabathina is a Principal Solutions Architect at AWS, focusing on Data Analytics, AI/ML, and cloud security. He engages with customers to create innovative solutions that address customer business problems and to accelerate the adoption of AWS services. In his spare time, Raghavarao enjoys spending time with his family, reading books, and watching movies.

Hang Zuo is a Senior Product Manager on the Amazon Kinesis Data Streams team at Amazon Web Services. He is passionate about developing intuitive product experiences that solve complex customer problems and enable customers to achieve their business goals.

Shwetha Radhakrishnan is a Solutions Architect for AWS with a focus in Data Analytics. She has been building solutions that drive cloud adoption and help organizations make data-driven decisions within the public sector. Outside of work, she loves dancing, spending time with friends and family, and traveling.

Brittany Ly is a Solutions Architect at AWS. She is focused on helping enterprise customers with their cloud adoption and modernization journey and has an interest in the security and analytics field. Outside of work, she loves to spend time with her dog and play pickleball.

Run Kinesis Agent on Amazon ECS

2024-01-02 Buddhike de Silva

Post Syndicated from Buddhike de Silva original https://aws.amazon.com/blogs/big-data/run-kinesis-agent-on-amazon-ecs/

Kinesis Agent is a standalone Java software application that offers a straightforward way to collect and send data to Amazon Kinesis Data Streams and Amazon Kinesis Data Firehose. The agent continuously monitors a set of files and sends new data to the desired destination. The agent handles file rotation, checkpointing, and retry upon failures. It delivers all of your data in a reliable, timely, and simple manner. It also emits Amazon CloudWatch metrics to help you better monitor and troubleshoot the streaming process.

This post describes the steps to send data from a containerized application to Kinesis Data Firehose using Kinesis Agent. More specifically, we show how to run Kinesis Agent as a sidecar container for an application running in Amazon Elastic Container Service (Amazon ECS). After the data is in Kinesis Data Firehose, it can be sent to any supported destination, such as Amazon Simple Storage Service (Amazon S3).

In order to present the key points required for this setup, we assume that you are familiar with Amazon ECS and working with containers. We also avoid the implementation details and packaging process of our test data generation application, referred to as the producer.

Solution overview

As depicted in the following figure, we configure a Kinesis Agent container as a sidecar that can read files created by the producer container. In this instance, the producer and Kinesis Agent containers share data via a bind mount in Amazon ECS.

Solution design diagram

Prerequisites

You should satisfy the following prerequisites for the successful completion of this task:

Familiarity working with containers and Amazon ECS
Docker Desktop installed
The AWS Command Line Interface (AWS CLI) installed
An ECS cluster
An Amazon Elastic Container Registry (Amazon ECR) repository to store the Kinesis Agent container image

With these prerequisites in place, you can begin next step to package a Kinesis Agent and your desired agent configuration as a container in your local development machine.

Create a Kinesis Agent configuration file

We use the Kinesis Agent configuration file to configure the source and destination, among other data transfer settings. The following code uses the minimal configuration required to read the contents of files matching /var/log/producer/*.log and publish them to a Kinesis Data Firehose delivery stream called kinesis-agent-demo:

{
    "firehose.endpoint": "firehose.ap-southeast-2.amazonaws.com",
    "flows": [
        {
            "deliveryStream": "kinesis-agent-demo",
            "filePattern": "/var/log/producer/*.log"
        }
    ]
}

Create a container image for Kinesis Agent

To deploy Kinesis Agent as a sidecar in Amazon ECS, you first have to package it as a container image. The container must have Kinesis Agent, which and find binaries, and the Kinesis Agent configuration file that you prepared earlier. Its entry point must be configured using the start-aws-kinesis-agent script. This command is installed when you run the yum install aws-kinesis-agent step. The resulting Dockerfile should look as follows:

FROM amazonlinux

RUN yum install -y aws-kinesis-agent which findutils
COPY agent.json /etc/aws-kinesis/agent.json

CMD ["start-aws-kinesis-agent"]

Run the docker build command to build this container:

docker build -t kinesis-agent .

After the image is built, it should be pushed to a container registry like Amazon ECR so that you can reference it in the next section.

Create an ECS task definition with Kinesis Agent and the application container

Now that you have Kinesis Agent packaged as a container image, you can use it in your ECS task definitions to run as sidecar. To do that, you create an ECS task definition with your application container (called producer) and Kinesis Agent container. All containers in a task definition are scheduled on the same container host and therefore can share resources such as bind mounts.

In the following sample container definition, we use a bind mount called logs_dir to share a directory between the producer container and kinesis-agent container.

You can use the following template as a starting point, but be sure to change taskRoleArn and executionRoleArn to valid IAM roles in your AWS account. In this instance, the IAM role used for taskRoleArn must have write permissions to Kinesis Data Firehose that you specified earlier in the agent.json file. Additionally, make sure that the ECR image paths and awslogs-region are modified as per your AWS account.

{
    "family": "kinesis-agent-demo",
    "taskRoleArn": "arn:aws:iam::111111111:role/kinesis-agent-demo-task-role",
    "executionRoleArn": "arn:aws:iam::111111111:role/kinesis-agent-test",
    "networkMode": "awsvpc",
    "containerDefinitions": [
        {
            "name": "producer",
            "image": "111111111.dkr.ecr.ap-southeast-2.amazonaws.com/producer:latest",
            "cpu": 1024,
            "memory": 2048,
            "essential": true,
            "command": [
                "-output",
                "/var/log/producer/test.log"
            ],
            "mountPoints": [
                {
                    "sourceVolume": "logs_dir",
                    "containerPath": "/var/log/producer",
                    "readOnly": false
                }
            ],
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-create-group": "true",
                    "awslogs-group": "producer",
                    "awslogs-stream-prefix": "producer",
                    "awslogs-region": "ap-southeast-2"
                }
            }
        },
        {
            "name": "kinesis-agent",
            "image": "111111111.dkr.ecr.ap-southeast-2.amazonaws.com/kinesis-agent:latest",
            "cpu": 1024,
            "memory": 2048,
            "essential": true,
            "mountPoints": [
                {
                    "sourceVolume": "logs_dir",
                    "containerPath": "/var/log/producer",
                    "readOnly": true
                }
            ],
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-create-group": "true",
                    "awslogs-group": "kinesis-agent",
                    "awslogs-stream-prefix": "kinesis-agent",
                    "awslogs-region": "ap-southeast-2"
                }
            }
        }
    ],
    "volumes": [
        {
            "name": "logs_dir"
        }
    ],
    "requiresCompatibilities": [
        "FARGATE"
    ],
    "cpu": "2048",
    "memory": "4096"
}

aws ecs register-task-definition --cli-input-json file://./task-definition.json

Run a new ECS task

Finally, you can run a new ECS task using the task definition you just created using the aws ecs run-task command. When the task is started, you should be able to see two containers running under that task on the Amazon ECS console.

Amazon ECS console screenshot

Conclusion

This post showed how straightforward it is to run Kinesis Agent in a containerized environment. Although we used Amazon ECS as our container orchestration service in this post, you can use a Kinesis Agent container in other environments such as Amazon Elastic Kubernetes Service (Amazon EKS).

To learn more about using Kinesis Agent, refer to Writing to Amazon Kinesis Data Streams Using Kinesis Agent. For more information about Amazon ECS, refer to the Amazon ECS Developer Guide.

About the Author

Buddhike de Silva is a Senior Specialist Solutions Architect at Amazon Web Services. Buddhike helps customers run large scale streaming analytics workloads on AWS and make the best out of their cloud journey.

Amazon Kinesis Data Streams: celebrating a decade of real-time data innovation

2023-11-14 Roy Wang

Post Syndicated from Roy Wang original https://aws.amazon.com/blogs/big-data/amazon-kinesis-data-streams-celebrating-a-decade-of-real-time-data-innovation/

Data is a key strategic asset for every organization, and every company is a data business at its core. However, in many organizations, data is typically spread across a number of different systems such as software as a service (SaaS) applications, operational databases, and data warehouses. Such data silos make it difficult to get unified views of the data in an organization and act in real time to derive the most value.

Ten years ago, we launched Amazon Kinesis Data Streams, the first cloud-native serverless streaming data service, to serve as the backbone for companies, to move data across system boundaries, breaking data silos. With data streaming, you can power data lakes running on Amazon Simple Storage Service (Amazon S3), enrich customer experiences via personalization, improve operational efficiency with predictive maintenance of machinery in your factories, and achieve better insights with more accurate machine learning (ML) models. Amazon Kinesis Data Streams is a foundational data strategy pillar for tens of thousands of customers. As streams of raw data come together, they unlock capabilities to continuously transform, enrich, and query data in real time via seamless integration with stream processing engines such as Amazon Managed Service for Apache Flink.

As an example, the National Hockey League (NHL) reimagined the fan experience by streaming live NHL EDGE game data and stats to offer hockey fans valuable insights to keep fans at the edge of their seats. NHL EDGE technology in the puck and players’ sweaters (jerseys) generate thousands of data points every second for the NHL, which can be analyzed by AWS to predict likely outcomes for key events like face-offs. To process and analyze thousands of signals, the NHL built a real-time streaming data foundation with Kinesis Data Streams and Amazon Managed Service for Apache Flink to stream, prepare, and feed data into ML models, helping inform face-off predictions in seconds and expanding new ways to engage viewers.

Building on such streaming data foundations, many customers are currently thinking about how to deliver transformative new products and services with generative AI. Streaming allows companies to connect the data available within data stores to large language models (LLMs) securely and in real time. Although LLMs are capable of working with billions of parameters, in order to deliver an engaging experience that is tailored to a company’s customers, LLMs require personalization data for the company’s users and proprietary knowledge stores within the company’s data stores. A data strategy that incorporates streaming is necessary to deliver personalization and proprietary data that is available for querying in real time.

Customers with real-time streaming data strategy are at the cutting edge of providing innovative products with generative AI. One customer adopted Kinesis Data Streams for their data strategy, and they stream billions of events from their digital products to derive real-time insights. With a combination of low-latency data streaming and analytics, they are able to understand and personalize the user experience via a seamlessly integrated, self-reliant system for experimentation and automated feedback. Earlier this year, building on their already strong data foundation, they launched an innovative digital media generative AI product. The same data foundation built on Kinesis Data Streams is used to continuously analyze how users interact with the generated content and helps the product team fine-tune the application.

“Real-time streaming data technologies are essential for digital transformation. These services help customers bring data to their applications and models, making them smarter. Real-time data gives companies an advantage in data-driven decisions, predictions, and insights by using the data at the very moment it is generated, providing an unparalleled edge in a world where timing is the key to success. Bring the data in once, use it across your organization, and act before the value of that data diminishes.”

– Mindy Ferguson, VP of AWS Streaming and Messaging.

As we celebrate the tenth anniversary of Kinesis Data Streams, customers have shared four key reasons they continue to value this revolutionary service. They love how they can easily stream data with no underlying servers to provision or manage, operate at a massive scale with consistent performance, achieve high resiliency and durability, and benefit from broad integration with myriad sources and sinks to ingest and process data respectively.

Ease of use

Getting started with Kinesis Data Streams is straightforward: developers can create a data stream with a few clicks on the Kinesis Data Streams console or with a single API call. Changing the size or configuration is also a single API call, and each data stream comes with a default 24-hour data retention period. Developers don’t have to worry about clusters, version upgrades, or storage capacity planning. They just turn on a data stream and start ingesting data.

The needs of our customers have evolved in the past 10 years. As more events get captured and streamed, customers want their data streams to scale elastically without any operational overhead. In response, we launched On-Demand streams in 2021 to provide a simple and automatic scaling experience. With On-Demand streams, you let the service handle scaling up a stream’s capacity proactively, and you’re only charged for the actual data ingested, retrieved, and stored. As our customers continued to ask for more capabilities, we increased the ingestion throughput limit of each On-Demand stream from 200MB/s to 1GB/s in March 2023, and then to 2GB/s in October 2023, to accommodate higher throughput workloads. To continue innovating to be the easiest streaming data service to use, we actively listen to our customer use cases.

Canva is an online design and visual communication platform. As it has rapidly grown from 30 million to 135 million monthly users, it has built a streaming data platform at scale that is effortless to operate for driving product innovations and personalizing the user experience.

“Amazon Kinesis Data Streams and AWS Lambda are used throughout Canva’s logging platform, ingesting and processing over 60 billion log events per day. The combination of Kinesis Data Streams and Lambda has abstracted plenty of work that’s often required in managing a massive data pipeline, such as deploying and managing a fleet of servers, whilst also providing a highly scalable and reliable service. It has allowed us to focus on delivering a world-class product by building highly requested features rather than spending time on operational work.”

– Phoebe Zhou, Software Engineer at Canva.

Operate at massive scale with consistent performance

A fundamental requirement of a streaming data strategy is ingesting and processing large volumes of data with low latency. Kinesis Data Streams processes trillions of records per day across tens of thousands of customers. Customers run more than 3.5 million unique streams and process over 45 PB of data per day. Our largest customers ingest more than 15 GB per second of real-time data with individual streams. That’s equivalent to streaming multiple data points for every person on earth, every second! Even at this scale, all our customers still retrieve data within milliseconds of availability.

Customers also want to process the same data with multiple applications, with each deriving a different value, without worrying about one application impacting the read throughput of another. Enhanced Fan-out offers dedicated read throughput and low latency for each data consumer. This has enabled enterprise platform teams to provide real-time data to more teams and applications.

VMware Carbon Black uses Kinesis Data Streams to ingest petabytes of data every day to secure millions of customer endpoints. The team focuses on its expertise while AWS manages data streaming to meet growing customer traffic and needs in real time.

“When an individual customer’s data increases or decreases, we can use the elasticity of Amazon Kinesis Data Streams to scale compute up or down to process data reliably while effectively managing our cost. This is why Kinesis Data Streams is a good fit. The biggest advantage is the managed nature of our solution on AWS. This has shaped our architecture and helped us shift complexity elsewhere.”

– Stoyan Dimkov, Staff Engineer and Software Architect at VMware Carbon Black.

Learn more about the case study.

Provide resiliency and durability for data streaming

With burgeoning data, customers want more flexibility in processing and reprocessing data. For example, if an application that is consuming data goes offline for a period, teams want to ensure that they resume processing at a later time without data loss. Kinesis Data Streams provides a default 24-hour retention period, enabling you to select a specific timestamp from which to start processing records. With the extended retention feature, you can configure the data retention period to be up to 7 days.

Some industries like financial services and healthcare have stricter compliance requirements, so customers asked for even longer data retention periods to support these requirements. Therefore, we followed up with long-term storage that supports data retention for up to 1 year. Now, thousands of Kinesis Data Streams customers use these features to make their streaming applications more resilient and durable.

Mercado Libre, a leading ecommerce and payments platform in Latin America, relies on Kinesis Data Streams to power its streaming data strategy around payment processing, customer experience, and operations.

“With Amazon Kinesis Data Streams at the core, we process approximately 70 billion daily messages distributed across thousands of data producers. By leveraging Kinesis Data Streams and Amazon DynamoDB Streams, we’ve embraced an event-driven architecture and are able to swiftly respond to data changes.”

– Joaquin Fernandez, Senior Software Expert at Mercado Libre.

Access your data no matter where it lives

Our customers use a wide variety of tools and applications, and an organization’s data often resides in many places. Therefore, the ability to easily integrate data across an organization is crucial to derive timely insights. Developers use the Kinesis Producer Library, Kinesis Client Library, and AWS SDK to quickly build custom data producer and data consumer applications. Customers have expanded their data producers ranging from microservices to smart TVs and even cars. We have over 40 integrations with AWS services and third-party applications like Adobe Experience Platform and Databricks. As detailed in our whitepaper on building a modern data streaming architecture on AWS, Kinesis Data Streams serves as the backbone to serverless and real-time use cases such as personalization, real-time insights, Internet of Things (IoT), and event-driven architecture. Our recent integration with Amazon Redshift enables you to ingest hundreds of megabytes of data from Kinesis Data Streams into data warehouses in seconds. To learn more about how to use this integration to detect fraud in near-real time, refer to Near-real-time fraud detection using Amazon Redshift Streaming Ingestion with Amazon Kinesis Data Streams and Amazon Redshift ML.

Another integration launched in 2023 is with Amazon Monitron to power predictive maintenance management. You can now stream measurement data and the corresponding inference results to Kinesis Data Streams, coordinate predictive maintenance, and build an IoT data lake. For more details, refer to Generate actionable insights for predictive maintenance management with Amazon Monitron and Amazon Kinesis.

Next, let’s go back to the NHL use case where they combine IoT, data streaming, and machine learning.

The NHL Edge IQ powered by AWS is helping bring fans closer to the action with advanced analytics and new ML stats such as Face-off Probability and Opportunity Analysis.

“We use Amazon Kinesis Data Streams to process NHL EDGE data on puck and Player positions, face-off location, and the current game situation to decouple data producers from consuming applications. Amazon Managed Service for Apache Flink is used to run Flink applications and consumes data from Kinesis Data Streams to call the prediction model in Amazon SageMaker to deliver the real-time Face-off Probability metric. The probability results are also stored in Amazon S3 to continuously retrain the model in SageMaker. The success of this project led us to build the next metric, Opportunity Analysis, which delivers over 25 insights into the quality of the scoring opportunity presented by each shot on goal. Kinesis Data Streams and Amazon Managed Service for Apache Flink applications were critical to making live, in-game predictions, enabling the system to perform opportunity analysis calculations for up to 16 live NHL games simultaneously.”

– Eric Schneider, SVP, Software Engineering at National Hockey League.

Learn more about the case study.

The future of data is real time

The fusion of real-time data streaming and generative AI promises to be the cornerstone of our digitally connected world. Generative AI, empowered by a constant influx of real-time information from IoT devices, sensors, social media, and beyond, is becoming ubiquitous. From autonomous vehicles navigating dynamically changing traffic conditions to smart cities optimizing energy consumption based on real-time demand, the combination of AI and real-time data will underpin efficiency and innovation across industries. Ubiquitous, adaptive, and deeply integrated into our lives, these AI-driven applications will enhance convenience and address critical challenges such as climate change, healthcare, and disaster response by using the wealth of real-time insights at their disposal. With Kinesis Data Streams, organizations can build a solid data foundation, positioning you to quickly adopt new technologies and unlock new opportunities sooner—which we anticipate will be enormous.

Learn more about what our customers are doing with data streaming. If you would like a quick exploration of Kinesis Data Streams concepts and use cases, check out our Amazon Kinesis Data Streams 101 playlist. To get started with building your data streams, visit the Amazon Kinesis Data Streams Developer Guide.

About the author

Roy (KDS) Wang is a Senior Product Manager with Amazon Kinesis Data Streams. He is passionate about learning from and collaborating with customers to help organizations run faster and smarter. Outside of work, Roy strives to be a good dad to his new son and builds plastic model kits.

Processing large records with Amazon Kinesis Data Streams

2023-10-16 Masudur Rahaman Sayem

Post Syndicated from Masudur Rahaman Sayem original https://aws.amazon.com/blogs/big-data/processing-large-records-with-amazon-kinesis-data-streams/

In today’s digital era, data is abundant and constantly flowing. Businesses across industries are seeking ways to harness this wealth of information to gain valuable insights and make real-time decisions. To meet this need, AWS offers Amazon Kinesis Data Streams, a powerful and scalable real-time data streaming service. With Kinesis Data Streams, you can effortlessly collect, process, and analyze streaming data in real time at any scale. This service seamlessly integrates into your data architecture, allowing you to tap into the full potential of your data for informed decision-making.

Data streaming technologies like Kinesis Data Streams are designed to efficiently process and manage continuous streams of data in real time at large scale. The individual pieces of data within these streams are often referred to as records. In scenarios like large file processing or performing image, audio, or video analytics, your record may exceed 1 MB. You may struggle to ingest such a large record with Kinesis Data Streams because, as of this writing, the service has a 1 MB upper limit for maximum data record size.

In this post, we show you some different options for handling large records within Kinesis Data Streams and the benefits and disadvantages of each approach. We provide some sample code for each option to help you get started with any of these approaches with your own workloads.

Understanding the default behavior of Kinesis Data Streams

You can send records to Kinesis Data Streams using the PutRecord or PutRecords API calls. These APIs include a mandatory field known as PartitionKey, where you must provide a specific value. This partition key is used by the service to map records with the same partition keys to the same shard to ensure ordering and locality for consumption. Locality means that you want the same consumer to process all records for a given partition key. This helps ensure that data with the same partition key stays together within the same shard, maintaining data order.

Each shard, which holds your data, can handle writing up to 1 MB per second. Let’s consider a scenario where you define a partition key and attempt to send a data record that exceeds 1 MB in size. Based on the explanation so far, the service will reject this request because the record size is over 1 MB. To help you understand better, we experimented by trying to send a record of 1.5 MB to a stream, and the outcome was the following exception message:

import json
import boto3
client = boto3.client('kinesis', region_name='ap-southeast-2')

def lambda_handler(event, context):
    try:
        response = client.put_record(
            StreamName='test',
            Data=b'Sample 1 MB....',
            PartitionKey='string'
            #StreamARN='string'
        )
    
    except Exception as e:
        print (e)

START RequestId: 84b3ab0c-3f30-4267-aec1-549c2d59dfdb Version: $LATEST An error occurred (ValidationException) when calling the PutRecord operation: 1 validation error detected: Value at 'data' failed to satisfy constraint: Member must have length less than or equal to 1048576 END RequestId: 84b3ab0c-3f30-4267-aec1-549c2d59dfdb

Strategies for handling large records

Now that we understand the behavior of the PutRecord and PutRecords APIs, let’s discuss strategies you can use to overcome this situation. One thing to keep in mind is that there is no single best solution; in the following sections, we discuss some of the approaches that you can evaluate based on your use case:

Store large records in Amazon Simple Storage Service (Amazon S3) with a reference in Kinesis Data Streams
Split one large record into multiple records
Compress your large records

Let’s discuss these points one by one.

Store large records in Amazon S3 with a reference in Kinesis Data Streams

A useful approach for storing large records involves utilizing an alternative storage solution while employing a reference within Kinesis Data Streams. In this context, Amazon S3 stands out as an excellent choice due to its exceptional durability and cost-effectiveness. The procedure involves uploading the record as an object to an S3 bucket and subsequently writing a reference entry in Kinesis Data Streams. This entry incorporates an attribute that serves as a pointer, indicating the location of the object within Amazon S3.

With this approach, you can generate a pre-signed URL associated with the S3 object’s location. This link can be shared with the requester, offering them direct access to the object without the need for intermediary server-side data transfers.

The following diagram illustrates the architecture of this solution.

The following is the sample code to write data to Kinesis Data Streams using this approach:

import json
import boto3
import random

def lambda_handler(event, context):
    try:
        s3 = boto3.client('s3', region_name='ap-southeast-2')
        kds = boto3.client('kinesis', region_name='ap-southeast-2')
        expiration=3600
        pk=str(random.randint(100,100000000))
        bucket_name = 'MY_BUCKET'
        object_key = 'air/' + pk + '.txt'
        file_content = b'LARGE OBJECT'
        response = s3.put_object(Bucket=bucket_name, Key=object_key, Body=file_content)
        presigned_url = s3.generate_presigned_url(
            'get_object',
            Params={'Bucket': bucket_name, 'Key': object_key},
            ExpiresIn=expiration
        )
        
        kdata = {'message': presigned_url}
        response = kds.put_record(
            StreamName='test',
            Data=json.dumps(kdata),
            PartitionKey=pk
        )
        print (response)
    except Exception as e:
        print (e)

If you are using an AWS Lambda consumer to process this data, you can now decode the record to get the S3 pre-signed URL to efficiently retrieve the object from Amazon S3. Then you can implement your business logic to effectively process the data. The following is sample code for reference:

import json
import base64
import json

def lambda_handler(event, context):
    item = None
    decoded_record_data = [base64.b64decode(record['kinesis']['data']).decode().replace('\n','') for record in event['Records']]
    deserialized_data = [json.loads(decoded_record) for decoded_record in decoded_record_data]
    
    
    for item in deserialized_data:
        LOB=(item['message'])
        #process LOB implementing your business logic

An inherent benefit of adopting this technique is the capability to store data in Amazon S3, accommodating an extensive range of sizes per individual object. This method helps you reduce the costs of using Kinesis Data Streams because it uses less storage space and requires fewer read and write throughput for item access. This optimization is achieved by storing just the URL within Kinesis Data Streams. However, it’s important to acknowledge that accessing the sizable object necessitates an additional call to Amazon S3, thereby introducing higher latency for clients as they manage the additional request.

Split one large record into multiple records

Splitting large records into smaller ones in Kinesis Data Streams brings advantages like faster processing, improved throughput, efficient resource use, and more straightforward error handling. Let’s say you have a large record that you want to split into smaller chunks before sending them to a Kinesis data stream. First, you need to set up a Kinesis producer. Suppose you have a large record as a string. You can split it into smaller chunks of a predefined size. For this example, let’s say you’re splitting the record into chunks of 100 characters each. After you split that, loop through the record chunks and send each chunk as a separate message to a Kinesis data stream. The following is the sample code:

import boto3
kinesis = boto3.client('kinesis', region_name='ap-southeast-2')  

def split_record(record, chunk_size):
    chunks = [record[i:i + chunk_size] for i in range(0, len(record), chunk_size)]
    return chunks

def send_to_kinesis(stream_name, record):
    response = kinesis.put_record(
        StreamName=stream_name,
        Data=record,
        PartitionKey= '100'
    )
    return response

def main():
    stream_name = 'test'  
    large_record = 'Your large record'  # Replace with your actual record
    chunk_size = 100  

    record_chunks = split_record(large_record, chunk_size)

    for chunk in record_chunks:
        response = send_to_kinesis(stream_name, chunk)
        print(f"Record sent: {response['SequenceNumber']}")

if __name__ == "__main__":
    main()

Ensure that all chunks of a given message are directed to a single partition, thereby guaranteeing the preservation of their order. In the final chunk, include metadata within the header indicating the conclusion of the message during production. This enables consumers to identify the ultimate chunk and facilitates seamless message reconstruction. The drawback of this method is that it adds complexity to the client-side tasks of dividing and putting back together the different parts. Therefore, these functions need thorough testing to prevent any loss of data.

Compress your large records

Applying data compression prior to transmitting it to Kinesis Data Streams has numerous advantages. This approach not only reduces the data’s size, enabling swifter travel and more efficient utilization of network resources, but also leads to cost savings in terms of storage expenses while optimizing overall resource consumption. Additionally, this practice simplifies storage and data retention. By using compression algorithms such as GZIP, Snappy, or LZ4, you can achieve substantial reduction in the size of large records. Compression brings the benefit of simplicity because it’s implemented seamlessly without requiring the caller to make changes to the item or use extra AWS services to support storage. However, compression introduces additional CPU overhead and latency on the producer side, and its impact on the compression ratio and efficiency can vary depending on the data type and format. Also, compression can enhance consumer throughput at the expense of some decompression overhead.

Conclusion

For real-time data streaming use cases, it’s essential to carefully consider the handling of large records when using Kinesis Data Streams. In this post, we discussed the challenges associated with managing large records and explored strategies such as utilizing Amazon S3 references, record splitting, and compression. Each approach has its own set of benefits and drawbacks, so it’s crucial to evaluate the nature of your data and the tasks you need to perform. Select the most suitable approach based on your data’s characteristics and your processing task requirements.

We encourage you to try out the approaches discussed in this post and share your thoughts in the comments section.

About the author

Masudur Rahaman Sayem is a Streaming Data Architect at AWS. He works with AWS customers globally to design and build data streaming architectures to solve real-world business problems. He specializes in optimizing solutions that use streaming data services and NoSQL. Sayem is very passionate about distributed computing.

Non-JSON ingestion using Amazon Kinesis Data Streams, Amazon MSK, and Amazon Redshift Streaming Ingestion

2023-10-02 M Mehrtens

Post Syndicated from M Mehrtens original https://aws.amazon.com/blogs/big-data/non-json-ingestion-using-amazon-kinesis-data-streams-amazon-msk-and-amazon-redshift-streaming-ingestion/

Organizations are grappling with the ever-expanding spectrum of data formats in today’s data-driven landscape. From Avro’s binary serialization to the efficient and compact structure of Protobuf, the landscape of data formats has expanded far beyond the traditional realms of CSV and JSON. As organizations strive to derive insights from these diverse data streams, the challenge lies in seamlessly integrating them into a scalable solution.

In this post, we dive into Amazon Redshift Streaming Ingestion to ingest, process, and analyze non-JSON data formats. Amazon Redshift Streaming Ingestion allows you to connect to Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka (Amazon MSK) directly through materialized views, in real time and without the complexity associated with staging the data in Amazon Simple Storage Service (Amazon S3) and loading it into the cluster. These materialized views not only provide a landing zone for streaming data, but also offer the flexibility of incorporating SQL transforms and blending into your extract, load, and transform (ELT) pipeline for enhanced processing. For a deeper exploration on configuring and using streaming ingestion in Amazon Redshift, refer to Real-time analytics with Amazon Redshift streaming ingestion.

JSON data in Amazon Redshift

Amazon Redshift enables storage, processing, and analytics on JSON data through the SUPER data type, PartiQL language, materialized views, and data lake queries. The base construct to access streaming data in Amazon Redshift provides metadata from the source stream (attributes like stream timestamp, sequence numbers, refresh timestamp, and more) and the raw binary data from the stream itself. For streams that contain the raw binary data encoded in JSON format, Amazon Redshift provides a variety of tools for parsing and managing the data. For more information about the metadata of each stream format, refer to Getting started with streaming ingestion from Amazon Kinesis Data Streams and Getting started with streaming ingestion from Amazon Managed Streaming for Apache Kafka.

At the most basic level, Amazon Redshift allows parsing the raw data into distinct columns. The JSON_EXTRACT_PATH_TEXT and JSON_EXTRACT_ARRAY_ELEMENT_TEXT functions enable the extraction of specific details from JSON objects and arrays, transforming them into separate columns for analysis. When the structure of the JSON documents and specific reporting requirements are defined, these methods allow for pre-computing a materialized view with the exact structure needed for reporting, with improved compression and sorting for analytics.

In addition to this approach, the Amazon Redshift JSON functions allow storing and analyzing the JSON data in its original state using the adaptable SUPER data type. The function JSON_PARSE allows you to extract the binary data in the stream and convert it into the SUPER data type. With the SUPER data type and PartiQL language, Amazon Redshift extends its capabilities for semi-structured data analysis. It uses the SUPER data type for JSON data storage, offering schema flexibility within a column. For more information on using the SUPER data type, refer to Ingesting and querying semistructured data in Amazon Redshift. This dynamic capability simplifies data ingestion, storage, transformation, and analysis of semi-structured data, enriching insights from diverse sources within the Redshift environment.

Streaming data formats

Organizations using alternative serialization formats must explore different deserialization methods. In the next section, we dive into the optimal approach for deserialization. In this section, we take a closer look at the diverse formats and strategies organizations use to effectively manage their data. This understanding is key in determining the data parsing approach in Amazon Redshift.

Many organizations use a format other than JSON for their streaming use cases. JSON is a self-describing serialization format, where the schema of the data is stored alongside the actual data itself. This makes JSON flexible for applications, but this approach can lead to increased data transmission between applications due to the additional data contained in the JSON keys and syntax. Organizations seeking to optimize their serialization and deserialization performance, and their network communication between applications, may opt to use a format like Avro, Protobuf, or even a custom proprietary format to serialize application data into binary format in an optimized way. This provides the advantage of an efficient serialization where only the message values are packed into a binary message. However, this requires the consumer of the data to know what schema and protocol was used to serialize the data to deserialize the message. There are several ways that organizations can solve this problem, as illustrated in the following figure.

Visualization of different binary message serialization approaches

Embedded schema

In an embedded schema approach, the data format itself contains the schema information alongside the actual data. This means that when a message is serialized, it includes both the schema definition and the data values. This allows anyone receiving the message to directly interpret and understand its structure without needing to refer to an external source for schema information. Formats like JSON, MessagePack, and YAML are examples of embedded schema formats. When you receive a message in this format, you can immediately parse it and access the data with no additional steps.

Assumed schema

In an assumed schema approach, the message serialization contains only the data values, and there is no schema information included. To interpret the data correctly, the receiving application needs to have prior knowledge of the schema that was used to serialize the message. This is typically achieved by associating the schema with some identifier or context, like a stream name. When the receiving application reads a message, it uses this context to retrieve the corresponding schema and then decodes the binary data accordingly. This approach requires an additional step of schema retrieval and decoding based on context. This generally requires setting up a mapping in-code or in an external database so that consumers can dynamically retrieve the schemas based on stream metadata (such as the AWS Glue Schema Registry).

One drawback of this approach is in tracking schema versions. Although consumers can identify the relevant schema from the stream name, they can’t identify the particular version of the schema that was used. Producers need to ensure that they are making backward-compatible changes to schemas to ensure consumers aren’t disrupted when using a different schema version.

Embedded schema ID

In this case, the producer continues to serialize the data in binary format (like Avro or Protobuf), similar to the assumed schema approach. However, an additional step is involved: the producer adds a schema ID at the beginning of the message header. When a consumer processes the message, it starts by extracting the schema ID from the header. With this schema ID, the consumer then fetches the corresponding schema from a registry. Using the retrieved schema, the consumer can effectively parse the rest of the message. For example, the AWS Glue Schema Registry provides Java SDK SerDe libraries, which can natively serialize and deserialize messages in a stream using embedded schema IDs. Refer to How the schema registry works for more information about using the registry.

The usage of an external schema registry is common in streaming applications because it provides a number of benefits to consumers and developers. This registry contains all the message schemas for the applications and associates them with a unique identifier to facilitate schema retrieval. In addition, the registry may provide other functionalities like schema version change handling and documentation to facilitate application development.

The embedded schema ID in the message payload can contain version information, ensuring publishers and consumers are always using the same schema version to manage data. When schema version information isn’t available, schema registries can help enforce producers making backward-compatible changes to avoid causing issues in consumers. This helps decouple producers and consumers, provides schema validation at both the publisher and consumer stage, and allows for more flexibility in stream usage to allow for a variety of application requirements. Messages can be published with one schema per stream, or with multiple schemas inside a single stream, allowing consumers to dynamically interpret messages as they arrive.

For a deeper dive into the benefits of a schema registry, refer to Validate streaming data over Amazon MSK using schemas in cross-account AWS Glue Schema Registry.

Schema in file

For batch processing use cases, applications may embed the schema used to serialize the data into the data file itself to facilitate data consumption. This is an extension of the embedded schema approach but is less costly because the data file is generally larger, so the schema accounts for a proportionally smaller amount of the overall data. In this case, the consumers can process the data directly without additional logic. Amazon Redshift supports loading Avro data that has been serialized in this manner using the COPY command.

Convert non-JSON data to JSON

Organizations aiming to use non-JSON serialization formats need to develop an external method for parsing their messages outside of Amazon Redshift. We recommend using an AWS Lambda-based external user-defined function (UDF) for this process. Using an external Lambda UDF allows organizations to define arbitrary deserialization logic to support any message format, including embedded schema, assumed schema, and embedded schema ID approaches. Although Amazon Redshift supports defining Python UDFs natively, which may be a viable alternative for some use cases, we demonstrate the Lambda UDF approach in this post to cover more complex scenarios. For examples of Amazon Redshift UDFs, refer to AWS Samples on GitHub.

The basic architecture for this solution is as follows.

See the following code:

-- Step 1
CREATE OR REPLACE EXTERNAL FUNCTION fn_lambda_decode_avro_binary(varchar)
RETURNS varchar IMMUTABLE LAMBDA 'redshift-avro-udf';

-- Step 2
CREATE EXTERNAL SCHEMA kds FROM KINESIS

-- Step 3
CREATE MATERIALIZED VIEW {name} AUTO REFRESH YES AS
SELECT
    -- Step 4
   t.kinesis_data AS binary_avro,
   to_hex(binary_avro) AS hex_avro,
   -- Step 5
   fn_lambda_decode_avro_binary('{stream-name}', hex_avro) AS json_string,
   -- Step 6
   JSON_PARSE(json_string) AS super_data,
   t.sequence_number,
   t.refresh_time,
   t.approximate_arrival_timestamp,
   t.shard_id
FROM kds.{stream_name} AS t

Let’s explore each step in more detail.

Create the Lambda UDF

The overall goal is to develop a method that can accept the raw data as input and produce JSON-encoded data as an output. This aligns with the Amazon Redshift ability to natively process JSON into the SUPER data type. The specifics of the function depend on the serialization and streaming approach. For example, using the assumed schema approach with Avro format, your Lambda function may complete the following steps:

Take in the stream name and hexadecimal-encoded data as inputs.
Use the stream name to perform a lookup to identify the schema for the given stream name.
Decode the hexadecimal data into binary format.
Use the schema to deserialize the binary data into readable format.
Re-serialize the data into JSON format.

The f_glue_schema_registry_avro_to_json AWS samples example illustrates the process of decoding Avro using the assumed schema approach using the AWS Glue Schema Registry in a Lambda UDF to retrieve and use Avro schemas by stream name. For other approaches (such as embedded schema ID), you should author your Lambda function to handle deserialization as defined by your serialization process and schema registry implementation. If your application depends on an external schema registry or table lookup to process the message schema, we recommend that you implement caching for schema lookups to help reduce the load on the external systems and reduce the average Lambda function invocation duration.

When creating the Lambda function, make sure you accommodate the Amazon Redshift input event format and ensure compliance with the expected Amazon Redshift event output format. For details, refer to Creating a scalar Lambda UDF.

After you create and test the Lambda function, you can define it as a UDF in Amazon Redshift. For effective integration within Amazon Redshift, designate this Lambda function UDF as IMMUTABLE. This classification supports incremental materialized view updates. This treats the Lambda function as idempotent and minimizes the Lambda function costs for the solution, because a message doesn’t need to be processed if it has been processed before.

Configure the baseline Kinesis data stream

Regardless of your messaging format or approach (embedded schema, assumed schema, and embedded schema ID), you begin with setting up the external schema for streaming ingestion from your messaging source into Amazon Redshift. For more information, refer to Streaming ingestion.

CREATE EXTERNAL SCHEMA kds FROM KINESIS

IAM_ROLE 'arn:aws:iam::0123456789:role/redshift-streaming-role';

Create the raw materialized view

Next, you define your raw materialized view. This view contains the raw message data from the streaming source in Amazon Redshift VARBYTE format.

Convert the VARBYTE data to VARCHAR format

External Lambda function UDFs don’t support VARBYTE as an input data type. Therefore, you must convert the raw VARBYTE data from the stream into VARCHAR format to pass to the Lambda function. The best way to do this in Amazon Redshift is using the TO_HEX built-in method. This converts the binary data into hexadecimal-encoded character data, which can be sent to the Lambda UDF.

Invoke the Lambda function to retrieve JSON data

After the UDF has been defined, we can invoke the UDF to convert our hexadecimal-encoded data into JSON-encoded VARCHAR data.

Use the JSON_PARSE method to convert the JSON data to SUPER data type

Finally, we can use the Amazon Redshift native JSON parsing methods like JSON_PARSE, JSON_EXTRACT_PATH_TEXT, and more to parse the JSON data into a format that we can use for analytics.

Considerations

Consider the following when using this strategy:

Cost – Amazon Redshift invokes the Lambda function in batches to improve scalability and reduce the overall number of Lambda invocations. The cost of this solution depends on the number of messages in your stream, the frequency of the refresh, and the invocation time required to process the messages in a batch from Amazon Redshift. Using the IMMUTABLE UDF type in Amazon Redshift can also help minimize costs by utilizing the incremental refresh strategy for the materialized view.
Permissions and network access – The AWS Identity and Access Management (IAM) role used for the Amazon Redshift UDF must have permissions to invoke the Lambda function, and you must deploy the Lambda function such that it has access to invoke its external dependencies (for example, you may need to deploy it in a VPC to access private resources like a schema registry).
Monitoring – Use Lambda function logging and metrics to identify errors in deserialization, connection to the schema registry, and data processing. For details on monitoring the UDF Lambda function, refer to Embedding metrics within logs and Monitoring and troubleshooting Lambda functions.

Conclusion

In this post, we dove into different data formats and ingestion methods for a streaming use case. By exploring strategies for handling non-JSON data formats, we examined the use of Amazon Redshift streaming to seamlessly ingest, process, and analyze these formats in near-real time using materialized views.

Furthermore, we navigated through schema-per-stream, embedded schema, assumed schema, and embedded schema ID approaches, highlighting their merits and considerations. To bridge the gap between non-JSON formats and Amazon Redshift, we explored the creation of Lambda UDFs for data parsing and conversion. This approach offers a comprehensive means to integrate diverse data streams into Amazon Redshift for subsequent analysis.

As you navigate the ever-evolving landscape of data formats and analytics, we hope this exploration provides valuable guidance to derive meaningful insights from your data streams. We welcome any thoughts or questions in the comments section.

About the Authors

M Mehrtens has been working in distributed systems engineering throughout their career, working as a Software Engineer, Architect, and Data Engineer. In the past, M has supported and built systems to process terrabytes of streaming data at low latency, run enterprise Machine Learning pipelines, and created systems to share data across teams seamlessly with varying data toolsets and software stacks. At AWS, they are a Sr. Solutions Architect supporting US Federal Financial customers.

Sindhu Achuthan is a Sr. Solutions Architect with Federal Financials at AWS. She works with customers to provide architectural guidance on analytics solutions using AWS Glue, Amazon EMR, Amazon Kinesis, and other services. Outside of work, she loves DIYs, to go on long trails, and yoga.

How Chime Financial uses AWS to build a serverless stream analytics platform and defeat fraudsters

2023-09-19 Khandu Shinde

Post Syndicated from Khandu Shinde original https://aws.amazon.com/blogs/big-data/how-chime-financial-uses-aws-to-build-a-serverless-stream-analytics-platform-and-defeat-fraudsters/

This is a guest post by Khandu Shinde, Staff Software Engineer and Edward Paget, Senior Software Engineering at Chime Financial.

Chime is a financial technology company founded on the premise that basic banking services should be helpful, easy, and free. Chime partners with national banks to design member first financial products. This creates a more competitive market with better, lower-cost options for everyday Americans who aren’t being served well by traditional banks. We help drive innovation, inclusion, and access across the industry.

Chime has a responsibility to protect our members against unauthorized transactions on their accounts. Chime’s Risk Analysis team constantly monitors trends in our data to find patterns that indicate fraudulent transactions.

This post discusses how Chime utilizes AWS Glue, Amazon Kinesis, Amazon DynamoDB, and Amazon SageMaker to build an online, serverless fraud detection solution — the Chime Streaming 2.0 system.

Problem statement

In order to keep up with the rapid movement of fraudsters, our decision platform must continuously monitor user events and respond in real-time. However, our legacy data warehouse-based solution was not equipped for this challenge. It was designed to manage complex queries and business intelligence (BI) use cases on a large scale. However, with a minimum data freshness of 10 minutes, this architecture inherently didn’t align with the near real-time fraud detection use case.

To make high-quality decisions, we need to collect user event data from various sources and update risk profiles in real time. We also need to be able to add new fields and metrics to the risk profiles as our team identifies new attacks, without needing engineering intervention or complex deployments.

We decided to explore streaming analytics solutions where we can capture, transform, and store event streams at scale, and serve rule-based fraud detection models and machine learning (ML) models with milliseconds latency.

Solution overview

The following diagram illustrates the design of the Chime Streaming 2.0 system.

The design included the following key components:

We have Amazon Kinesis Data Streams as our streaming data service to capture and store event streams at scale. Our stream pipelines capture various event types, including user enrollment events, user login events, card swipe events, peer-to-peer payments, and application screen actions.
Amazon DynamoDB is another data source for our Streaming 2.0 system. It acts as the application backend and stores data such as blocked devices list and device-user mapping. We mainly use it as lookup tables in our pipeline.
AWS Glue jobs form the backbone of our Streaming 2.0 system. The simple AWS Glue icon in the diagram represents thousands of AWS Glue jobs performing different transformations. To achieve the 5-15 seconds end-to-end data freshness service level agreement (SLA) for the Steaming 2.0 pipeline, we use streaming ETL jobs in AWS Glue to consume data from Kinesis Data Streams and apply near-real-time transformation. We choose AWS Glue mainly due to its serverless nature, which simplifies infrastructure management with automatic provisioning and worker management, and the ability to perform complex data transformations at scale.
The AWS Glue streaming jobs generate derived fields and risk profiles that get stored in Amazon DynamoDB. We use Amazon DynamoDB as our online feature store due to its millisecond performance and scalability.
Our applications call Amazon SageMaker Inference endpoints for fraud detections. The Amazon DynamoDB online feature store supports real-time inference with single digit millisecond query latency.
We use Amazon Simple Storage Service (Amazon S3) as our offline feature store. It contains historical user activities and other derived ML features.
Our data scientist team can access the dataset and perform ML model training and batch inferencing using Amazon SageMaker.

AWS Glue pipeline implementation deep dive

There are several key design principles for our AWS Glue Pipeline and the Streaming 2.0 project.

We want to democratize our data platform and make the data pipeline accessible to all Chime developers.
We want to implement cloud financial backend services and achieve cost efficiency.

To achieve data democratization, we needed to enable different personas in the organization to use the platform and define transformation jobs quickly, without worrying about the actual implementation details of the pipelines. The data infrastructure team built an abstraction layer on top of Spark and integrated services. This layer contained API wrappers over integrated services, job tags, scheduling configurations and debug tooling, hiding Spark and other lower-level complexities from end users. As a result, end users were able to define jobs with declarative YAML configurations and define transformation logic with SQL. This simplified the onboarding process and accelerated the implementation phase.

To achieve cost efficiency, our team built a cost attribution dashboard based on AWS cost allocation tags. We enforced tagging with the above abstraction layer and had clear cost attribution for all AWS Glue jobs down to the team level. This enabled us to track down less optimized jobs and work with job owners to implement best practices with impact-based priority. One common misconfiguration we found was sizing of AWS Glue jobs. With data democratization, many users lacked the knowledge to right-size their AWS Glue jobs. The AWS team introduced AWS Glue auto scaling to us as a solution. With AWS Glue Auto Scaling, we no longer needed to plan AWS Glue Spark cluster capacity in advance. We could just set the maximum number of workers and run the jobs. AWS Glue monitors the Spark application execution, and allocates more worker nodes to the cluster in near-real time after Spark requests more executors based on our workload requirements. We noticed a 30–45% cost saving across our AWS Glue Jobs once we turned on Auto Scaling.

Conclusion

In this post, we showed you how Chime’s Streaming 2.0 system allows us to ingest events and make them available to the decision platform just seconds after they are emitted from other services. This enables us to write better risk policies, provide fresher data for our machine learning models, and protect our members from unauthorized transactions on their accounts.

Over 500 developers in Chime are using this streaming pipeline and we ingest more than 1 million events per second. We follow the sizing and scaling process from the AWS Glue streaming ETL jobs best practices blog and land on a 1:1 mapping between Kinesis Shard and vCPU core. The end-to-end latency is less than 15 seconds, and it improves the model score calculation speed by 1200% compared to legacy implementation. This system has proven to be reliable, performant, and cost-effective at scale.

We hope this post will inspire your organization to build a real-time analytics platform using serverless technologies to accelerate your business goals.

About the Authors

Khandu Shinde is a Staff Engineer focused on Big Data Platforms and Solutions for Chime. He helps to make the platform scalable for Chime’s business needs with architectural direction and vision. He’s based in San Francisco where he plays cricket and watches movies.

Edward Paget is a Software Engineer working on building Chime’s capabilities to mitigate risk to ensure our members’ financial peace of mind. He enjoys being at the intersection of big data and programming language theory. He’s based in Chicago where he spends his time running along the lake shore.

Dylan Qu is a Specialist Solutions Architect focused on Big Data & Analytics with Amazon Web Services. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS.

Perform Amazon Kinesis load testing with Locust

2023-08-10 Luis Morales

Post Syndicated from Luis Morales original https://aws.amazon.com/blogs/big-data/perform-amazon-kinesis-load-testing-with-locust/

Building a streaming data solution requires thorough testing at the scale it will operate in a production environment. Streaming applications operating at scale often handle large volumes of up to GBs per second, and it’s challenging for developers to simulate high-traffic Amazon Kinesis-based applications to generate such load easily.

Amazon Kinesis Data Streams and Amazon Kinesis Data Firehose are capable of capturing and storing terabytes of data per hour from numerous sources. Creating Kinesis data streams or Firehose delivery streams is straightforward through the AWS Management Console, AWS Command Line Interface (AWS CLI), or Kinesis API. However, generating a continuous stream of test data requires a custom process or script to run continuously. Although the Amazon Kinesis Data Generator (KDG) provides a user-friendly UI for this purpose, it has some limitations, such as bandwidth constraints and increased round trip latency. (For more information on the KDG, refer to Test Your Streaming Data Solution with the New Amazon Kinesis Data Generator.)

To overcome these limitations, this post describes how to use Locust, a modern load testing framework, to conduct large-scale load testing for a more comprehensive evaluation of the streaming data solution.

Overview

This project emits temperature sensor readings via Locust to Kinesis. We set up the Amazon Elastic Compute Cloud (Amazon EC2) Locust instance via the AWS Cloud Development Kit (AWS CDK) to load test Kinesis-based applications. You can access the Locust dashboard to perform and observe the load test and connect via Session Manager, a capability of AWS Systems Manager, for configuration changes. The following diagram illustrates this architecture.

Architecture overview

In our testing with the largest recommended instance (c7g.16xlarge), the setup was capable of emitting over 1 million events per second to Kinesis data streams in on-demand capacity mode, with a batch size (simulated users per Locust user) of 500. You can find more details on what this means and how to configure the load test later in this post.

Locust overview

Locust is an open-source, scriptable, and scalable performance testing tool that allows you to define user behavior using Python code. It offers an easy-to-use interface, making it developer-friendly and highly expandable. With its distributed and scalable design, Locust can simulate millions of simultaneous users to mimic real user behavior during a performance test.

Each Locust user represents a scenario or a specific set of actions that a real user might perform on your system. When you run a performance test with Locust, you can specify the number of concurrent Locust users you want to simulate, and Locust will create an instance for each user, allowing you to assess the performance and behavior of your system under different user loads.

For more information on Locust, refer to the Locust documentation.

Prerequisites

To get started, clone or download the code from the GitHub repository.

Test locally

To test Locust out locally first before deploying it to the cloud, you have to install the necessary Python dependencies. If you’re new to Python, refer the README for more information on getting started.

Navigate to the load-test directory and run the following code:

pip install -r requirements.txt

To send events to a Kinesis data stream from your local machine, you will need to have AWS credentials. For more information, refer to Configuration and credential file settings.

To perform the test locally, stay in the load-test directory and run the following code:

locust -f locust-load-test.py

You can now access the Locust dashboard via http://0.0.0.0:8089/. Enter the number of Locust users, the spawn rate (users added per second), and the target Amazon Kinesis data stream name for Host. By default, it deploys the Kinesis data stream DemoStream that you can use for testing.

Locust Dashboard - Enter details

To see the generated events logged, run the following command, which filters only Locust and root logs (for example, no Botocore logs):

locust -f locust-load-test.py --loglevel DEBUG 2&gt;&amp;1 | grep -E "(locust|root)"

Set up resources with the AWS CDK

The GitHub repository contains the AWS CDK code to create all the necessary resources for the load test. This removes opportunities for manual error, increases efficiency, and ensures consistent configurations over time. To deploy the resources, complete the following steps:

If not already downloaded, clone the GitHub repository to your local computer using the following command:

git clone https://github.com/aws-samples/amazon-kinesis-load-testing-with-locust

Download and install the latest Node.js.
Navigate to the root folder of the project and run the following command to install the latest version of AWS CDK:

npm install -g aws-cdk

Install the necessary dependencies:

npm install

Run cdk bootstrap to initialize the AWS CDK environment in your AWS account. Replace your AWS account ID and Region before running the following command:

cdk bootstrap

To learn more about the bootstrapping process, refer to Bootstrapping.

After the dependencies are installed, you can run the following command to deploy the stack of the AWS CDK template, which sets up the infrastructure within 5 minutes:

cdk deploy

The template sets up the Locust EC2 test instance, which is by default a c7g.xlarge instance, which at the time of publishing costs approximately $0.145 per hour in us-east-1. To find the most accurate pricing information, see Amazon EC2 On-Demand Pricing. You can find more details on how to change your instance size according to your scale of load testing later in this post.

It’s crucial to consider that the expenses incurred during load testing are not solely attributed to EC2 instance costs, but also heavily influenced by data transfer costs.

Accessing the Locust dashboard

You can access the dashboard by using the AWS CDK output KinesisLocustLoadTestingStack.locustdashboardurl to open the dashboard, for example http://1.2.3.4:8089.

The Locust dashboard is password protected. By default, it’s set to user name locust-user and password locust-dashboard-pwd.

With the default configuration, you can achieve up to 15,000 emitted events per second. Enter the number of Locust users (times the batch size), the spawn rate (users added per second), and the target Kinesis data stream name for Host.

Locust Dashboard - Enter details

After you have started the load test, you can look at the load test on the Charts tab.

Locust Dashboard - Charts

You can also monitor the load test on the Kinesis Data Streams console by navigating to the stream that you are load testing. If you used the default settings, navigate to DemoStream. On the detail page, choose the Monitoring tab to see the ingested load.

Kinesis Data Streams - Monitoring

Adapt workloads

By default, this project generates random temperature sensor readings for every sensor with the following format:

{
    "sensorId": "bfbae19c-2f0f-41c2-952b-5d5bc6e001f1_1",
    "temperature": 147.24,
    "status": "OK",
    "timestamp": 1675686126310
}

The project comes packaged with Faker, which you can use to adapt the payload to your needs. You just have to update the generate_sensor_reading function in the locust-load-test.py file:

class SensorAPIUser(KinesisBotoUser):
    # ...

    def generate_sensor_reading(self, sensor_id, sensor_reading):
        current_temperature = round(10 + random.random() * 170, 2)

        if current_temperature > 160:
            status = "ERROR"
        elif current_temperature > 140 or random.randrange(1, 100) > 80:
            status = random.choice(["WARNING", "ERROR"])
        else:
            status = "OK"

        return {
            'sensorId': f"{sensor_id}_{sensor_reading}",
            'temperature': current_temperature,
            'status': status,
            'timestamp': round(time.time()*1000)
        }

    # ...

Change configurations

After the initial deployment of the load testing tool, you can change configuration in two ways:

Connect to the EC2 instance, make any configuration and code changes, and restart the Locust process
Change the configuration and load testing code locally and redeploy it via cdk deploy

The first option helps you iterate more quickly on the remote instance without a need to redeploy. The latter uses the infrastructure as code (IaC) approach and makes sure that your configuration changes can be committed to your source control system. For a fast development cycle, it’s recommended to test your load test configuration locally first, connect to your instance to apply the changes, and after successful implementation, codify it as part of your IaC repository and then redeploy.

Locust is created on the EC2 instance as a systemd service and can therefore be controlled with systemctl. If you want to change the configuration of Locust as needed without redeploying the stack, you can connect to the instance via Systems Manager, navigate to the project directory on /usr/local/load-test, change the locust.env file, and restart the service by running sudo systemctl restart locust.

Large-scale load testing

This setup is capable of emitting over 1 million events per second to Kinesis data stream, with a batch size of 500 and 64 secondaries on a c7g.16xlarge.

To achieve peak performance with Locust and Kinesis, keep the following in mind:

Instance size – Your performance is bound by the underlying EC2 instance, so refer to EC2 instance type for more information about scaling. To set the correct instance size, you can configure the instance size in the file kinesis-locust-load-testing.ts.
Number of secondaries – Locust benefits from a distributed setup. Therefore, the setup spins up a primary, which does the coordination, and multiple secondaries, which do the actual work. To fully take advantage of the cores, you should specify one secondary per core. You can configure the number in the locust.env file.
Batch size – The amount of Kinesis data stream events you can send per Locust user is limited due to the resource overhead of switching Locust users and threads. To overcome this, you can configure a batch size to define how much users are simulated per Locust user. These are sent as a Kinesis data stream put_records call. You can configure the number in the locust.env file.

This setup is capable of emitting over 1 million events per second to the Kinesis data stream, with a batch size of 500 and 64 secondaries on a c7g.16xlarge instance.

Locust Dashboard - Large Scale Load Test Charts

You can observe this on the Monitoring tab for the Kinesis data stream as well.

Kinesis Data Stream - Large Scale Load Test Monitoring

Clean up

In order to not incur any unnecessary costs, delete the stack by running the following code:

cdk destroy

Summary

Kinesis is already popular for its ease of use among users building streaming applications. With this load testing capability using Locust, you can now test your workloads in a more straightforward and faster way. Visit the GitHub repo to embark on your testing journey.

The project is licensed under the Apache 2.0 license, providing the freedom to clone and modify it according to your needs. Furthermore, you can contribute to the project by submitting issues or pull requests via GitHub, fostering collaboration and improvement in the testing ecosystem.

About the author

Luis Morales works as Senior Solutions Architect with digital native businesses to support them in constantly reinventing themselves in the cloud. He is passionate about software engineering, cloud-native distributed systems, test-driven development, and all things code and security

Create an Apache Hudi-based near-real-time transactional data lake using AWS DMS, Amazon Kinesis, AWS Glue streaming ETL, and data visualization using Amazon QuickSight

2023-08-03 Raj Ramasubbu

Post Syndicated from Raj Ramasubbu original https://aws.amazon.com/blogs/big-data/create-an-apache-hudi-based-near-real-time-transactional-data-lake-using-aws-dms-amazon-kinesis-aws-glue-streaming-etl-and-data-visualization-using-amazon-quicksight/

With the rapid growth of technology, more and more data volume is coming in many different formats—structured, semi-structured, and unstructured. Data analytics on operational data at near-real time is becoming a common need. Due to the exponential growth of data volume, it has become common practice to replace read replicas with data lakes to have better scalability and performance. In most real-world use cases, it’s important to replicate the data from the relational database source to the target in real time. Change data capture (CDC) is one of the most common design patterns to capture the changes made in the source database and reflect them to other data stores.

We recently announced support for streaming extract, transform, and load (ETL) jobs in AWS Glue version 4.0, a new version of AWS Glue that accelerates data integration workloads in AWS. AWS Glue streaming ETL jobs continuously consume data from streaming sources, clean and transform the data in-flight, and make it available for analysis in seconds. AWS also offers a broad selection of services to support your needs. A database replication service such as AWS Database Migration Service (AWS DMS) can replicate the data from your source systems to Amazon Simple Storage Service (Amazon S3), which commonly hosts the storage layer of the data lake. Although it’s straightforward to apply updates on a relational database management system (RDBMS) that backs an online source application, it’s difficult to apply this CDC process on your data lakes. Apache Hudi, an open-source data management framework used to simplify incremental data processing and data pipeline development, is a good option to solve this problem.

This post demonstrates how to apply CDC changes from Amazon Relational Database Service (Amazon RDS) or other relational databases to an S3 data lake, with flexibility to denormalize, transform, and enrich the data in near-real time.

Solution overview

We use an AWS DMS task to capture near-real-time changes in the source RDS instance, and use Amazon Kinesis Data Streams as a destination of the AWS DMS task CDC replication. An AWS Glue streaming job reads and enriches changed records from Kinesis Data Streams and performs an upsert into the S3 data lake in Apache Hudi format. Then we can query the data with Amazon Athena visualize it in Amazon QuickSight. AWS Glue natively supports continuous write operations for streaming data to Apache Hudi-based tables.

The following diagram illustrates the architecture used for this post, which is deployed through an AWS CloudFormation template.

Prerequisites

Before you get started, make sure you have the following prerequisites:

An AWS account
A basic understanding of Amazon S3
A basic understanding of QuickSight to create dashboards
An AWS Identity and Access Management (IAM) role with permissions to create the Amazon RDS database, AWS DMS instance and tasks, Kinesis data stream, S3 buckets, AWS Glue job, AWS Glue Data Catalog, and QuickSight dashboards, and run SQL queries using Athena (see Adding and removing IAM identity permissions for reference)

Source data overview

To illustrate our use case, we assume a data analyst persona who is interested in analyzing near-real-time data for sport events using the table ticket_activity. An example of this table is shown in the following screenshot.

Apache Hudi connector for AWS Glue

For this post, we use AWS Glue 4.0, which already has native support for the Hudi framework. Hudi, an open-source data lake framework, simplifies incremental data processing in data lakes built on Amazon S3. It enables capabilities including time travel queries, ACID (Atomicity, Consistency, Isolation, Durability) transactions, streaming ingestion, CDC, upserts, and deletes.

Set up resources with AWS CloudFormation

This post includes a CloudFormation template for a quick setup. You can review and customize it to suit your needs.

The CloudFormation template generates the following resources:

An RDS database instance (source).
An AWS DMS replication instance, used to replicate the data from the source table to Kinesis Data Streams.
A Kinesis data stream.
Four AWS Glue Python shell jobs:
- rds-ingest-rds-setup-<CloudFormation Stack name> – creates one source table called ticket_activity on Amazon RDS.
- rds-ingest-data-initial-<CloudFormation Stack name> – Sample data is automatically generated at random by the Faker library and loaded to the ticket_activity table.
- rds-ingest-data-incremental-<CloudFormation Stack name> – Ingests new ticket activity data into the source table ticket_activity continuously. This job simulates customer activity.
- rds-upsert-data-<CloudFormation Stack name> – Upserts specific records in the source table ticket_activity. This job simulates administrator activity.
AWS Identity and Access Management (IAM) users and policies.
An Amazon VPC, a public subnet, two private subnets, internet gateway, NAT gateway, and route tables.
- We use private subnets for the RDS database instance and AWS DMS replication instance.
- We use the NAT gateway to have reachability to pypi.org to use the MySQL connector for Python from the AWS Glue Python shell jobs. It also provides reachability to Kinesis Data Streams and an Amazon S3 API endpoint

To set up these resources, you must have the following prerequisites:

IAM roles dms-vpc-role, dms-cloudwatch-logs-role, and dms-access-for-endpoint. If you haven’t used AWS DMS before, you need to create these special IAM roles via the IAM console or the AWS Command Line Interface (AWS CLI). For instructions, see Creating the IAM roles to use with the AWS CLI and AWS DMS API.
If you already deselected Use only IAM access control for new databases and Use only IAM access control for new table in new databases on the AWS Lake Formation console Settings page, you need to select these two check boxes again and save your settings. For more information, see Changing the default settings for your data lake.

The following diagram illustrates the architecture of our provisioned resources.

To launch the CloudFormation stack, complete the following steps:

Sign in to the AWS CloudFormation console.
Choose Launch Stack
Choose Next.
For S3BucketName, enter the name of your new S3 bucket.
For VPCCIDR, enter a CIDR IP address range that doesn’t conflict with your existing networks.
For PublicSubnetCIDR, enter the CIDR IP address range within the CIDR you gave for VPCCIDR.
For PrivateSubnetACIDR and PrivateSubnetBCIDR, enter the CIDR IP address range within the CIDR you gave for VPCCIDR.
For SubnetAzA and SubnetAzB, choose the subnets you want to use.
For DatabaseUserName, enter your database user name.
For DatabaseUserPassword, enter your database user password.
Choose Next.
On the next page, choose Next.
Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
Choose Create stack.

Stack creation can take about 20 minutes.

Set up an initial source table

The AWS Glue job rds-ingest-rds-setup-<CloudFormation stack name> creates a source table called event on the RDS database instance. To set up the initial source table in Amazon RDS, complete the following steps:

On the AWS Glue console, choose Jobs in the navigation pane.
Choose rds-ingest-rds-setup-<CloudFormation stack name> to open the job.
Choose Run.
Navigate to the Runs tab and wait for Run status to show as SUCCEEDED.

This job will only create the one table, ticket_activity, in the MySQL instance (DDL). See the following code:

CREATE TABLE ticket_activity (
ticketactivity_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
sport_type VARCHAR(256) NOT NULL,
start_date DATETIME NOT NULL,
location VARCHAR(256) NOT NULL,
seat_level VARCHAR(256) NOT NULL,
seat_location VARCHAR(256) NOT NULL,
ticket_price INT NOT NULL,
customer_name VARCHAR(256) NOT NULL,
email_address VARCHAR(256) NOT NULL,
created_at DATETIME NOT NULL,
updated_at DATETIME NOT NULL )

Ingest new records

In this section, we detail the steps to ingest new records. Implement following steps to star the execution of the jobs.

Start data ingestion to Kinesis Data Streams using AWS DMS

To start data ingestion from Amazon RDS to Kinesis Data Streams, complete the following steps:

On the AWS DMS console, choose Database migration tasks in the navigation pane.
Select the task rds-to-kinesis-<CloudFormation stack name>.
On the Actions menu, choose Restart/Resume.
Wait for the status to show as Load complete and Replication ongoing.

The AWS DMS replication task ingests data from Amazon RDS to Kinesis Data Streams continuously.

Start data ingestion to Amazon S3

Next, to start data ingestion from Kinesis Data Streams to Amazon S3, complete the following steps:

On the AWS Glue console, choose Jobs in the navigation pane.
Choose streaming-cdc-kinesis2hudi-<CloudFormation stack name> to open the job.
Choose Run.

Do not stop this job; you can check the run status on the Runs tab and wait for it to show as Running.

Start the data load to the source table on Amazon RDS

To start data ingestion to the source table on Amazon RDS, complete the following steps:

On the AWS Glue console, choose Jobs in the navigation pane.
Choose rds-ingest-data-initial-<CloudFormation stack name> to open the job.
Choose Run.
Navigate to the Runs tab and wait for Run status to show as SUCCEEDED.

Validate the ingested data

After about 2 minutes from starting the job, the data should be ingested into the Amazon S3. To validate the ingested data in the Athena, complete the following steps:

On the Athena console, complete the following steps if you’re running an Athena query for the first time:
- On the Settings tab, choose Manage.
- Specify the stage directory and the S3 path where Athena saves the query results.
- Choose Save.

On the Editor tab, run the following query against the table to check the data:

SELECT * FROM "database_<account_number>_hudi_cdc_demo"."ticket_activity" limit 10;

Note that AWS Cloud Formation will create the database with the account number as database_<your-account-number>_hudi_cdc_demo.

Update existing records

Before you update the existing records, note down the ticketactivity_id value of a record from the ticket_activity table. Run the following SQL using Athena. For this post, we use ticketactivity_id = 46 as an example:

SELECT * FROM "database_<account_number>_hudi_cdc_demo"."ticket_activity" limit 10;

To simulate a real-time use case, update the data in the source table ticket_activity on the RDS database instance to see that the updated records are replicated to Amazon S3. Complete the following steps:

On the AWS Glue console, choose Jobs in the navigation pane.
Choose rds-ingest-data-incremental-<CloudFormation stack name> to open the job.
Choose Run.
Choose the Runs tab and wait for Run status to show as SUCCEEDED.

To upsert the records in the source table, complete the following steps:

On the AWS Glue console, choose Jobs in the navigation pane.
Choose the job rds-upsert-data-<CloudFormation stack name>.
On the Job details tab, under Advanced properties, for Job parameters, update the following parameters:
- For Key, enter --ticketactivity_id.
- For Value, replace 1 with one of the ticket IDs you noted above (for this post, 46).

Choose Save.
Choose Run and wait for the Run status to show as SUCCEEDED.

This AWS Glue Python shell job simulates a customer activity to buy a ticket. It updates a record in the source table ticket_activity on the RDS database instance using the ticket ID passed in the job argument --ticketactivity_id. It will update ticket_price=500 and updated_at with the current timestamp.

To validate the ingested data in Amazon s3, run the same query from Athena and check the ticket_activity value you noted earlier to observe the ticket_price and updated_at fields:

SELECT * FROM "database_<account_number>_hudi_cdc_demo"."ticket_activity" where ticketactivity_id = 46 ;

Visualize the data in QuickSight

After you have the output file generated by the AWS Glue streaming job in the S3 bucket, you can use QuickSight to visualize the Hudi data files. QuickSight is a scalable, serverless, embeddable, ML-powered business intelligence (BI) service built for the cloud. QuickSight lets you easily create and publish interactive BI dashboards that include ML-powered insights. QuickSight dashboards can be accessed from any device and seamlessly embedded into your applications, portals, and websites.

Build a QuickSight dashboard

To build a QuickSight dashboard, complete the following steps:

Open the QuickSight console.

You’re presented with the QuickSight welcome page. If you haven’t signed up for QuickSight, you may have to complete the signup wizard. For more information, refer to Signing up for an Amazon QuickSight subscription.

After you have signed up, QuickSight presents a “Welcome wizard.” You can view the short tutorial, or you can close it.

On the QuickSight console, choose your user name and choose Manage QuickSight.
Choose Security & permissions, then choose Manage.
Select Amazon S3 and select the buckets that you created earlier with AWS CloudFormation.
Select Amazon Athena.
Choose Save.
If you changed your Region during the first step of this process, change it back to the Region that you used earlier during the AWS Glue jobs.

Create a dataset

Now that you have QuickSight up and running, you can create your dataset. Complete the following steps:

On the QuickSight console, choose Datasets in the navigation pane.
Choose New dataset.
Choose Athena.
For Data source name, enter a name (for example, hudi-blog).
Choose Validate.
After the validation is successful, choose Create data source.
For Database, choose database_<your-account-number>_hudi_cdc_demo.
For Tables, select ticket_activity.
Choose Select.
Choose Visualize.
Choose hour and then ticket_activity_id to get the count of ticket_activity_id by hour.

Clean up

To clean up your resources, complete the following steps:

Stop the AWS DMS replication task rds-to-kinesis-<CloudFormation stack name>.
Navigate to the RDS database and choose Modify.
Deselect Enable deletion protection, then choose Continue.
Stop the AWS Glue streaming job streaming-cdc-kinesis2redshift-<CloudFormation stack name>.
Delete the CloudFormation stack.
On the QuickSight dashboard, choose your user name, then choose Manage QuickSight.
Choose Account settings, then choose Delete account.
Choose Delete account to confirm.
Enter confirm and choose Delete account.

Conclusion

In this post, we demonstrated how you can stream data—not only new records, but also updated records from relational databases—to Amazon S3 using an AWS Glue streaming job to create an Apache Hudi-based near-real-time transactional data lake. With this approach, you can easily achieve upsert use cases on Amazon S3. We also showcased how to visualize the Apache Hudi table using QuickSight and Athena. As a next step, refer to the Apache Hudi performance tuning guide for a high-volume dataset. To learn more about authoring dashboards in QuickSight, check out the QuickSight Author Workshop.

About the Authors

Raj Ramasubbu is a Sr. Analytics Specialist Solutions Architect focused on big data and analytics and AI/ML with Amazon Web Services. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS. Raj provided technical expertise and leadership in building data engineering, big data analytics, business intelligence, and data science solutions for over 18 years prior to joining AWS. He helped customers in various industry verticals like healthcare, medical devices, life science, retail, asset management, car insurance, residential REIT, agriculture, title insurance, supply chain, document management, and real estate.

Rahul Sonawane is a Principal Analytics Solutions Architect at AWS with AI/ML and Analytics as his area of specialty.

Sundeep Kumar is a Sr. Data Architect, Data Lake at AWS, helping customers build data lake and analytics platform and solutions. When not building and designing data lakes, Sundeep enjoys listening music and playing guitar.

Amazon Kinesis Data Streams on-demand capacity mode now scales up to 1 GB/second ingest capacity

2023-08-02 Nihar Sheth

Post Syndicated from Nihar Sheth original https://aws.amazon.com/blogs/big-data/amazon-kinesis-data-streams-on-demand-capacity-mode-now-scales-up-to-1-gb-second-ingest-capacity/

Amazon Kinesis Data Streams is a serverless data streaming service that makes it easy to capture, process, and store streaming data at any scale. As customers collect and stream more types of data, they have asked for simpler, elastic data streams that can handle variable and unpredictable data traffic. In November 2021, Amazon Web Services launched the on-demand capacity mode for Kinesis Data Streams, which is capable of serving gigabytes of write and read throughput per minute and helps reduce the operational pain point of manually updating data stream capacity. You can create a new on-demand data stream or convert an existing data stream to on-demand mode with a single click and never have to provision and manage servers, storage, or throughput. By default, on-demand capacity mode can automatically scale up to 200 MB/s of write throughput.

We were encouraged by customers’ adoption of on-demand capacity mode, but as customers scaled their workloads, some ran into the 200 MB/s data ingestion limit and asked for a solution. The team worked backward from customer feedback to raise that limit. As of March 2023, Kinesis Data Streams supports an increased on-demand write throughput limit to 1 GB/s, a five-times increase from the current limit of 200 MB/s. It’s like having a truly serverless and elastic data streaming service that works for all your use cases. If you require an increase in capacity, you can contact AWS Support to enable on-demand streams to scale up to 1 GB/s write throughput for each requested account. You pay for throughput consumed rather than for provisioned resources, making it easier to balance costs and performance. Overall, if your data volume can spike unpredictably or you don’t want to manage the number of shards, use on-demand streams.

In this post, we explore how to use Kinesis Data Streams on-demand scaling and best practices to build an efficient data-streaming solution. We discuss different scenarios to avoid write throughput exceptions and scale ingest capacity of Kinesis Data Streams to 1 GB/s in on-demand capacity mode.

Kinesis Data Streams on-demand scaling

A shard serves as a base throughput unit of Kinesis Data Streams. A shard supports 1 MB/s and 1,000 records/s for writes and 2 MB/s for reads. The shard limits ensure predictable performance, making it easy to design and operate a highly reliable data streaming workflow. In on-demand capacity mode, scaling happens at the individual shard level. When the average ingest shard utilization reaches 50% (0.5 MB/s or 500 records/s) in 1 minute, then a shard is split into two shards. If you use random values as a partition key, all shards of the stream will have even traffic, and they will be scaled at the same time. If you use a business-specific key as a partition key, the shards will have uneven traffic. In that scenario, only the shards exceeding an average of 50% utilization will be scaled. Depending upon the number of shards being scaled, it will take up to 15 minutes to split the shards.

When we create a new Kinesis data stream in on-demand capacity mode, by default, Kinesis Data Streams provisions four shards, which provides 4 MB/s write and 8 MB/s read throughput. As the workload ramps up, Kinesis Data Streams increases the number of shards in the stream by monitoring ingest throughput at the shard level. The 4 MB/s default ingest throughput and scaling at shard level in on-demand capacity mode works for most use cases. However, in some specific scenarios, producers may face WriteThroughputExceeded and Rate Exceeded errors, even in on-demand capacity mode. We discuss a few of these scenarios in the following sections and strategies to avoid these errors.

You can create and save record templates and easily send data to Kinesis Data Streams using the Amazon Kinesis Data Generator (KDG) to test the streaming data solution. Alternatively, you can also use the modern load testing framework Locust to run large-scale Kinesis Data Streams load testing. For this post, we use the Locust tool to produce and ingest messages in Kinesis Data Streams for our different use cases.

Scenario 1: A baseline ingest throughput greater than 4 MB/s is needed

To simulate this scenario, run the following AWS Command Line Interface (AWS CLI) command to create the kds-od-default-shards data stream in on-demand capacity mode:

aws kinesis create-stream --stream-name kds-od-default-shards --stream-mode-details StreamMode=ON_DEMAND --region us-east-1

When the kds-od-default-shards data stream is active, run following AWS CLI command to check the number of shards in the data stream:

aws kinesis describe-stream-summary --stream-name kds-od-default-shards --region us-east-1

You can observe that the OpenShardCount value is 4, which means the kds-od-default-shards data stream has an ingest capacity of 4 MB/s.

Next, we use the Locust tool to set the baseline to approximately 25 MB/s records. As displayed in the following Amazon CloudWatch metrics graph, records are getting throttled for the first couple of minutes. Then the kds-od-default-shards data stream scales the number of shards to support 25 MB/s ingest throughput, and records stop getting throttled. You can also rerun the describe-stream-summary AWS CLI command to check the increased number of shards in the data stream.

BDB-3047-scenario-1-incoming-data

BDB-3047-scenario-1-record-throttle

In a scenario where we know our ingest throughput baseline (25 MB/s) ahead of the time and we don’t want to observe any write throttles, we can create a stream in provisioned mode by specifying the number of shards (30), as shown in the following AWS CLI command (make sure to delete kds-od-default-shards manually from the Kinesis Data Streams console before running the following command):

aws kinesis create-stream --stream-name kds-od-default-shards --stream-mode-details StreamMode=PROVISIONED --shard-count 30 --region us-east-1

When the kds-od-default-shards data stream is active, run the following AWS CLI command to convert the data stream’s capacity mode to on-demand:

aws kinesis update-stream-mode --stream-arn arn:aws:kinesis:us-east-1:<AccountId>:stream/kds-od-default-shards --stream-mode-details StreamMode=ON_DEMAND --region us-east-1

Next, we send 25 MB/s records to the kds-od-default-shards data stream. As displayed in the following CloudWatch metrics graph, we can observe no write throttles, and the kds-od-default-shards data stream scales the number of shards to handle the increase in ingest volume.

BDB-3047-scenario-1-incoming-data1

BDB-3047-scenario-1-record-throttle1

After we send 25 MB/s traffic to the data stream for some time, we can run following AWS CLI command to see that the OpenShardCount value is increased to more than 30 now:

aws kinesis describe-stream-summary --stream-name kds-od-default-shards --region us-east-1

Scenario 2: A significant ingestion spike is expected, which needs ingest throughput greater than the number of shards in the stream

To simulate the scenario, run the following AWS CLI command to create the kds-od-significant-spike data stream in on-demand capacity mode:

aws kinesis create-stream --stream-name kds-od-significant-spike --stream-mode-details StreamMode=ON_DEMAND --region us-east-1

As mentioned earlier, by default, the kds-od-significant-spike data stream will have four shards initially because this stream is created in on-demand mode. When the data stream is active, we send 4 MB/s ingest throughput initially and grow the ingest throughput by 30–50% every 5–10 minutes. As displayed in the following CloudWatch metrics graph, the kds-od-significant-spike data stream scales the number of shards to handle the increase in ingest volume.

After approximately 15 minutes, run the following AWS CLI command to find the OpenShardCount value (x) of the kds-od-significant-spike data stream. Then send (x * 2) MB/s ingest throughput in the data stream for 2–3 minutes and reduced ingest throughput to the prior level:

aws kinesis describe-stream-summary --stream-name kds-od-significant-spike --region us-east-1

As displayed in the following CloudWatch metrics graph, the records are getting throttled for a few minutes, and then the throttling goes away.

BDB-3047-scenario-2-incoming-data

BDB-3047-scenario-2-record-throttle

Typically, we face a significant spike scenario when running planned events, such as shopping holidays and product launches. To handle such scenarios, we can proactively change capacity mode from on-demand to provisioned. We can configure the number of shards and pick the ingest capacity we anticipate. After we successfully scale the number of shards to our desired peak capacity in provisioned capacity mode, we can change the capacity mode back to on-demand mode.

Scenario 3: A single partition key starts pushing more than 1 MB/s

Partition keys are used to segregate and route records to different shards of a stream. A partition key is specified by the data producer while adding data to the data stream. For example, let’s assume we have a stream with two shards (shard 1 and shard 2). We can configure the data producer to use two partition keys (key A and key B) so that all records with key A are added to shard 1 and all records with key B are added to shard 2. Choosing a partition key is a very important decision, and we should carefully pick the partition key to ensure equal distribution of records across all the shards of the stream. Messages tied to a single partition key A will be sent to a single shard (shard 1), and at any given instance, messages tied to a single partition key A cannot be distributed across different shards. As mentioned earlier, by default, one shard supports 1 MB/s and 1,000 records/s for writes, and we may end up with an edge case scenario where we are trying to push more than 1 MB/s for a specific partition key. In this scenario, producers will continue to experience throttles and keep retrying indefinitely.

To simulate the scenario, run the following AWS CLI command to create the kds-od-partition-key-throttle data stream in on-demand capacity mode:

aws kinesis create-stream --stream-name kds-od-partition-key-throttle --stream-mode-details StreamMode=ON_DEMAND --region us-east-1

As mentioned earlier, by default, the data stream will have four shards initially because this stream is created in on-demand mode. When the data stream is active, we send 1.5 MB/s ingest throughput continuously for the specific partition key A. As displayed in the following CloudWatch metrics graph, we can observe that throttling continues from a single shard even if we are sending 1.5 MB/s ingest throughput, and the kds-od-partition-key-throttle data stream has an overall ingest capacity of 4 MB/s.

BDB-3047-scenario-3-incoming-data

BDB-3047-scenario-3-record-throttle

To avoid this scenario, we should carefully pick our partition key and ensure that this specific partition key won’t be continuously sending more than 1 MB/s ingest throughput in the data stream.

Scale the ingest capacity of Kinesis Data Streams to 1 GB/s in on-demand capacity mode

To test, we start with approximately 100 MB/s baseline ingest throughput to Kinesis Data Streams in on-demand capacity mode, then we increase ingest throughput rate by 30–50% every 5–10 minutes using Locust load testing tool.

To set up the scenario, first create the kds-od-1gb-stream data stream in provisioned capacity mode and provide a value of 120 for the provisioned shards field:

aws kinesis create-stream --stream-name kds-od-1gb-stream --stream-mode-details StreamMode=PROVISIONED --shard-count 120 --region us-east-1

When the kds-od-1gb-stream data stream is active, switch its capacity mode to on-demand, as shown in the following code. When we change capacity mode from provisioned to on-demand, the shard count (120) remains the same for the data stream even in on-demand capacity mode.

aws kinesis update-stream-mode --stream-arn arn:aws:kinesis:us-east-1:<AccountId>:stream/kds-od-1gb-stream --stream-mode-details StreamMode=ON_DEMAND --region us-east-1

When the kds-od-1gb-stream data stream is in on-demand mode, start the experiment. We send approximately 100 MB/s baseline ingest throughput using the Locust tool and increase 30–50% ingest throughput every 5–10 minutes. As displayed in the following CloudWatch metrics graph, the kds-od-1gb-stream data stream seamlessly scaled to 1 GB/s in on-demand capacity mode. We can also observe that the producers didn’t encounter any write throttles while the data stream was scaling in on-demand capacity mode.

BDB-3047-scale-to-1-GB

Clean up

To avoid ongoing costs, delete all the data streams that you created as part of this post using the Kinesis Data Streams console.

Conclusion

This post demonstrated the on-demand scaling policy of Kinesis Data Streams with a few scenarios using best practices and showed how to scale ingest capacity to 1 GB/s in on-demand capacity mode. You can have an on-demand write throughput limit that is five times larger than the previous limit of 200 MB/s. Choose on-demand mode if you create new data streams with unknown workloads, have unpredictable application traffic, or prefer not to manage capacity. You can switch between on-demand and provisioned capacity modes two times per 24-hour rolling period. Please leave any feedback in the comments section.

About the Authors

Nihar Sheth is a Senior Product Manager on the Amazon Kinesis Data Streams team at Amazon Web Services. He is passionate about developing intuitive product experiences that solve complex customer problems and enable customers to achieve their business goals.

Nisha Dekhtawala is a Partner Solutions Architect and data analytics specialist. She works with global consulting partners as their trusted advisor, providing technical guidance and support in building Well-Architected innovative industry solutions.

A side-by-side comparison of Apache Spark and Apache Flink for common streaming use cases

2023-07-28 Deepthi Mohan

Post Syndicated from Deepthi Mohan original https://aws.amazon.com/blogs/big-data/a-side-by-side-comparison-of-apache-spark-and-apache-flink-for-common-streaming-use-cases/

Apache Flink and Apache Spark are both open-source, distributed data processing frameworks used widely for big data processing and analytics. Spark is known for its ease of use, high-level APIs, and the ability to process large amounts of data. Flink shines in its ability to handle processing of data streams in real-time and low-latency stateful computations. Both support a variety of programming languages, scalable solutions for handling large amounts of data, and a wide range of connectors. Historically, Spark started out as a batch-first framework and Flink began as a streaming-first framework.

In this post, we share a comparative study of streaming patterns that are commonly used to build stream processing applications, how they can be solved using Spark (primarily Spark Structured Streaming) and Flink, and the minor variations in their approach. Examples cover code snippets in Python and SQL for both frameworks across three major themes: data preparation, data processing, and data enrichment. If you are a Spark user looking to solve your stream processing use cases using Flink, this post is for you. We do not intend to cover the choice of technology between Spark and Flink because it’s important to evaluate both frameworks for your specific workload and how the choice fits in your architecture; rather, this post highlights key differences for use cases that both these technologies are commonly considered for.

Apache Flink offers layered APIs that offer different levels of expressiveness and control and are designed to target different types of use cases. The three layers of API are Process Functions (also known as the Stateful Stream Processing API), DataStream, and Table and SQL. The Stateful Stream Processing API requires writing verbose code but offers the most control over time and state, which are core concepts in stateful stream processing. The DataStream API supports Java, Scala, and Python and offers primitives for many common stream processing operations, as well as a balance between code verbosity or expressiveness and control. The Table and SQL APIs are relational APIs that offer support for Java, Scala, Python, and SQL. They offer the highest abstraction and intuitive, SQL-like declarative control over data streams. Flink also allows seamless transition and switching across these APIs. To learn more about Flink’s layered APIs, refer to layered APIs.

Apache Spark Structured Streaming offers the Dataset and DataFrames APIs, which provide high-level declarative streaming APIs to represent static, bounded data as well as streaming, unbounded data. Operations are supported in Scala, Java, Python, and R. Spark has a rich function set and syntax with simple constructs for selection, aggregation, windowing, joins, and more. You can also use the Streaming Table API to read tables as streaming DataFrames as an extension to the DataFrames API. Although it’s hard to draw direct parallels between Flink and Spark across all stream processing constructs, at a very high level, we could say Spark Structured Streaming APIs are equivalent to Flink’s Table and SQL APIs. Spark Structured Streaming, however, does not yet (at the time of this writing) offer an equivalent to the lower-level APIs in Flink that offer granular control of time and state.

Both Flink and Spark Structured Streaming (referenced as Spark henceforth) are evolving projects. The following table provides a simple comparison of Flink and Spark capabilities for common streaming primitives (as of this writing).

.	Flink	Spark
Row-based processing	Yes	Yes
User-defined functions	Yes	Yes
Fine-grained access to state	Yes, via DataStream and low-level APIs	No
Control when state eviction occurs	Yes, via DataStream and low-level APIs	No
Flexible data structures for state storage and querying	Yes, via DataStream and low-level APIs	No
Timers for processing and stateful operations	Yes, via low level APIs	No

In the following sections, we cover the greatest common factors so that we can showcase how Spark users can relate to Flink and vice versa. To learn more about Flink’s low-level APIs, refer to Process Function. For the sake of simplicity, we cover the four use cases in this post using the Flink Table API. We use a combination of Python and SQL for an apples-to-apples comparison with Spark.

Data preparation

In this section, we compare data preparation methods for Spark and Flink.

Reading data

We first look at the simplest ways to read data from a data stream. The following sections assume the following schema for messages:

symbol: string,
price: int,
timestamp: timestamp,
company_info:
{
    name: string,
    employees_count: int
}

Reading data from a source in Spark Structured Streaming

In Spark Structured Streaming, we use a streaming DataFrame in Python that directly reads the data in JSON format:

spark = ...  # spark session

# specify schema
stock_ticker_schema = ...

# Create a streaming DataFrame
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "mybroker1:port") \
    .option("topic", "stock_ticker") \
    .load()
    .select(from_json(col("value"), stock_ticker_schema).alias("ticker_data")) \
    .select(col("ticker_data.*"))

Note that we have to supply a schema object that captures our stock ticker schema (stock_ticker_schema). Compare this to the approach for Flink in the next section.

Reading data from a source using Flink Table API

For Flink, we use the SQL DDL statement CREATE TABLE. You can specify the schema of the stream just like you would any SQL table. The WITH clause allows us to specify the connector to the data stream (Kafka in this case), the associated properties for the connector, and data format specifications. See the following code:

# Create table using DDL

CREATE TABLE stock_ticker (
  symbol string,
  price INT,
  timestamp TIMESTAMP(3),
  company_info STRING,
  WATERMARK FOR timestamp AS timestamp - INTERVAL '3' MINUTE
) WITH (
 'connector' = 'kafka',
 'topic' = 'stock_ticker',
 'properties.bootstrap.servers' = 'mybroker1:port',
 'properties.group.id' = 'testGroup',
 'format' = 'json',
 'json.fail-on-missing-field' = 'false',
 'json.ignore-parse-errors' = 'true'
)

JSON flattening

JSON flattening is the process of converting a nested or hierarchical JSON object into a flat, single-level structure. This converts multiple levels of nesting into an object where all the keys and values are at the same level. Keys are combined using a delimiter such as a period (.) or underscore (_) to denote the original hierarchy. JSON flattening is useful when you need to work with a more simplified format. In both Spark and Flink, nested JSONs can be complicated to work with and may need additional processing or user-defined functions to manipulate. Flattened JSONs can simplify processing and improve performance due to reduced computational overhead, especially with operations like complex joins, aggregations, and windowing. In addition, flattened JSONs can help in easier debugging and troubleshooting data processing pipelines because there are fewer levels of nesting to navigate.

JSON flattening in Spark Structured Streaming

JSON flattening in Spark Structured Streaming requires you to use the select method and specify the schema that you need flattened. JSON flattening in Spark Structured Streaming involves specifying the nested field name that you’d like surfaced to the top-level list of fields. In the following example, company_info is a nested field and within company_info, there’s a field called company_name. With the following query, we’re flattening company_info.name to company_name:

stock_ticker_df = ...  # Streaming DataFrame w/ schema shown above

stock_ticker_df.select("symbol", "timestamp", "price", "company_info.name" as "company_name")

JSON flattening in Flink

In Flink SQL, you can use the JSON_VALUE function. Note that you can use this function only in Flink versions equal to or greater than 1.14. See the following code:

SELECT
   symbol,
   timestamp,
   price,
   JSON_VALUE(company_info, 'lax $.name' DEFAULT NULL ON EMPTY) AS company_name
FROM
   stock_ticker

The term lax in the preceding query has to do with JSON path expression handling in Flink SQL. For more information, refer to System (Built-in) Functions.

Data processing

Now that you have read the data, we can look at a few common data processing patterns.

Deduplication

Data deduplication in stream processing is crucial for maintaining data quality and ensuring consistency. It enhances efficiency by reducing the strain on the processing from duplicate data and helps with cost savings on storage and processing.

Spark Streaming deduplication query

The following code snippet is related to a Spark Streaming DataFrame named stock_ticker. The code performs an operation to drop duplicate rows based on the symbol column. The dropDuplicates method is used to eliminate duplicate rows in a DataFrame based on one or more columns.

stock_ticker = ...  # Streaming DataFrame w/ schema shown above

stock_ticker.dropDuplicates("symbol")

Flink deduplication query

The following code shows the Flink SQL equivalent to deduplicate data based on the symbol column. The query retrieves the first row for each distinct value in the symbol column from the stock_ticker stream, based on the ascending order of proctime:

SELECT symbol, timestamp, price
FROM (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY symbol ORDER BY proctime ASC) AS row_num
  FROM stock_ticker)
WHERE row_num = 1

Windowing

Windowing in streaming data is a fundamental construct to process data within specifications. Windows commonly have time bounds, number of records, or other criteria. These time bounds bucketize continuous unbounded data streams into manageable chunks called windows for processing. Windows help in analyzing data and gaining insights in real time while maintaining processing efficiency. Analyses or operations are performed on constantly updating streaming data within a window.

There are two common time-based windows used both in Spark Streaming and Flink that we will detail in this post: tumbling and sliding windows. A tumbling window is a time-based window that is a fixed size and doesn’t have any overlapping intervals. A sliding window is a time-based window that is a fixed size and moves forward in fixed intervals that can be overlapping.

Spark Streaming tumbling window query

The following is a Spark Streaming tumbling window query with a window size of 10 minutes:

stock_ticker = ...  # Streaming DataFrame w/ schema shown above

# Get max stock price in tumbling window
# of size 10 minutes
visitsByWindowAndUser = visits
   .withWatermark("timestamp", "3 minutes")
   .groupBy(
      window(stock_ticker.timestamp, "10 minutes"),
      stock_ticker.symbol)
   .max(stock_ticker.price)

Flink Streaming tumbling window query

The following is an equivalent tumbling window query in Flink with a window size of 10 minutes:

SELECT symbol, MAX(price)
  FROM TABLE(
    TUMBLE(TABLE stock_ticker, DESCRIPTOR(timestamp), INTERVAL '10' MINUTES))
  GROUP BY ticker;

Spark Streaming sliding window query

The following is a Spark Streaming sliding window query with a window size of 10 minutes and slide interval of 5 minutes:

stock_ticker = ...  # Streaming DataFrame w/ schema shown above

# Get max stock price in sliding window
# of size 10 minutes and slide interval of size
# 5 minutes

visitsByWindowAndUser = visits
   .withWatermark("timestamp", "3 minutes")
   .groupBy(
      window(stock_ticker.timestamp, "10 minutes", "5 minutes"),
      stock_ticker.symbol)
   .max(stock_ticker.price)

Flink Streaming sliding window query

The following is a Flink sliding window query with a window size of 10 minutes and slide interval of 5 minutes:

SELECT symbol, MAX(price)
  FROM TABLE(
    HOP(TABLE stock_ticker, DESCRIPTOR(timestamp), INTERVAL '5' MINUTES, INTERVAL '10' MINUTES))
  GROUP BY ticker;

Handling late data

Both Spark Structured Streaming and Flink support event time processing, where a field within the payload can be used for defining time windows as distinct from the wall clock time of the machines doing the processing. Both Flink and Spark use watermarking for this purpose.

Watermarking is used in stream processing engines to handle delays. A watermark is like a timer that sets how long the system can wait for late events. If an event arrives and is within the set time (watermark), the system will use it to update a request. If it’s later than the watermark, the system will ignore it.

In the preceding windowing queries, you specify the lateness threshold in Spark using the following code:

.withWatermark("timestamp", "3 minutes")

This means that any records that are 3 minutes late as tracked by the event time clock will be discarded.

In contrast, with the Flink Table API, you can specify an analogous lateness threshold directly in the DDL:

WATERMARK FOR timestamp AS timestamp - INTERVAL '3' MINUTE

Note that Flink provides additional constructs for specifying lateness across its various APIs.

Data enrichment

In this section, we compare data enrichment methods with Spark and Flink.

Calling an external API

Calling external APIs from user-defined functions (UDFs) is similar in Spark and Flink. Note that your UDF will be called for every record processed, which can result in the API getting called at a very high request rate. In addition, in production scenarios, your UDF code often gets run in parallel across multiple nodes, further amplifying the request rate.

For the following code snippets, let’s assume that the external API call entails calling the function:

response = my_external_api(request)

External API call in Spark UDF

The following code uses Spark:

class Predict(ScalarFunction):
def open(self, function_context):

with open("resources.zip/resources/model.pkl", "rb") as f:
self.model = pickle.load(f)

def eval(self, x):
return self.model.predict(x)

External API call in Flink UDF

For Flink, assume we define the UDF callExternalAPIUDF, which takes as input the ticker symbol symbol and returns enriched information about the symbol via a REST endpoint. We can then register and call the UDF as follows:

callExternalAPIUDF = udf(callExternalAPIUDF(), result_type=DataTypes.STRING())

SELECT
    symbol, 
    callExternalAPIUDF(symbol) as enriched_symbol
FROM stock_ticker;

Flink UDFs provide an initialization method that gets run one time (as opposed to one time per record processed).

Note that you should use UDFs judiciously as an improperly implemented UDF can cause your job to slow down, cause backpressure, and eventually stall your stream processing application. It’s advisable to use UDFs asynchronously to maintain high throughput, especially for I/O-bound use cases or when dealing with external resources like databases or REST APIs. To learn more about how you can use asynchronous I/O with Apache Flink, refer to Enrich your data stream asynchronously using Amazon Kinesis Data Analytics for Apache Flink.

Conclusion

Apache Flink and Apache Spark are both rapidly evolving projects and provide a fast and efficient way to process big data. This post focused on the top use cases we commonly encountered when customers wanted to see parallels between the two technologies for building real-time stream processing applications. We’ve included samples that were most frequently requested at the time of this writing. Let us know if you’d like more examples in the comments section.

About the author

Deepthi Mohan is a Principal Product Manager on the Amazon Kinesis Data Analytics team.

Karthi Thyagarajan was a Principal Solutions Architect on the Amazon Kinesis team.

Near-real-time analytics using Amazon Redshift streaming ingestion with Amazon Kinesis Data Streams and Amazon DynamoDB

2023-07-27 Poulomi Dasgupta

Post Syndicated from Poulomi Dasgupta original https://aws.amazon.com/blogs/big-data/near-real-time-analytics-using-amazon-redshift-streaming-ingestion-with-amazon-kinesis-data-streams-and-amazon-dynamodb/

Amazon Redshift is a fully managed, scalable cloud data warehouse that accelerates your time to insights with fast, easy, and secure analytics at scale. Tens of thousands of customers rely on Amazon Redshift to analyze exabytes of data and run complex analytical queries, making it the widely used cloud data warehouse. You can run and scale analytics in seconds on all your data without having to manage your data warehouse infrastructure.

You can use the Amazon Redshift streaming ingestion capability to update your analytics databases in near-real time. Amazon Redshift streaming ingestion simplifies data pipelines by letting you create materialized views directly on top of data streams. With this capability in Amazon Redshift, you can use SQL (Structured Query Language) to connect to and directly ingest data from data streams, such as Amazon Kinesis Data Streams or Amazon Managed Streaming for Apache Kafka (Amazon MSK) data streams, and pull data directly to Amazon Redshift.

In this post, we discuss a solution that uses Amazon Redshift streaming ingestion to provide near-real-time analytics.

Overview of solution

We walk through an example pipeline to ingest data from an Amazon DynamoDB source table in near-real time using Kinesis Data Streams in combination with Amazon Redshift streaming ingestion. We also walk through using PartiQL in Amazon Redshift to unnest nested JSON documents and build fact and dimension tables that are used in your data warehouse refresh. The solution uses Kinesis Data Streams to capture item-level changes from an application DynamoDB table.

As shown in the following reference architecture, DynamoDB table data changes are streamed into Amazon Redshift through Kinesis Data Streams and Amazon Redshift streaming ingestion for near-real-time analytics dashboard visualization using Amazon QuickSight.

The process flow includes the following steps:

Create a Kinesis data stream and turn on the data stream from DynamoDB to capture item-level changes in your DynamoDB table.
Create a streaming materialized view in your Amazon Redshift cluster to consume live streaming data from the data stream.
The streaming data gets ingested into a JSON payload. Use a combination of a PartiQL statement and dot notation to unnest the JSON document into data columns of a staging table in Amazon Redshift.
Create fact and dimension tables in the Amazon Redshift cluster and keep loading the latest data at regular intervals from the staging table using transformation logic.
Establish connectivity between a QuickSight dashboard and Amazon Redshift to deliver visualization and insights.

Prerequisites

You must have the following:

An AWS account.
An Amazon Redshift cluster if you are using Amazon Redshift Provisioned. For instructions, refer to Create a sample Amazon Redshift cluster.
An Amazon Redshift workgroup if you are using Amazon Redshift Serverless. For instructions, refer to Create a workgroup with a namespace.
An existing DynamoDB table with an active workload.

Set up a Kinesis data stream

To configure your Kinesis data stream, complete the following steps:

Create a Kinesis data stream called demo-data-stream. For instructions, refer to Step 1 in Set up streaming ETL pipelines.

Configure the stream to capture changes from the DynamoDB table.

On the DynamoDB console, choose Tables in the navigation pane.
Open your table.
On the Exports and streams tab, choose Turn on under Amazon Kinesis data stream details.

For Destination Kinesis data stream, choose demo-data-stream.
Choose Turn on stream.

Item-level changes in the DynamoDB table should now be flowing to the Kinesis data stream.

To verify if the data is entering the stream, on the Kinesis Data Streams console, open demo-data-stream.
On the Monitoring tab, find the PutRecord success – average (Percent) and PutRecord – sum (Bytes) metrics to validate record ingestion.

Set up streaming ingestion

To set up streaming ingestion, complete the following steps:

Set up the AWS Identity and Access Management (IAM) role and trust policy required for streaming ingestion. For instructions, refer to Steps 1 and 2 in Getting started with streaming ingestion from Amazon Kinesis Data Streams.
Launch the Query Editor v2 from the Amazon Redshift console or use your preferred SQL client to connect to your Amazon Redshift cluster for the next steps.
Create an external schema:

CREATE EXTERNAL SCHEMA demo_schema
FROM KINESIS
IAM_ROLE { default | 'iam-role-arn' };

To use case-sensitive identifiers, set enable_case_sensitive_identifier to true at either the session or cluster level.
Create a materialized view to consume the stream data and store stream records in semi-structured SUPER format:

CREATE MATERIALIZED VIEW demo_stream_vw AS
    SELECT approximate_arrival_timestamp,
    partition_key,
    shard_id,
    sequence_number,
    json_parse(kinesis_data) as payload    
    FROM demo_schema."demo-data-stream";

Refresh the view, which triggers Amazon Redshift to read from the stream and load data into the materialized view:

REFRESH MATERIALIZED VIEW demo_stream_vw;

You can also set your streaming materialized view to use auto refresh capabilities. This will automatically refresh your materialized view as data arrives in the stream. See CREATE MATERIALIZED VIEW for instructions on how to create a materialized view with auto refresh.

Unnest the JSON document

The following is a sample of a JSON document that was ingested from the Kinesis data stream to the payload column of the streaming materialized view demo_stream_vw:

{
  "awsRegion": "us-east-1",
  "eventID": "6d24680a-6d12-49e2-8a6b-86ffdc7306c1",
  "eventName": "INSERT",
  "userIdentity": null,
  "recordFormat": "application/json",
  "tableName": "sample-dynamoDB",
  "dynamodb": {
    "ApproximateCreationDateTime": 1657294923614,
    "Keys": {
      "pk": {
        "S": "CUSTOMER#CUST_123"
      },
      "sk": {
        "S": "TRANSACTION#2022-07-08T23:59:59Z#CUST_345"
      }
    },
    "NewImage": {
      "completionDateTime": {
        "S": "2022-07-08T23:59:59Z"
      },
      "OutofPockPercent": {
        "N": 50.00
      },
      "calculationRequirements": {
        "M": {
          "dependentIds": {
            "L": [
              {
                "M": {
                  "sk": {
                    "S": "CUSTOMER#2022-07-08T23:59:59Z#CUST_567"
                  },
                  "pk": {
                    "S": "CUSTOMER#CUST_123"
                  }
                }
              },
              {
                "M": {
                  "sk": {
                    "S": "CUSTOMER#2022-07-08T23:59:59Z#CUST_890"
                  },
                  "pk": {
                    "S": "CUSTOMER#CUST_123"
                  }
                }
              }
            ]
          }
        }
      },
      "Event": {
        "S": "SAMPLE"
      },
      "Provider": {
        "S": "PV-123"
      },
      "OutofPockAmount": {
        "N": 1000
      },
      "lastCalculationDateTime": {
        "S": "2022-07-08T00:00:00Z"
      },
      "sk": {
        "S": "CUSTOMER#2022-07-08T23:59:59Z#CUST_567"
      },
      "OutofPockMax": {
        "N": 2000
      },
      "pk": {
        "S": "CUSTOMER#CUST_123"
      }
    },
    "SizeBytes": 694
  },
  "eventSource": "aws:dynamodb"
}

We can use dot notation to unnest the JSON document. But in addition to that, we should use a PartiQL statement to handle arrays if applicable. For example, in the preceding JSON document, there is an array under the element:

"dynamodb"."NewImage"."calculationRequirements"."M"."dependentIds"."L".

The following SQL query uses a combination of dot notation and a PartiQL statement to unnest the JSON document:

select 
substring(a."payload"."dynamodb"."Keys"."pk"."S"::varchar, position('#' in "payload"."dynamodb"."Keys"."pk"."S"::varchar)+1) as Customer_ID,
substring(a."payload"."dynamodb"."Keys"."sk"."S"::varchar, position('#TRANSACTION' in "payload"."dynamodb"."Keys"."sk"."S"::varchar)+1) as Transaction_ID,
substring(b."M"."sk"."S"::varchar, position('#CUSTOMER' in b."M"."sk"."S"::varchar)+1) Dependent_ID,
a."payload"."dynamodb"."NewImage"."OutofPockMax"."N"::int as OutofPocket_Max,
a."payload"."dynamodb"."NewImage"."OutofPockPercent"."N"::decimal(5,2) as OutofPocket_Percent,
a."payload"."dynamodb"."NewImage"."OutofPockAmount"."N"::int as OutofPock_Amount,
a."payload"."dynamodb"."NewImage"."Provider"."S"::varchar as Provider,
a."payload"."dynamodb"."NewImage"."completionDateTime"."S"::timestamptz as Completion_DateTime,
a."payload"."eventName"::varchar Event_Name,
a.approximate_arrival_timestamp
from demo_stream_vw a
left outer join a."payload"."dynamodb"."NewImage"."calculationRequirements"."M"."dependentIds"."L" b on true;

The query unnests the JSON document to the following result set.

Precompute the result set using a materialized view

Optionally, to precompute and store the unnested result set from the preceding query, you can create a materialized view and schedule it to refresh at regular intervals. In this post, we maintain the preceding unnested data in a materialized view called mv_demo_super_unnest, which will be refreshed at regular intervals and used for further processing.

To capture the latest data from the DynamoDB table, the Amazon Redshift streaming materialized view needs to be refreshed at regular intervals, and then the incremental data should be transformed and loaded into the final fact and dimension table. To avoid reprocessing the same data, a metadata table can be maintained at Amazon Redshift to keep track of each ELT process with status, start time, and end time, as explained in the following section.

Maintain an audit table in Amazon Redshift

The following is a sample DDL of a metadata table that is maintained for each process or job:

create table MetaData_ETL
(
JobName varchar(100),
StartDate timestamp, 
EndDate timestamp, 
Status varchar(50)
);

The following is a sample initial entry of the metadata audit table that can be maintained at job level. The insert statement is the initial entry for the ELT process to load the Customer_Transaction_Fact table:

insert into MetaData_ETL 
values
('Customer_Transaction_Fact_Load', current_timestamp, current_timestamp,'Ready' );

Build a fact table with the latest data

In this post, we demonstrate the loading of a fact table using specific transformation logic. We are skipping the dimension table load, which uses similar logic.

As a prerequisite, create the fact and dimension tables in a preferred schema. In following example, we create the fact table Customer_Transaction_Fact in Amazon Redshift:

CREATE TABLE public.Customer_Transaction_Fact (
Transaction_ID character varying(500),
Customer_ID character varying(500),
OutofPocket_Percent numeric(5,2),
OutofPock_Amount integer,
OutofPocket_Max integer,
Provider character varying(500),
completion_datetime timestamp
);

Transform data using a stored procedure

We load this fact table from the unnested data using a stored procedure. For more information, refer to Creating stored procedures in Amazon Redshift.

Note that in this sample use case, we are using transformation logic to identify and load the latest value of each column for a customer transaction.

The stored procedure contains the following components:

In the first step of the stored procedure, the job entry in the MetaData_ETL table is updated to change the status to Running and StartDate as the current timestamp, which indicates that the fact load process is starting.
Refresh the materialized view mv_demo_super_unnest, which contains the unnested data.
In the following example, we load the fact table Customer_Transaction_Fact using the latest data from the streaming materialized view based on the column approximate_arrival_timestamp, which is available as a system column in the streaming materialized view. The value of approximate_arrival_timestamp is set when a Kinesis data stream successfully receives and stores a record.
The following logic in the stored procedure checks if the approximate_arrival_timestamp in mv_demo_super_unnest is greater than the EndDate timestamp in the MetaData_ETL audit table, so that it can only process the incremental data.
Additionally, while loading the fact table, we identify the latest non-null value of each column for every Transaction_ID depending on the order of the approximate_arrival_timestamp column using the rank and min
The transformed data is loaded into the intermediate staging table
The impacted records with the same Transaction_ID values are deleted and reloaded into the Customer_Transaction_Fact table from the staging table
In the last step of the stored procedure, the job entry in the MetaData_ETL table is updated to change the status to Complete and EndDate as the current timestamp, which indicates that the fact load process has completed successfully.

See the following code:

CREATE OR REPLACE PROCEDURE SP_Customer_Transaction_Fact()
AS $$
BEGIN

set enable_case_sensitive_identifier to true;

--Update metadata audit table entry to indicate that the fact load process is running
update MetaData_ETL
set status = 'Running',
StartDate = getdate()
where JobName = 'Customer_Transaction_Fact_Load';

refresh materialized view mv_demo_super_unnest;

drop table if exists Customer_Transaction_Fact_Stg;

--Create latest record by Merging records based on approximate_arrival_timestamp
create table Customer_Transaction_Fact_Stg as
select 
m.Transaction_ID,
min(case when m.rank_Customer_ID =1 then m.Customer_ID end) Customer_ID,
min(case when m.rank_OutofPocket_Percent =1 then m.OutofPocket_Percent end) OutofPocket_Percent,
min(case when m.rank_OutofPock_Amount =1 then m.OutofPock_Amount end) OutofPock_Amount,
min(case when m.rank_OutofPocket_Max =1 then m.OutofPocket_Max end) OutofPocket_Max,
min(case when m.rank_Provider =1 then m.Provider end) Provider,
min(case when m.rank_Completion_DateTime =1 then m.Completion_DateTime end) Completion_DateTime
from
(
select *,
rank() over(partition by Transaction_ID order by case when mqp.Customer_ID is not null then 1 end, approximate_arrival_timestamp desc ) rank_Customer_ID,
rank() over(partition by Transaction_ID order by case when mqp.OutofPocket_Percent is not null then 1 end, approximate_arrival_timestamp desc ) rank_OutofPocket_Percent,
rank() over(partition by Transaction_ID order by case when mqp.OutofPock_Amount is not null then 1 end, approximate_arrival_timestamp  desc )  rank_OutofPock_Amount,
rank() over(partition by Transaction_ID order by case when mqp.OutofPocket_Max is not null then 1 end, approximate_arrival_timestamp desc ) rank_OutofPocket_Max,
rank() over(partition by Transaction_ID order by case when mqp.Provider is not null then 1 end, approximate_arrival_timestamp  desc ) rank_Provider,
rank() over(partition by Transaction_ID order by case when mqp.Completion_DateTime is not null then 1 end, approximate_arrival_timestamp desc )  rank_Completion_DateTime
from mv_demo_super_unnest mqp
where upper(mqp.event_Name) <> 'REMOVE' and mqp.approximate_arrival_timestamp > (select mde.EndDate from MetaData_ETL mde where mde.JobName = 'Customer_Transaction_Fact_Load') 
) m
group by m.Transaction_ID 
order by m.Transaction_ID
;

--Delete only impacted Transaction_ID from Fact table
delete from Customer_Transaction_Fact  
where Transaction_ID in ( select mqp.Transaction_ID from Customer_Transaction_Fact_Stg mqp);

--Insert latest records from staging table to Fact table
insert into Customer_Transaction_Fact
select * from Customer_Transaction_Fact_Stg; 

--Update metadata audit table entry to indicate that the fact load process is completed
update MetaData_ETL
set status = 'Complete',
EndDate = getdate()
where JobName = 'Customer_Transaction_Fact_Load';
END;
$$ LANGUAGE plpgsql;

Additional considerations for implementation

There are several additional capabilities that you could utilize to modify this solution to meet your needs. Many customers utilize multiple AWS accounts, and it’s common that the Kinesis data stream may be in a different AWS account than the Amazon Redshift data warehouse. If this is the case, you can utilize an Amazon Redshift IAM role that assumes a role in the Kinesis data stream AWS account in order to read from the data stream. For more information, refer to Cross-account streaming ingestion for Amazon Redshift.

Another common use case is that you need to schedule the refresh of your Amazon Redshift data warehouse jobs so that the data warehouse’s data is continuously updated. To do this, you can utilize Amazon EventBridge to schedule the jobs in your data warehouse to run on a regular basis. For more information, refer to Creating an Amazon EventBridge rule that runs on a schedule. Another option is to use Amazon Redshift Query Editor v2 to schedule the refresh. For details, refer to Scheduling a query with query editor v2.

If you have a requirement to load data from a DynamoDB table with existing data, refer to Loading data from DynamoDB into Amazon Redshift.

For more information on Amazon Redshift streaming ingestion capabilities, refer to Real-time analytics with Amazon Redshift streaming ingestion.

Clean up

To avoid unnecessary charges, clean up any resources that you built as part of this architecture that are no longer in use. This includes dropping the materialized view, stored procedure, external schema, and tables created as part of this post. Additionally, make sure you delete the DynamoDB table and delete the Kinesis data stream.

Conclusion

After following the solution in this post, you’re now able to build near-real-time analytics using Amazon Redshift streaming ingestion. We showed how you can ingest data from a DynamoDB source table using a Kinesis data stream in order to refresh your Amazon Redshift data warehouse. With the capabilities presented in this post, you should be able to increase the refresh rate of your Amazon Redshift data warehouse in order to provide the most up-to-date data in your data warehouse for your use case.

About the authors

Poulomi Dasgupta is a Senior Analytics Solutions Architect with AWS. She is passionate about helping customers build cloud-based analytics solutions to solve their business problems. Outside of work, she likes travelling and spending time with her family.

Matt Nispel is an Enterprise Solutions Architect at AWS. He has more than 10 years of experience building cloud architectures for large enterprise companies. At AWS, Matt helps customers rearchitect their applications to take full advantage of the cloud. Matt lives in Minneapolis, Minnesota, and in his free time enjoys spending time with friends and family.

Dan Dressel is a Senior Analytics Specialist Solutions Architect at AWS. He is passionate about databases, analytics, machine learning, and architecting solutions. In his spare time, he enjoys spending time with family, nature walking, and playing foosball.

Best practices for implementing event-driven architectures in your organization

2023-07-24 Emanuele Levi

Post Syndicated from Emanuele Levi original https://aws.amazon.com/blogs/architecture/best-practices-for-implementing-event-driven-architectures-in-your-organization/

Event-driven architectures (EDA) are made up of components that detect business actions and changes in state, and encode this information in event notifications. Event-driven patterns are becoming more widespread in modern architectures because:

they are the main invocation mechanism in serverless patterns.
they are the preferred pattern for decoupling microservices, where asynchronous communications and event persistence are paramount.
they are widely adopted as a loose-coupling mechanism between systems in different business domains, such as third-party or on-premises systems.

Event-driven patterns have the advantage of enabling team independence through the decoupling and decentralization of responsibilities. This decentralization trend in turn, permits companies to move with unprecedented agility, enhancing feature development velocity.

In this blog, we’ll explore the crucial components and architectural decisions you should consider when adopting event-driven patterns, and provide some guidance on organizational structures.

Division of responsibilities

The communications flow in EDA (see What is EDA?) is initiated by the occurrence of an event. Most production-grade event-driven implementations have three main components, as shown in Figure 1: producers, message brokers, and consumers.

Figure 1. Three main components of an event-driven architecture

Producers, message brokers, and consumers typically assume the following roles:

Producers

Producers are responsible for publishing the events as they happen. They are the owners of the event schema (data structure) and semantics (meaning of the fields, such as the meaning of the value of an enum field). As this is the only contract (coupling) between producers and the downstream components of the system, the schema and its semantics are crucial in EDA. Producers are responsible for implementing a change management process, which involves both non-breaking and breaking changes. With introduction of breaking changes, consumers are able to negotiate the migration process with producers.

Producers are “consumer agnostic”, as their boundary of responsibility ends when an event is published.

Message brokers

Message brokers are responsible for the durability of the events, and will keep an event available for consumption until it is successfully processed. Message brokers ensure that producers are able to publish events for consumers to consume, and they regulate access and permissions to publish and consume messages.

Message brokers are largely “events agnostic”, and do not generally access or interpret the event content. However, some systems provide a routing mechanism based on the event payload or metadata.

Consumers

Consumers are responsible for consuming events, and own the semantics of the effect of events. Consumers are usually bounded to one business context. This means the same event will have different effect semantics for different consumers. Crucial architectural choices when implementing a consumer involve the handling of unsuccessful message deliveries or duplicate messages. Depending on the business interpretation of the event, when recovering from failure a consumer might permit duplicate events, such as with an idempotent consumer pattern.

Crucially, consumers are “producer agnostic”, and their boundary of responsibility begins when an event is ready for consumption. This allows new consumers to onboard into the system without changing the producer contracts.

Team independence

In order to enforce the division of responsibilities, companies should organize their technical teams by ownership of producers, message brokers, and consumers. Although the ownership of producers and consumers is straightforward in an EDA implementation, the ownership of the message broker may not be. Different approaches can be taken to identify message broker ownership depending on your organizational structure.

Decentralized ownership

Figure 2. Ownership of the message broker in a decentralized ownership organizational structure

In a decentralized ownership organizational structure (see Figure 2), the teams producing events are responsible for managing their own message brokers and the durability and availability of the events for consumers.

The adoption of topic fanout patterns based on Amazon Simple Queue Service (SQS) and Amazon Simple Notification Service (SNS) (see Figure 3), can help companies implement a decentralized ownership pattern. A bus-based pattern using Amazon EventBridge can also be similarly utilized (see Figure 4).

Figure 3. Topic fanout pattern based on Amazon SQS and Amazon SNS

Figure 4. Events bus pattern based on Amazon EventBridge

The decentralized ownership approach has the advantage of promoting team independence, but it is not a fit for every organization. In order to be implemented effectively, a well-established DevOps culture is necessary. In this scenario, the producing teams are responsible for managing the message broker infrastructure and the non-functional requirements standards.

Centralized ownership

Figure 5. Ownership of the message broker in a centralized ownership organizational structure

In a centralized ownership organizational structure, a central team (we’ll call it the platform team) is responsible for the management of the message broker (see Figure 5). Having a specialized platform team offers the advantage of standardized implementation of non-functional requirements, such as reliability, availability, and security. One disadvantage is that the platform team is a single point of failure in both the development and deployment lifecycle. This could become a bottleneck and put team independence and operational efficiency at risk.

Figure 6. Streaming pattern based on Amazon MSK and Kinesis Data Streams

On top of the implementation patterns mentioned in the previous section, the presence of a dedicated team makes it easier to implement streaming patterns. In this case, a deeper understanding on how the data is partitioned and how the system scales is required. Streaming patterns can be implemented using services such as Amazon Managed Streaming for Apache Kafka (MSK) or Amazon Kinesis Data Streams (see Figure 6).

Best practices for implementing event-driven architectures in your organization

The centralized and decentralized ownership organizational structures enhance team independence or standardization of non-functional requirements respectively. However, they introduce possible limits to the growth of the engineering function in a company. Inspired by the two approaches, you can implement a set of best practices which are aimed at minimizing those limitations.

Figure 7. Best practices for implementing event-driven architectures

Introduce a cloud center of excellence (CCoE). A CCoE standardizes non-functional implementation across engineering teams. In order to promote a strong DevOps culture, the CCoE should not take the form of an external independent team, but rather be a collection of individual members representing the various engineering teams.
Decentralize team ownership. Decentralize ownership and maintenance of the message broker to producing teams. This will maximize team independence and agility. It empowers the team to use the right tool for the right job, as long as they conform to the CCoE guidelines.
Centralize logging standards and observability strategies. Although it is a best practice to decentralize team ownership of the components of an event-driven architecture, logging standards and observability strategies should be centralized and standardized across the engineering function. This centralization provides for end-to-end tracing of requests and events, which are powerful diagnosis tools in case of any failure.

Conclusion

In this post, we have described the main architectural components of an event-driven architecture, and identified the ownership of the message broker as one of the most important architectural choices you can make. We have described a centralized and decentralized organizational approach, presenting the strengths of the two approaches, as well as the limits they impose on the growth of your engineering organization. We have provided some best practices you can implement in your organization to minimize these limitations.

Further reading:
To start your journey building event-driven architectures in AWS, explore the following:

For bus-based and topic-based patterns, see What is an Event-Driven Architecture?
For streaming patterns, see What is Apache Kafka?, and Amazon Kinesis.

Temporal data lake architecture for benchmark and indices analytics

2023-07-21 Krishna Gogineni

Post Syndicated from Krishna Gogineni original https://aws.amazon.com/blogs/architecture/temporal-data-lake-architecture-for-benchmark-and-indices-analytics/

Financial trading houses and stock exchanges generate enormous volumes of data in near real-time, making it difficult to perform bi-temporal calculations that yield accurate results. Achieving this requires a processing architecture that can handle large volumes of data during peak bursts, meet strict latency requirements, and scale according to incoming volumes.

In this post, we’ll describe a scenario for an industry leader in the financial services sector and explain how AWS services are used for bi-temporal processing with state management and scale based on variable workloads during the day, all while meeting strict service-level agreement (SLA) requirements.

Problem statement

To design and implement a fully temporal transactional data lake with the repeatable read isolation level for queries is a challenge, particularly with burst events that need the overall architecture to scale accordingly. The data store in the overall architecture needs to record the value history of data at different times, which is especially important for financial data. Financial data can include corporate actions, annual or quarterly reports, or fixed-income securities, like bonds that have variable rates. It’s crucial to be able to correct data inaccuracies during the reporting period.

The example customer seeks a data processing platform architecture to dynamically scale based on the workloads with a capacity of processing 150 million records under 5 minutes. Their platform should be capable of meeting the end-to-end SLA of 15 minutes, from ingestion to reporting, with lowest total cost of ownership. Additionally, managing bi-temporal data requires a database that has critical features, such as ACID (atomicity, consistency, isolation, durability) compliance, time-travel capability, full-schema evolution, partition layout and evolution, rollback to prior versions, and SQL-like query experience.

Solution overview

The solution architecture key building blocks are Amazon Kinesis Data Streams for streaming data, Amazon Kinesis Data Analytics with Apache Flink as processing engine, Flink’s RocksDB for state management, and Apache Iceberg on Amazon Simple Storage Service (Amazon S3) as the storage engine (Figure 1).

Figure 1. End-to-end data-processing architecture

Data processing

Here’s how it works:

A publisher application receives the data from the source systems and publishes data into Kinesis Data Streams using a well-defined JSON format structure.
Kinesis Data Streams holds the data for a duration that is configurable so data is not lost and can auto scale based on the data volume ingested.
Kinesis Data Analytics runs an Apache Flink application, with state management (RocksDB), to handle bi-temporal calculations. The Apache Flink application consumes data from Kinesis Data Streams and performs the following computations:
- Transforms the JSON stream into a row-type record, compatible with a SQL table-like structure, resolving nesting and parent–child relationships present within the stream
- Checks whether the record has already an existing state in in-memory RocksDB or disk attached to Kinesis Data Analytics computational node to avoid read latency from the database, which is critical for meeting the performance requirements
- Performs bi-temporal calculations and creates the resultant records in an in-memory data structure before invoking the Apache Iceberg sink operator
- The Apache Flink application sink operator appends the temporal states, expressed as records into existing Apache Iceberg data store. This will comply with key principles of time series data, which is immutable, and the ability to time-travel along with ACID compliance, schema evolution, and partition evolution
Kinesis Data Analytics is resilient and provides a no-data-loss capability, with features like periodic checkpoints and savepoints. They are used to store the state management in a secure Amazon S3 location that can be accessed outside of Kinesis Data Analytics. This savepoints mechanism can be used to programmatically to scale the cluster size based on the workloads using time-driven scheduling and AWS Lambda functions.
If the time-to-live feature of RocksDB is implemented, old records are stored in Apache Iceberg on Amazon S3. When performing temporal calculations, if the state is not found in memory, data is read from Apache Iceberg into RocksDB and the processing is completed. However, this step is optional and can be circumvented if the Kinesis Data Analytics cluster is initialized with right number of Kinesis processing units to hold the historical information, as per requirements.
Because the data is stored in an Apache Iceberg table format in Amazon S3, data is queried using Trino, which supports Apache Iceberg table format.
The end user queries data using any SQL tool that supports the Trino query engine.

Apache Iceberg maintenance jobs, such as data compaction, expire snapshot, delete orphan files, can be launched using Amazon Athena to optimize performance out of Apache Iceberg data store. Details of each processing step performed in Apache Flink application are captured using Amazon CloudWatch, which logs all the events.

Scalability

Amazon EventBridge scheduler invokes a Lambda function to scale the Kinesis Data Analytics. Kinesis Data Analytics has a short outage during rescaling that is proportional to the amount of data stored in RocksDB, which is why a state management strategy is necessary for the proper operation of the system.

Figure 2 shows the scaling process, which depicts:

Before peak load: The Kinesis Data Analytics cluster is processing off-peak records with minimum configuration before the peak load. A scheduled event is launched from EventBridge that invokes a Lambda function, which shuts down the cluster using the savepoint mechanism and scales up the Kinesis Data Analytics cluster to required Kinesis processing units.
During peak load: When the peak data burst happens, the Kinesis Data Analytics cluster is ready to handle the volume of data from Kinesis Data Stream, and processes it within the SLA of 5 minutes.
After peak load: A scheduled event from EventBridge invokes a Lambda function to scale down the Kinesis Data Analytics cluster to the minimum configuration that holds the required state for the entire volume of records.

Figure 2. Cluster scaling before, during, and after peak data volume processing

Performance insights

With the discussed architecture, we want to demonstrate that the we are able to meet the SLAs, in terms of performance and processing times. We have taken a subset of benchmarks and indices data and processed the same with the end-to-end architecture. During the process, we observed some very interesting findings, which we would like to share.

Processing time for Apache Iceberg Upsert vs Append operations: During our tests, we expected Upsert operation to be faster than append. But on the contrary, we noticed that Append operations were faster compared to Upsert even though more computations are performed in the Apache Flink application. In our test with 3,500,000 records, Append operation took 1556 seconds while Upsert took 1675 seconds to process the data (Figure 3).

Figure 3. Processing times for Upsert vs. Append

Compute consumption for Apache Iceberg Upsert vs. Append operations: Comparing the compute consumption for 10,000,000 records, we noticed that Append operation was able to process the data in the same amount of time as Upsert operation but with less compute resources. In our tests, we have noted that Append operation only consumed 64 Kinesis processing units, whereas Upsert consumed 78 Kinesis processing units (Figure 4).

Figure 4. Comparing consumption for Upsert vs. Append

Scalability vs performance: To achieve the desired data processing performance, we need a specific configuration of Kinesis processing units, Kinesis Data Streams, and Iceberg parallelism. In our test with the data that we chose, we started with four Kinesis processing units and four Kinesis data streams for data processing. We observed an 80% performance improvement in data processing with 16 Kinesis data processing units. An additional 6% performance improvement was demonstrated when we scaled to 32 Kinesis processing units. When we increased the Kinesis data streams to 16, we observed an additional 2% performance improvement (Figure 5).

Figure 5. Scalability vs. performance

Data volume processing times for Upsert vs. Append: For this test, we started with 350,000 records of data. When we increased data volume to 3.5M records, we observed that Append performing better than Upsert, demonstrating a five-fold increase in processing time (Figure 6).

Figure 6. Data volume processing times for Upsert vs. Append

Conclusion

The architecture we explored today scales based on the data-volume requirements of the customer and is capable of meeting the end-to-end SLA of 15 minutes, with a potential lowered total cost of ownership. Additionally, the solution is capable of handling high-volume, bi-temporal computations with ACID compliance, time travel, full-schema evolution, partition layout evolution, rollback to prior versions and SQL-like query experience.

Migrate from Amazon Kinesis Data Analytics for SQL Applications to Amazon Kinesis Data Analytics Studio

2023-06-29 Nicholas Tunney

Post Syndicated from Nicholas Tunney original https://aws.amazon.com/blogs/big-data/migrate-from-amazon-kinesis-data-analytics-for-sql-applications-to-amazon-kinesis-data-analytics-studio/

Amazon Kinesis Data Analytics makes it easy to transform and analyze streaming data in real time.

In this post, we discuss why AWS recommends moving from Kinesis Data Analytics for SQL Applications to Amazon Kinesis Data Analytics for Apache Flink to take advantage of Apache Flink’s advanced streaming capabilities. We also show how to use Kinesis Data Analytics Studio to test and tune your analysis before deploying your migrated applications. If you don’t have any Kinesis Data Analytics for SQL applications, this post still provides a background on many of the use cases you’ll see in your data analytics career and how Amazon Data Analytics services can help you achieve your objectives.

Kinesis Data Analytics for Apache Flink is a fully managed Apache Flink service. You only need to upload your application JAR or executable, and AWS will manage the infrastructure and Flink job orchestration. To make things simpler, Kinesis Data Analytics Studio is a notebook environment that uses Apache Flink and allows you to query data streams and develop SQL queries or proof of concept workloads before scaling your application to production in minutes.

We recommend that you use Kinesis Data Analytics for Apache Flink or Kinesis Data Analytics Studio over Kinesis Data Analytics for SQL. This is because Kinesis Data Analytics for Apache Flink and Kinesis Data Analytics Studio offer advanced data stream processing features, including exactly-once processing semantics, event time windows, extensibility using user-defined functions (UDFs) and custom integrations, imperative language support, durable application state, horizontal scaling, support for multiple data sources, and more. These are critical for ensuring accuracy, completeness, consistency, and reliability of data stream processing and are not available with Kinesis Data Analytics for SQL.

Solution overview

For our use case, we use several AWS services to stream, ingest, transform, and analyze sample automotive sensor data in real time using Kinesis Data Analytics Studio. Kinesis Data Analytics Studio allows us to create a notebook, which is a web-based development environment. With notebooks, you get a simple interactive development experience combined with the advanced capabilities provided by Apache Flink. Kinesis Data Analytics Studio uses Apache Zeppelin as the notebook, and uses Apache Flink as the stream processing engine. Kinesis Data Analytics Studio notebooks seamlessly combine these technologies to make advanced analytics on data streams accessible to developers of all skill sets. Notebooks are provisioned quickly and provide a way for you to instantly view and analyze your streaming data. Apache Zeppelin provides your Studio notebooks with a complete suite of analytics tools, including the following:

Data visualization
Exporting data to files
Controlling the output format for easier analysis
Ability to turn the notebook into a scalable, production application

Unlike Kinesis Data Analytics for SQL Applications, Kinesis Data Analytics for Apache Flink adds the following SQL support:

Joining stream data between multiple Kinesis data streams, or between a Kinesis data stream and an Amazon Managed Streaming for Apache Kafka (Amazon MSK) topic
Real-time visualization of transformed data in a data stream
Using Python scripts or Scala programs within the same application
Changing offsets of the streaming layer

Another benefit of Kinesis Data Analytics for Apache Flink is the improved scalability of the solution once deployed, because you can scale the underlying resources to meet demand. In Kinesis Data Analytics for SQL Applications, scaling is performed by adding more pumps to persuade the application into adding more resources.

In our solution, we create a notebook to access automotive sensor data, enrich the data, and send the enriched output from the Kinesis Data Analytics Studio notebook to an Amazon Kinesis Data Firehose delivery stream for delivery to an Amazon Simple Storage Service (Amazon S3) data lake. This pipeline could further be used to send data to Amazon OpenSearch Service or other targets for additional processing and visualization.

Kinesis Data Analytics for SQL Applications vs. Kinesis Data Analytics for Apache Flink

In our example, we perform the following actions on the streaming data:

Connect to an Amazon Kinesis Data Streams data stream.
View the stream data.
Transform and enrich the data.
Manipulate the data with Python.
Restream the data to a Firehose delivery stream.

To compare Kinesis Data Analytics for SQL Applications with Kinesis Data Analytics for Apache Flink, let’s first discuss how Kinesis Data Analytics for SQL Applications works.

At the root of a Kinesis Data Analytics for SQL application is the concept of an in-application stream. You can think of the in-application stream as a table that holds the streaming data so you can perform actions on it. The in-application stream is mapped to a streaming source such as a Kinesis data stream. To get data into the in-application stream, first set up a source in the management console for your Kinesis Data Analytics for SQL application. Then, create a pump that reads data from the source stream and places it into the table. The pump query runs continuously and feeds the source data into the in-application stream. You can create multiple pumps from multiple sources to feed the in-application stream. Queries are then run on the in-application stream, and results can be interpreted or sent to other destinations for further processing or storage.

The following SQL demonstrates setting up an in-application stream and pump:

CREATE OR REPLACE STREAM "TEMPSTREAM" ( 
   "column1" BIGINT NOT NULL, 
   "column2" INTEGER, 
   "column3" VARCHAR(64));

CREATE OR REPLACE PUMP "SAMPLEPUMP" AS 
INSERT INTO "TEMPSTREAM" ("column1", 
                          "column2", 
                          "column3") 
SELECT STREAM inputcolumn1, 
      inputcolumn2, 
      inputcolumn3
FROM "INPUTSTREAM";

Data can be read from the in-application stream using a SQL SELECT query:

SELECT *
FROM "TEMPSTREAM"

When creating the same setup in Kinesis Data Analytics Studio, you use the underlying Apache Flink environment to connect to the streaming source, and create the data stream in one statement using a connector. The following example shows connecting to the same source we used before, but using Apache Flink:

CREATE TABLE `MY_TABLE` ( 
   "column1" BIGINT NOT NULL, 
   "column2" INTEGER, 
   "column3" VARCHAR(64)
) WITH (
   'connector' = 'kinesis',
   'stream' = sample-kinesis-stream',
   'aws.region' = 'aws-kinesis-region',
   'scan.stream.initpos' = 'LATEST',
   'format' = 'json'
 );

MY_TABLE is now a data stream that will continually receive the data from our sample Kinesis data stream. It can be queried using a SQL SELECT statement:

SELECT column1, 
       column2, 
       column3
FROM MY_TABLE;

Although Kinesis Data Analytics for SQL Applications use a subset of the SQL:2008 standard with extensions to enable operations on streaming data, Apache Flink’s SQL support is based on Apache Calcite, which implements the SQL standard.

It’s also important to mention that Kinesis Data Analytics Studio supports PyFlink and Scala alongside SQL within the same notebook. This allows you to perform complex, programmatic methods on your streaming data that aren’t possible with SQL.

Prerequisites

During this exercise, we set up various AWS resources and perform analytics queries. To follow along, you need an AWS account with administrator access. If you don’t already have an AWS account with administrator access, create one now. The services outlined in this post may incur charges to your AWS account. Make sure to follow the cleanup instructions at the end of this post.

Configure streaming data

In the streaming domain, we’re often tasked with exploring, transforming, and enriching data coming from Internet of Things (IoT) sensors. To generate the real-time sensor data, we employ the AWS IoT Device Simulator. This simulator runs within your AWS account and provides a web interface that lets users launch fleets of virtually connected devices from a user-defined template and then simulate them to publish data at regular intervals to AWS IoT Core. This means we can build a virtual fleet of devices to generate sample data for this exercise.

We deploy the IoT Device Simulator using the following Amazon CloudFront template. It handles creating all the necessary resources in your account.

On the Specify stack details page, assign a name to your solution stack.
Under Parameters, review the parameters for this solution template and modify them as necessary.
For User email, enter a valid email to receive a link and password to log in to the IoT Device Simulator UI.
Choose Next.
On the Configure stack options page, choose Next.
On the Review page, review and confirm the settings. Select the check boxes acknowledging that the template creates AWS Identity and Access Management (IAM) resources.
Choose Create stack.

The stack takes about 10 minutes to install.

When you receive your invitation email, choose the CloudFront link and log in to the IoT Device Simulator using the credentials provided in the email.

The solution contains a prebuilt automotive demo that we can use to begin delivering sensor data quickly to AWS.

On the Device Type page, choose Create Device Type.
Choose Automotive Demo.
The payload is auto populated. Enter a name for your device, and enter automotive-topic as the topic.
Choose Save.

Now we create a simulation.

On the Simulations page, choose Create Simulation.
For Simulation type, choose Automotive Demo.
For Select a device type, choose the demo device you created.
For Data transmission interval and Data transmission duration, enter your desired values.

You can enter any values you like, but use at least 10 devices transmitting every 10 seconds. You’ll want to set your data transmission duration to a few minutes, or you’ll need to restart your simulation several times during the lab.

Choose Save.

Now we can run the simulation.

On the Simulations page, select the desired simulation, and choose Start simulations.

Alternatively, choose View next to the simulation you want to run, then choose Start to run the simulation.

To view the simulation, choose View next to the simulation you want to view.

If the simulation is running, you can view a map with the locations of the devices, and up to 100 of the most recent messages sent to the IoT topic.

We can now check to ensure our simulator is sending the sensor data to AWS IoT Core.

Navigate to the AWS IoT Core console.

Make sure you’re in the same Region you deployed your IoT Device Simulator.

In the navigation pane, choose MQTT Test Client.
Enter the topic filter automotive-topic and choose Subscribe.

As long as you have your simulation running, the messages being sent to the IoT topic will be displayed.

Finally, we can set a rule to route the IoT messages to a Kinesis data stream. This stream will provide our source data for the Kinesis Data Analytics Studio notebook.

On the AWS IoT Core console, choose Message Routing and Rules.
Enter a name for the rule, such as automotive_route_kinesis, then choose Next.
Provide the following SQL statement. This SQL will select all message columns from the automotive-topic the IoT Device Simulator is publishing:

SELECT timestamp, trip_id, VIN, brake, steeringWheelAngle, torqueAtTransmission, engineSpeed, vehicleSpeed, acceleration, parkingBrakeStatus, brakePedalStatus, transmissionGearPosition, gearLeverPosition, odometer, ignitionStatus, fuelLevel, fuelConsumedSinceRestart, oilTemp, location 
FROM 'automotive-topic' WHERE 1=1

Choose Next.
Under Rule Actions, select Kinesis Stream as the source.
Choose Create New Kinesis Stream.

This opens a new window.

For Data stream name, enter automotive-data.

We use a provisioned stream for this exercise.

Choose Create Data Stream.

You may now close this window and return to the AWS IoT Core console.

Choose the refresh button next to Stream name, and choose the automotive-data stream.
Choose Create new role and name the role automotive-role.
Choose Next.
Review the rule properties, and choose Create.

The rule begins routing data immediately.

Set up Kinesis Data Analytics Studio

Now that we have our data streaming through AWS IoT Core and into a Kinesis data stream, we can create our Kinesis Data Analytics Studio notebook.

On the Amazon Kinesis console, choose Analytics applications in the navigation pane.
On the Studio tab, choose Create Studio notebook.
Leave Quick create with sample code selected.
Name the notebook automotive-data-notebook.
Choose Create to create a new AWS Glue database in a new window.
Choose Add database.
Name the database automotive-notebook-glue.
Choose Create.
Return to the Create Studio notebook section.
Choose refresh and choose your new AWS Glue database.
Choose Create Studio notebook.
To start the Studio notebook, choose Run and confirm.
Once the notebook is running, choose the notebook and choose Open in Apache Zeppelin.
Choose Import note.
Choose Add from URL.
Enter the following URL: https://aws-blogs-artifacts-public.s3.amazonaws.com/artifacts/BDB-2461/auto-notebook.ipynb.
Choose Import Note.
Open the new note.

Perform stream analysis

In a Kinesis Data Analytics for SQL application, we add our streaming course via the management console, and then define an in-application stream and pump to stream data from our Kinesis data stream. The in-application stream functions as a table to hold the data and make it available for us to query. The pump takes the data from our source and streams it to our in-application stream. Queries may then be run against the in-application stream using SQL, just as we’d query any SQL table. See the following code:

CREATE OR REPLACE STREAM "AUTOSTREAM" ( 
    `trip_id` CHAR(36),
    `VIN` CHAR(17),
    `brake` FLOAT,
    `steeringWheelAngle` FLOAT,
    `torqueAtTransmission` FLOAT,
    `engineSpeed` FLOAT,
    `vehicleSpeed` FLOAT,
    `acceleration` FLOAT,
    `parkingBrakeStatus` BOOLEAN,
    `brakePedalStatus` BOOLEAN,
    `transmissionGearPosition` VARCHAR(10),
    `gearLeverPosition` VARCHAR(10),
    `odometer` FLOAT,
    `ignitionStatus` VARCHAR(4),
    `fuelLevel` FLOAT,
    `fuelConsumedSinceRestart` FLOAT,
    `oilTemp` FLOAT,
    `location` VARCHAR(100),
    `timestamp` TIMESTAMP(3));

CREATE OR REPLACE PUMP "MYPUMP" AS 
INSERT INTO "AUTOSTREAM" ("trip_id",
    "VIN",
    "brake",
    "steeringWheelAngle",
    "torqueAtTransmission",
    "engineSpeed",
    "vehicleSpeed",
    "acceleration",
    "parkingBrakeStatus",
    "brakePedalStatus",
    "transmissionGearPosition",
    "gearLeverPosition",
    "odometer",
    "ignitionStatus",
    "fuelLevel",
    "fuelConsumedSinceRestart",
    "oilTemp",
    "location",
    "timestamp")
SELECT VIN,
    brake,
    steeringWheelAngle,
    torqueAtTransmission,
    engineSpeed,
    vehicleSpeed,
    acceleration,
    parkingBrakeStatus,
    brakePedalStatus,
    transmissionGearPosition,
    gearLeverPosition,
    odometer,
    ignitionStatus,
    fuelLevel,
    fuelConsumedSinceRestart,
    oilTemp,
    location,
    timestamp
FROM "INPUT_STREAM"

To migrate an in-application stream and pump from our Kinesis Data Analytics for SQL application to Kinesis Data Analytics Studio, we convert this into a single CREATE statement by removing the pump definition and defining a kinesis connector. The first paragraph in the Zeppelin notebook sets up a connector that is presented as a table. We can define columns for all items in the incoming message, or a subset.

Run the statement, and a success result is output in your notebook. We can now query this table using SQL, or we can perform programmatic operations with this data using PyFlink or Scala.

Before performing real-time analytics on the streaming data, let’s look at how the data is currently formatted. To do this, we run a simple Flink SQL query on the table we just created. The SQL used in our streaming application is identical to what is used in a SQL application.

Note that if you don’t see records after a few seconds, make sure that your IoT Device Simulator is still running.

If you’re also running the Kinesis Data Analytics for SQL code, you may see a slightly different result set. This is another key differentiator in Kinesis Data Analytics for Apache Flink, because the latter has the concept of exactly once delivery. If this application is deployed to production and is restarted or if scaling actions occur, Kinesis Data Analytics for Apache Flink ensures you only receive each message once, whereas in a Kinesis Data Analytics for SQL application, you need to further process the incoming stream to ensure you ignore repeat messages that could affect your results.

You can stop the current paragraph by choosing the pause icon. You may see an error displayed in your notebook when you stop the query, but it can be ignored. It’s just letting you know that the process was canceled.

Flink SQL implements the SQL standard, and provides an easy way to perform calculations on the stream data just like you would when querying a database table. A common task while enriching data is to create a new field to store a calculation or conversion (such as from Fahrenheit to Celsius), or create new data to provide simpler queries or improved visualizations downstream. Run the next paragraph to see how we can add a Boolean value named accelerating, which we can easily use in our sink to know if an automobile was currently accelerating at the time the sensor was read. The process here doesn’t differ between Kinesis Data Analytics for SQL and Kinesis Data Analytics for Apache Flink.

You can stop the paragraph from running when you have inspected the new column, comparing our new Boolean value to the FLOAT acceleration column.

Data being sent from a sensor is usually compact to improve latency and performance. Being able to enrich the data stream with external data and enrich the stream, such as additional vehicle information or current weather data, can be very useful. In this example, let’s assume we want to bring in data currently stored in a CSV in Amazon S3, and add a column named color that reflects the current engine speed band.

Apache Flink SQL provides several source connectors for AWS services and other sources. Creating a new table like we did in our first paragraph but instead using the filesystem connector permits Flink to directly connect to Amazon S3 and read our source data. Previously in Kinesis Data Analytics for SQL Applications, you couldn’t add new references inline. Instead, you defined S3 reference data and added it to your application configuration, which you could then use as a reference in a SQL JOIN.

NOTE: If you are not using the us-east-1 region, you can download the csv and place the object your own S3 bucket. Reference the csv file as s3a://<bucket-name>/<key-name>

Building on the last query, the next paragraph performs a SQL JOIN on our current data and the new lookup source table we created.

Now that we have an enriched data stream, we restream this data. In a real-world scenario, we have many choices on what to do with our data, such as sending the data to an S3 data lake, another Kinesis data stream for further analysis, or storing the data in OpenSearch Service for visualization. For simplicity, we send the data to Kinesis Data Firehose, which streams the data into an S3 bucket acting as our data lake.

Kinesis Data Firehose can stream data to Amazon S3, OpenSearch Service, Amazon Redshift data warehouses, and Splunk in just a few clicks.

Create the Kinesis Data Firehose delivery stream

To create our delivery stream, complete the following steps:

On the Kinesis Data Firehose console, choose Create delivery stream.
Choose Direct PUT for the stream source and Amazon S3 as the target.
Name your delivery stream automotive-firehose.
Under Destination settings, create a new bucket or use an existing bucket.
Take note of the S3 bucket URL.
Choose Create delivery stream.

The stream takes a few seconds to create.

Return to the Kinesis Data Analytics console and choose Streaming applications.
On the Studio tab, and choose your Studio notebook.
Choose the link under IAM role.
In the IAM window, choose Add permissions and Attach policies.
Search for and select AmazonKinesisFullAccess and CloudWatchFullAccess, then choose Attach policy.
You may return to your Zeppelin notebook.

Stream data into Kinesis Data Firehose

As of Apache Flink v1.15, creating the connector to the Firehose delivery stream works similar to creating a connector to any Kinesis data stream. Note that there are two differences: the connector is firehose, and the stream attribute becomes delivery-stream.

After the connector is created, we can write to the connector like any SQL table.

To validate that we’re getting data through the delivery stream, open the Amazon S3 console and confirm you see files being created. Open the file to inspect the new data.

In Kinesis Data Analytics for SQL Applications, we would have created a new destination in the SQL application dashboard. To migrate an existing destination, you add a SQL statement to your notebook that defines the new destination right in the code. You can continue to write to the new destination as you would have with an INSERT while referencing the new table name.

Time data

Another common operation you can perform in Kinesis Data Analytics Studio notebooks is aggregation over a window of time. This sort of data can be used to send to another Kinesis data stream to identify anomalies, send alerts, or be stored for further processing. The next paragraph contains a SQL query that uses a tumbling window and aggregates total fuel consumed for the automotive fleet for 30-second periods. Like our last example, we could connect to another data stream and insert this data for further analysis.

Scala and PyFlink

There are times when a function you’d perform on your data stream is better written in a programming language instead of SQL, for both simplicity and maintenance. Some examples include complex calculations that SQL functions don’t support natively, certain string manipulations, the splitting of data into multiple streams, and interacting with other AWS services (such as text translation or sentiment analysis). Kinesis Data Analytics for Apache Flink has the ability to use multiple Flink interpreters within the Zeppelin notebook, which is not available in Kinesis Data Analytics for SQL Applications.

If you have been paying close attention to our data, you’ll see that the location field is a JSON string. In Kinesis Data Analytics for SQL, we could use string functions and define a SQL function and break apart the JSON string. This is a fragile approach depending on the stability of the message data, but this could be improved with several SQL functions. The syntax for creating a function in Kinesis Data Analytics for SQL follows this pattern:

CREATE FUNCTION ''<function_name>'' ( ''<parameter_list>'' )
    RETURNS ''<data type>''
    LANGUAGE SQL
    [ SPECIFIC ''<specific_function_name>''  | [NOT] DETERMINISTIC ]
    CONTAINS SQL
    [ READS SQL DATA ]
    [ MODIFIES SQL DATA ]
    [ RETURNS NULL ON NULL INPUT | CALLED ON NULL INPUT ]  
  RETURN ''<SQL-defined function body>''

In Kinesis Data Analytics for Apache Flink, AWS recently upgraded the Apache Flink environment to v1.15, which extends Apache Flink SQL’s table SQL to add JSON functions that are similar to JSON Path syntax. This allows us to query the JSON string directly in our SQL. See the following code:

%flink.ssql(type=update)
SELECT JSON_STRING(location, ‘$.latitude) AS latitude,
JSON_STRING(location, ‘$.longitude) AS longitude
FROM my_table

Alternatively, and required prior to Apache Flink v1.15, we can use Scala or PyFlink in our notebook to convert the field and restream the data. Both languages provide robust JSON string handling.

The following PyFlink code defines two user-defined functions, which extract the latitude and longitude from the location field of our message. These UDFs can then be invoked from using Flink SQL. We reference the environment variable st_env. PyFlink creates six variables for you in your Zeppelin notebook. Zeppelin also exposes a context for you as the variable z.

Errors can also happen when messages contain unexpected data. Kinesis Data Analytics for SQL Applications provides an in-application error stream. These errors can then be processed separately and restreamed or dropped. With PyFlink in Kinesis Data Analytics Streaming applications, you can write complex error-handling strategies and immediately recover and continue processing the data. When the JSON string is passed into the UDF, it may be malformed, incomplete, or empty. By catching the error in the UDF, Python will always return a value even if an error would have occurred.

The following sample code shows another PyFlink snippet that performs a division calculation on two fields. If a division-by-zero error is encountered, it provides a default value so the stream can continue processing the message.

%flink.pyflink
@udf(input_types=[DataTypes.BIGINT()], result_type=DataTypes.BIGINT())
def DivideByZero(price):    
	try:        
		price / 0        
	except:        
		return -1
st_env.register_function("DivideByZero", DivideByZero)

Next steps

Building a pipeline as we’ve done in this post gives us the base for testing additional services in AWS. I encourage you to continue your streaming analytics learning before tearing down the streams you created. Consider the following:

Publish your Kinesis Data Analytics Studio notebook as an application with durable state.
Use your Kinesis Data Firehose delivery stream to write directly to OpenSearch Service.
Use OpenSearch Dashboards to visualize your streaming data.
Review the Migrating to Kinesis Data Analytics for Apache Flink: Examples docs for side-by-side translations of common SQL-based Kinesis Data Analytics application queries to Kinesis Data Analytics Studio.

Clean up

To clean up the services created in this exercise, complete the following steps:

Navigate to the CloudFormation Console and delete the IoT Device Simulator stack.
On the AWS IoT Core console, choose Message Routing and Rules, and delete the rule automotive_route_kinesis.
Delete the Kinesis data stream automotive-data in the Kinesis Data Stream console.
Remove the IAM role automotive-role in the IAM Console.
In the AWS Glue console, delete the automotive-notebook-glue database.
Delete the Kinesis Data Analytics Studio notebook automotive-data-notebook.
Delete the Firehose delivery stream automotive-firehose.

Conclusion

Thanks for following along with this tutorial on Kinesis Data Analytics Studio. If you’re currently using a legacy Kinesis Data Analytics Studio SQL application, I recommend you reach out to your AWS technical account manager or Solutions Architect and discuss migrating to Kinesis Data Analytics Studio. You can continue your learning path in our Amazon Kinesis Data Streams Developer Guide, and access our code samples on GitHub.

About the Author

Nicholas Tunney is a Partner Solutions Architect for Worldwide Public Sector at AWS. He works with global SI partners to develop architectures on AWS for clients in the government, nonprofit healthcare, utility, and education sectors.

Join a streaming data source with CDC data for real-time serverless data analytics using AWS Glue, AWS DMS, and Amazon DynamoDB

2023-05-30 Manish Kola

Post Syndicated from Manish Kola original https://aws.amazon.com/blogs/big-data/join-streaming-source-cdc-glue/

Customers have been using data warehousing solutions to perform their traditional analytics tasks. Recently, data lakes have gained lot of traction to become the foundation for analytical solutions, because they come with benefits such as scalability, fault tolerance, and support for structured, semi-structured, and unstructured datasets.

Data lakes are not transactional by default; however, there are multiple open-source frameworks that enhance data lakes with ACID properties, providing a best of both worlds solution between transactional and non-transactional storage mechanisms.

Traditional batch ingestion and processing pipelines that involve operations such as data cleaning and joining with reference data are straightforward to create and cost-efficient to maintain. However, there is a challenge to ingest datasets, such as Internet of Things (IoT) and clickstreams, at a fast rate with near-real-time delivery SLAs. You will also want to apply incremental updates with change data capture (CDC) from the source system to the destination. To make data-driven decisions in a timely manner, you need to account for missed records and backpressure, and maintain event ordering and integrity, especially if the reference data also changes rapidly.

In this post, we aim to address these challenges. We provide a step-by-step guide to join streaming data to a reference table changing in real time using AWS Glue, Amazon DynamoDB, and AWS Database Migration Service (AWS DMS). We also demonstrate how to ingest streaming data to a transactional data lake using Apache Hudi to achieve incremental updates with ACID transactions.

Solution overview

For our example use case, streaming data is coming through Amazon Kinesis Data Streams, and reference data is managed in MySQL. The reference data is continuously replicated from MySQL to DynamoDB through AWS DMS. The requirement here is to enrich the real-time stream data by joining with the reference data in near-real time, and to make it queryable from a query engine such as Amazon Athena while keeping consistency. In this use case, reference data in MySQL can be updated when the requirement is changed, and then queries need to return results by reflecting updates in the reference data.

This solution addresses the issue of users wanting to join streams with changing reference datasets when the size of the reference dataset is small. The reference data is maintained in DynamoDB tables, and the streaming job loads the full table into memory for each micro-batch, joining a high-throughput stream to a small reference dataset.

The following diagram illustrates the solution architecture.

Prerequisites

For this walkthrough, you should have the following prerequisites:

An AWS account.
A host with the installed MySQL utility, such as an Amazon Elastic Compute Cloud (Amazon EC2) instance, AWS Cloud9, your laptop, and so on. The host is used to access an Amazon Aurora MySQL-Compatible Edition cluster that you create and to run a Python script that sends sample records to the Kinesis data stream.

Create IAM roles and S3 bucket

In this section, you create an Amazon Simple Storage Service (Amazon S3) bucket and two AWS Identity and Access Management (IAM) roles: one for the AWS Glue job, and one for AWS DMS. We do this using an AWS CloudFormation template. Complete the following steps:

Sign in to the AWS CloudFormation console.
Choose Launch Stack::
Choose Next.
For Stack name, enter a name for your stack.
For DynamoDBTableName, enter tgt_country_lookup_table. This is the name of your new DynamoDB table.
For S3BucketNamePrefix, enter the prefix of your new S3 bucket.
Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
Choose Create stack.

Stack creation can take about 1 minute.

Create a Kinesis data stream

In this section, you create a Kinesis data stream:

On the Kinesis console, choose Data streams in the navigation pane.
Choose Create data stream.
For Data stream name, enter your stream name.
Leave the remaining settings as default and choose Create data stream.

A Kinesis data stream is created with on-demand mode.

Create and configure an Aurora MySQL cluster

In this section, you create and configure an Aurora MySQL cluster as the source database. First, configure your source Aurora MySQL database cluster to enable CDC through AWS DMS to DynamoDB.

Create a parameter group

Complete the following steps to create a new parameter group:

On the Amazon RDS console, choose Parameter groups in the navigation pane.
Choose Create parameter group.
For Parameter group family, select aurora-mysql5.7.
For Type, choose DB Cluster Parameter Group.
For Group name, enter my-mysql-dynamodb-cdc.
For Description, enter Parameter group for demo Aurora MySQL database.
Choose Create.
Select my-mysql-dynamodb-cdc, and choose Edit under Parameter group actions.
Edit the parameter group as follows:

Name	Value
binlog_row_image	full
binlog_format	ROW
binlog_checksum	NONE
log_slave_updates	1

Choose Save changes.

Create the Aurora MySQL cluster

Complete following steps to create the Aurora MySQL cluster:

On the Amazon RDS console, choose Databases in the navigation pane.
Choose Create database.
For Choose a database creation method, choose Standard create.
Under Engine options, for Engine type, choose Aurora (MySQL Compatible).
For Engine version, choose Aurora (MySQL 5.7) 2.11.2.
For Templates, choose Production.
Under Settings, for DB cluster identifier, enter a name for your database.
For Master username, enter your primary user name.
For Master password and Confirm master password, enter your primary password.
Under Instance configuration, for DB instance class, choose Burstable classes (includes t classes) and choose db.t3.small.
Under Availability & durability, for Multi-AZ deployment, choose Don’t create an Aurora Replica.
Under Connectivity, for Compute resource, choose Don’t connect to an EC2 compute resource.
For Network type, choose IPv4.
For Virtual private cloud (VPC), choose your VPC.
For DB subnet group, choose your public subnet.
For Public access, choose Yes.
For VPC security group (firewall), choose the security group for your public subnet.
Under Database authentication, for Database authentication options, choose Password authentication.
Under Additional configuration, for DB cluster parameter group, choose the cluster parameter group you created earlier.
Choose Create database.

Grant permissions to the source database

The next step is to grant the required permission on the source Aurora MySQL database. Now you can connect to the DB cluster using the MySQL utility. You can run queries to complete the following tasks:

Create a demo database and table and run queries on the data
Grant permission for a user used by the AWS DMS endpoint

Complete the following steps:

Log in to the EC2 instance that you’re using to connect to your DB cluster.
Enter the following command at the command prompt to connect to the primary DB instance of your DB cluster:

$ mysql -h mycluster.cluster-123456789012.us-east-1.rds.amazonaws.com -P 3306 -u admin -p

Run the following SQL command to create a database:

> CREATE DATABASE mydev;

Run the following SQL command to create a table:

> use mydev; 
> CREATE TABLE country_lookup_table
(
code varchar(5),
countryname varchar(40) not null,
combinedname varchar(40) not null
);

Run the following SQL command to populate the table with data:

> INSERT INTO country_lookup_table(code, countryname, combinedname) VALUES ('IN', 'India', 'IN-India'), ('US', 'USA', 'US-USA'), ('CA', 'Canada', 'CA-Canada'), ('CN', 'China', 'CN-China');

Run the following SQL command to create a user for the AWS DMS endpoint and grant permissions for CDC tasks (replace the placeholder with your preferred password):

> CREATE USER repl IDENTIFIED BY '<your-password>';
> GRANT REPLICATION CLIENT, REPLICATION SLAVE ON *.* TO 'repl'@'%';
> GRANT SELECT ON mydev.country_lookup_table TO 'repl'@'%';

Create and configure AWS DMS resources to load data into the DynamoDB reference table

In this section, you create and configure AWS DMS to replicate data into the DynamoDB reference table.

Create an AWS DMS replication instance

First, create an AWS DMS replication instance by completing the following steps:

On the AWS DMS console, choose Replication instances in the navigation pane.
Choose Create replication instance.
Under Settings, for Name, enter a name for your instance.
Under Instance configuration, for High Availability, choose Dev or test workload (Single-AZ).
Under Connectivity and security, for VPC security groups, choose default.
Choose Create replication instance.

Create Amazon VPC endpoints

Optionally, you can create Amazon VPC endpoints for DynamoDB when you need to connect to your DynamoDB table from the AWS DMS instance in a private network. Also make sure that you enable Publicly accessible when you need to connect to a database outside of your VPC.

Create an AWS DMS source endpoint

Create an AWS DMS source endpoint by completing the following steps:

On the AWS DMS console, choose Endpoints in the navigation pane.
Choose Create endpoint.
For Endpoint type, choose Source endpoint.
Under Endpoint configuration, for Endpoint identifier, enter a name for your endpoint.
For Source engine, choose Amazon Aurora MySQL.
For Access to endpoint database, choose Provide access information manually.
For Server Name, enter the endpoint name of your Aurora writer instance (for example, mycluster.cluster-123456789012.us-east-1.rds.amazonaws.com).
For Port, enter 3306.
For User name, enter a user name for your AWS DMS task.
For Password, enter a password.
Choose Create endpoint.

Crate an AWS DMS target endpoint

Create an AWS DMS target endpoint by completing the following steps:

On the AWS DMS console, choose Endpoints in the navigation pane.
Choose Create endpoint.
For Endpoint type, choose Target endpoint.
Under Endpoint configuration, for Endpoint identifier, enter a name for your endpoint.
For Target engine, choose Amazon DynamoDB.
For Service access role ARN, enter the IAM role for your AWS DMS task.
Choose Create endpoint.

Create AWS DMS migration tasks

Create AWS DMS database migration tasks by completing the following steps:

On the AWS DMS console, choose Database migration tasks in the navigation pane.
Choose Create task.
Under Task configuration, for Task identifier, enter a name for your task.
For Replication instance, choose your replication instance.
For Source database endpoint, choose your source endpoint.
For Target database endpoint, choose your target endpoint.
For Migration type, choose Migrate existing data and replicate ongoing changes.
Under Task settings, for Target table preparation mode, choose Do nothing.
For Stop task after full load completes, choose Don’t stop.
For LOB column settings, choose Limited LOB mode.
For Task logs, enable Turn on CloudWatch logs and Turn on batch-optimized apply.
Under Table mappings, choose JSON Editor and enter the following rules.

Here you can add values to the column. With the following rules, the AWS DMS CDC task will first create a new DynamoDB table with the specified name in target-table-name. Then it will replicate all the records, mapping the columns in the DB table to the attributes in the DynamoDB table.

{
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "1",
            "object-locator": {
                "schema-name": "mydev",
                "table-name": "country_lookup_table"
            },
            "rule-action": "include"
        },
        {
            "rule-type": "object-mapping",
            "rule-id": "2",
            "rule-name": "2",
            "rule-action": "map-record-to-record",
            "object-locator": {
                "schema-name": "mydev",
                "table-name": "country_lookup_table"
            },
            "target-table-name": "tgt_country_lookup_table",
            "mapping-parameters": {
                "partition-key-name": "code",
                "sort-key-name": "countryname",
                "exclude-columns": [
                    "code",
                    "countryname"
                ],
                "attribute-mappings": [
                    {
                        "target-attribute-name": "code",
                        "attribute-type": "scalar",
                        "attribute-sub-type": "string",
                        "value": "${code}"
                    },
                    {
                        "target-attribute-name": "countryname",
                        "attribute-type": "scalar",
                        "attribute-sub-type": "string",
                        "value": "${countryname}"
                    }
                ],
                "apply-during-cdc": true
            }
        }
    ]
}

Choose Create task.

Now the AWS DMS replication task has been started.

Wait for the Status to show as Load complete.

On the DynamoDB console, choose Tables in the navigation pane.
Select the DynamoDB reference table, and choose Explore table items to review the replicated records.

Create an AWS Glue Data Catalog table and an AWS Glue streaming ETL job

In this section, you create an AWS Glue Data Catalog table and an AWS Glue streaming extract, transform, and load (ETL) job.

Create a Data Catalog table

Create an AWS Glue Data Catalog table for the source Kinesis data stream with the following steps:

On the AWS Glue console, choose Databases under Data Catalog in the navigation pane.
Choose Add database.
For Name, enter my_kinesis_db.
Choose Create database.
Choose Tables under Databases, then choose Add table.
For Name, enter my_stream_src_table.
For Database, choose my_kinesis_db.
For Select the type of source, choose Kinesis.
For Kinesis data stream is located in, choose my account.
For Kinesis stream name, enter a name for your data stream.
For Classification, select JSON.
Choose Next.
Choose Edit schema as JSON, enter the following JSON, then choose Save.

[
  {
    "Name": "uuid",
    "Type": "string",
    "Comment": ""
  },
  {
    "Name": "country",
    "Type": "string",
    "Comment": ""
  },
  {
    "Name": "itemtype",
    "Type": "string",
    "Comment": ""
  },
  {
    "Name": "saleschannel",
    "Type": "string",
    "Comment": ""
  },
  {
    "Name": "orderpriority",
    "Type": "string",
    "Comment": ""
  },
  {
    "Name": "orderdate",
    "Type": "string",
    "Comment": ""
  },
  {
    "Name": "region",
    "Type": "string",
    "Comment": ""
  },
  {
    "Name": "shipdate",
    "Type": "string",
    "Comment": ""
  },
  {
    "Name": "unitssold",
    "Type": "string",
    "Comment": ""
  },
  {
    "Name": "unitprice",
    "Type": "string",
    "Comment": ""
  },
  {
    "Name": "unitcost",
    "Type": "string",
    "Comment": ""
  },
  {
    "Name": "totalrevenue",
    "Type": "string",
    "Comment": ""
  },
  {
    "Name": "totalcost",
    "Type": "string",
    "Comment": ""
  },
  {
    "Name": "totalprofit",
    "Type": "string",
    "Comment": ""
  },
  {
    "Name": "impressiontime",
    "Type": "string",
    "Comment": ""
  }
]

1. Choose Next, then choose Create.

Create an AWS Glue streaming ETL job

Next, you create an AWS Glue streaming job. AWS Glue 3.0 and later supports Apache Hudi natively, so we use this native integration to ingest into a Hudi table. Complete the following steps to create the AWS Glue streaming job:

On the AWS Glue Studio console, choose Spark script editor and choose Create.
Under Job details tab, for Name, enter a name for your job.
For IAM Role, choose the IAM role for your AWS Glue job.
For Type, select Spark Streaming.
For Glue version, choose Glue 4.0 – Supports spark 3.3, Scala 2, Python 3.
For Requested number of workers, enter 3.
Under Advanced properties, for Job parameters, choose Add new parameter.
For Key, enter --conf.
For Value, enter spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.hive.convertMetastoreParquet=false.
Choose Add new parameter.
For Key, enter --datalake-formats.
For Value, enter hudi.
For Script path, enter s3://<S3BucketName>/scripts/.
For Temporary path, enter s3://<S3BucketName>/temporary/.
Optionally, for Spark UI logs path, enter s3://<S3BucketName>/sparkHistoryLogs/.

On the Script tab, enter the following script into the AWS Glue Studio editor and choose Create.

The near-real-time streaming job enriches data by joining a Kinesis data stream with a DynamoDB table that contains frequently updated reference data. The enriched dataset is loaded into the target Hudi table in the data lake. Replace <S3BucketName> with your bucket that you created via AWS CloudFormation:

import sys, json
import boto3
from pyspark.sql import DataFrame, Row
from pyspark.context import SparkContext
from pyspark.sql.types import *
from pyspark.sql.functions import *
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv,["JOB_NAME"])

# Initialize spark session and Glue context
sc = SparkContext() 
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# job paramters
dydb_lookup_table = "tgt_country_lookup_table"
kin_src_database_name = "my_kinesis_db" 
kin_src_table_name = "my_stream_src_table" 
hudi_write_operation = "upsert" 
hudi_record_key = "uuid" 
hudi_precomb_key = "orderdate" 
checkpoint_path = "s3://<S3BucketName>/streamlab/checkpoint/" 
s3_output_folder = "s3://<S3BucketName>/output/"
hudi_table = "hudi_table"
hudi_database = "my_kinesis_db"

# hudi options 
additional_options={
    "hoodie.datasource.hive_sync.use_jdbc": "false",
    "hoodie.datasource.write.recordkey.field": hudi_record_key,
    "hoodie.datasource.hive_sync.database": hudi_database,
    "hoodie.table.name": hudi_table,
    "hoodie.consistency.check.enabled": "true",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
    "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.NonPartitionedExtractor",
    "hoodie.datasource.write.hive_style_partitioning": "false",
    "hoodie.datasource.write.precombine.field": hudi_precomb_key,
    "hoodie.bulkinsert.shuffle.parallelism": "4",
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.write.operation": hudi_write_operation,
    "hoodie.datasource.write.storage.type": "COPY_ON_WRITE",
}

# Scan and load the reference data table from DynamoDB into AWS Glue DynamicFrames using boto3 API.
def readDynamoDb():
    dynamodb = boto3.resource(“dynamodb”)
    table = dynamodb.Table(dydb_lookup_table)
    response = table.scan()
    items = response[“Items”]
    jsondata = sc.parallelize(items)
    lookupDf = glueContext.read.json(jsondata)
    return lookupDf


# Load the Amazon Kinesis data stream from Amazon Glue Data Catalog.
source_df = glueContext.create_data_frame.from_catalog(
    database=kin_src_database_name,
    table_name=kin_src_table_name,
    transformation_ctx=”source_df”,
    additional_options={“startingPosition”: “TRIM_HORIZON”},
)

# As part of batch processing, implement the transformation logic for joining streaming data frames with reference data frames.
def processBatch(data_frame, batchId):
    if data_frame.count() > 0:

        # Refresh the dymanodb table to pull latest snapshot for each microbatch
        country_lookup_df = readDynamoDb() 
                
        final_frame = data_frame.join(
            country_lookup_df, 
            data_frame["country"] == country_lookup_df["countryname"], 
            'left'
        ).drop(
            "countryname",
            "country",
            "unitprice", 
            "unitcost",
            "totalrevenue",
            "totalcost",
            "totalprofit"
        )

        # Script generated for node my-lab-hudi-connector
        final_frame.write.format("hudi") \
            .options(**additional_options) \
            .mode("append") \
            .save(s3_output_folder)
        
try:
    glueContext.forEachBatch(
        frame=source_df,
        batch_function=processBatch,
        options={"windowSize": "60 seconds", "checkpointLocation": checkpoint_path},
    )
except Exception as e:
    print(f"Error is @@@ ....{e}")

Choose Run to start the streaming job.

The following screenshot shows examples of the DataFrames data_frame, country_lookup_df, and final_frame.

The AWS Glue job successfully joined records coming from the Kinesis data stream and the reference table in DynamoDB, and then ingested the joined records into Amazon S3 in Hudi format.

Create and run a Python script to generate sample data and load it into the Kinesis data stream

In this section, you create and run a Python to generate sample data and load it into the source Kinesis data stream. Complete the following steps:

Log in to AWS Cloud9, your EC2 instance, or any other computing host that puts records in your data stream.
Create a Python file called generate-data-for-kds.py:

$ python3 generate-data-for-kds.py

Open the Python file and enter the following script:

import json
import random
import boto3
import time

STREAM_NAME = "<mystreamname>"

def get_data():
    return {
        "uuid": random.randrange(0, 1000001, 1),
        "country": random.choice( [ "United Arab Emirates", "China", "India", "United Kingdom", "United States of America", ] ),
        "itemtype": random.choice( [ "Snacks", "Cereals", "Cosmetics", "Fruits", "Clothes", "Babycare", "Household", ] ),
        "saleschannel": random.choice( [ "Snacks", "Cereals", "Cosmetics", "Fruits", "Clothes", "Babycare", "Household", ] ),
        "orderpriority": random.choice(["H", "L", "M", "C"]),
        "orderdate": random.choice( [ "1/4/10", "2/28/10", "2/15/11", "11/8/11", "2/1/12", "2/18/12", "3/1/12", "9/24/12",
                                      "10/13/12", "12/2/12", "12/29/12", "3/30/13", "7/29/13", "3/23/14", "6/14/14", 
                                      "7/15/14", "10/19/14", "5/7/15", "10/11/15", "11/22/15", "8/23/16", "1/15/17", 
                                      "1/27/17", "2/25/17", "3/10/17", "4/1/17", ] ),
        "region": random.choice( ["Asia" "Europe", "Americas", "Middle Eastern", "Africa"] ),
        "shipdate": random.choice( [ "1/4/10", "2/28/10", "2/15/11", "11/8/11", "2/1/12", "2/18/12", "3/1/12", "9/24/12", 
                                    "10/13/12", "12/2/12", "12/29/12", "3/30/13", "7/29/13", "3/23/14", "6/14/14", "7/15/14",
                                      "10/19/14", "5/7/15", "10/11/15", "11/22/15", "8/23/16", "1/15/17", "1/27/17", 
                                      "2/25/17", "3/10/17", "4/1/17", ] ),
        "unitssold": random.choice( [ "8217", "3465", "8877", "2882", "70", "7044", "6307", "2384", "1327", "2572", "8794", 
                                     "4131", "5793", "9091", "4314", "9085", "5270", "5459", "1982", "8245", "4860", "4656", 
                                     "8072", "65", "7864", "9778", ] ),
        "unitprice": random.choice( [ "97.44", "117.11", "364.69", "502.54", "263.33", "117.11", "35.84", "6.92", "35.84", 
                                     "6.92", "35.84", "56.67", "159.42", "502.54", "117.11", "56.67", "524.96", "502.54", 
                                     "56.67", "56.67", "159.42", "56.67", "35.84", "159.42", "502.54", "31.79", ] ),
        "unitcost": random.choice( [ "97.44", "117.11", "364.69", "502.54", "263.33", "117.11", "35.84", "6.92", "35.84", 
                                    "6.92", "35.84", "56.67", "159.42", "502.54", "117.11", "56.67", "524.96", "502.54", 
                                    "56.67", "56.67", "159.42", "56.67", "35.84", "159.42", "502.54", "31.79", ] ),
        "totalrevenue": random.choice( [ "1253749.86", "712750.5", "3745117.53", "1925954.14", "30604", "1448950.8", 
                                        "689228.96", "22242.72", "145014.56", "23996.76", "961008.32", "337626.63", 
                                        "1478837.04", "6075242.57", "887389.8", "742517.05", "3431876.7", "3648085.93", 
                                        "161988.86", "673863.85", "1240660.8", "380534.88", "882108.16", "16593.2", 
                                        "5255275.28", "463966.1", ] ),
        "totalcost": random.choice( [ "800664.48", "405786.15", "3237353.13", "1448320.28", "18433.1", "824922.84", 
                                     "226042.88", "16497.28", "47559.68", "17798.24", "315176.96", "234103.77", "923520.06",
                                       "4568591.14", "505212.54", "514846.95", "2766539.2", "2743365.86", 
                                       "112319.94", "467244.15", "774781.2", "263855.52", "289300.48", "10362.3", 
                                       "3951974.56", "310842.62", ] ),
        "totalprofit": random.choice( [ "453085.38", "306964.35", "507764.4", "477633.86", "12170.9", "624027.96", 
                                       "463186.08", "5745.44", "97454.88", "6198.52", "645831.36", "103522.86", "555316.98", 
                                       "1506651.43", "382177.26", "227670.1", "665337.5", "904720.07", "49668.92", "206619.7",
                                         "465879.6", "116679.36", "592807.68", "6230.9", "1303300.72", "153123.48", ] ),
        "impressiontime": random.choice( [ "2022-10-24T02:27:41Z", "2022-10-24T02:27:41Z", "2022-11-24T02:27:41Z", 
                                          "2022-12-24T02:27:41Z", "2022-13-24T02:27:41Z", "2022-14-24T02:27:41Z", 
                                          "2022-15-24T02:27:41Z", ] ),
    }

def generate(stream_name, kinesis_client):
    while True:
        data = get_data()
        print(data)
        kinesis_client.put_record(
            StreamName=stream_name, Data=json.dumps(data), PartitionKey="partitionkey"
        )
        time.sleep(2)

if __name__ == "__main__":
    generate(STREAM_NAME, boto3.client("kinesis"))

This script puts a Kinesis data stream record every 2 seconds.

Simulate updating the reference table in the Aurora MySQL cluster

Now all the resources and configurations are ready. For this example, we want to add a 3-digit country code to the reference table. Let’s update records in the Aurora MySQL table to simulate changes. Complete the following steps:

Make sure that the AWS Glue streaming job is already running.
Connect to the primary DB instance again, as described earlier.
Enter your SQL commands to update records:

> UPDATE country_lookup_table SET combinedname='US-USA-US' WHERE code='US';
> UPDATE country_lookup_table SET combinedname='CA-CAN-Canada' WHERE code='CA';
> UPDATE country_lookup_table SET combinedname='CN-CHN-China' WHERE code='CN';
> UPDATE country_lookup_table SET combinedname='IN-IND-India' WHERE code='IN';

Now the reference table in the Aurora MySQL source database has been updated. Then the changes are automatically replicated to the reference table in DynamoDB.

The following tables show records in data_frame, country_lookup_df, and final_frame. In country_lookup_df and final_frame, the combinedname column has values formatted as <2-digit-country-code>-<3-digit-country-code>-<country-name>, which shows that the changed records in the referenced table are reflected in the table without restarting the AWS Glue streaming job. It means that the AWS Glue job successfully joins the incoming records from the Kinesis data stream with the reference table even when the reference table is changing.

Query the Hudi table using Athena

Let’s query the Hudi table using Athena to see the records in the destination table. Complete the following steps:

Make sure that the script and the AWS Glue Streaming job is still working:
1. The Python script (generate-data-for-kds.py) is still running.
2. The generated data is being sent to the data stream.
3. The AWS Glue streaming job is still running.
On the Athena console, run the following SQL in the query editor:

select shipdate, unitssold, impressiontime, code,combinedname from <database>.<table>
where combinedname is not null
limit 10;

The following query result shows the records that are processed before the referenced table was changed. Records in the combinedname column are similar to <2-digit-country-code>-<country-name>.

The following query result shows the records that are processed after the referenced table was changed. Records in the combinedname column are similar to <2-digit-country-code>-<3-digit-country-code>-<country-name>.

Now you understand that the changed reference data is successfully reflected in the target Hudi table joining records from the Kinesis data stream and the reference data in DynamoDB.

Clean up

As the final step, clean up the resources:

Delete the Kinesis data stream.
Delete the AWS DMS migration task, endpoint, and replication instance.
Stop and delete the AWS Glue streaming job.
Delete the AWS Cloud9 environment.
Delete the CloudFormation template.

Conclusion

Building and maintaining a transactional data lake that involves real-time data ingestion and processing has multiple variable components and decisions to be made, such as what ingestion service to use, how to store your reference data, and what transactional data lake framework to use. In this post, we provided the implementation details of such a pipeline, using AWS native components as the building blocks and Apache Hudi as the open-source framework for a transactional data lake.

We believe that this solution can be a starting point for organizations looking to implement a new data lake with such requirements. Additionally, the different components are fully pluggable and can be mixed and matched to existing data lakes to target new requirements or migrate existing ones, addressing their pain points.

About the authors

Manish Kola is a Data Lab Solutions Architect at AWS, where he works closely with customers across various industries to architect cloud-native solutions for their data analytics and AI needs. He partners with customers on their AWS journey to solve their business problems and build scalable prototypes. Before joining AWS, Manish’s experience includes helping customers implement data warehouse, BI, data integration, and data lake projects.

Santosh Kotagiri is a Solutions Architect at AWS with experience in data analytics and cloud solutions leading to tangible business results. His expertise lies in designing and implementing scalable data analytics solutions for clients across industries, with a focus on cloud-native and open-source services. He is passionate about leveraging technology to drive business growth and solve complex problems.

Chiho Sugimoto is a Cloud Support Engineer on the AWS Big Data Support team. She is passionate about helping customers build data lakes using ETL workloads. She loves planetary science and enjoys studying the asteroid Ryugu on weekends.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his new road bike.

Real-time anomaly detection via Random Cut Forest in Amazon Kinesis Data Analytics

2023-05-01 Daren Wong

Post Syndicated from Daren Wong original https://aws.amazon.com/blogs/big-data/real-time-anomaly-detection-via-random-cut-forest-in-amazon-kinesis-data-analytics/

Real-time anomaly detection describes a use case to detect and flag unexpected behavior in streaming data as it occurs. Online machine learning (ML) algorithms are popular for this use case because they don’t require any explicit rules and are able to adapt to a changing baseline, which is particularly useful for continuous streams of data where incoming data changes continuously over time.

Random Cut Forest (RCF) is one such algorithm widely used for anomaly detection use cases. In typical setups, we want to be able to run the RCF algorithm on input data with large throughput, and streaming data processing frameworks can help with that. We are excited to share that RCF is possible with Amazon Kinesis Data Analytics for Apache Flink. Apache Flink is a popular open-source framework for real-time, stateful computations over data streams, and can be used to run RCF on input streams with large throughput.

This post demonstrates how we can use Kinesis Data Analytics for Apache Flink to run an online RCF algorithm for anomaly detection.

Solution overview

The following diagram illustrates our architecture, which consists of three components: an input data stream using Amazon Kinesis Data Streams, a Flink job, and an output Kinesis data stream. In terms of data flow, we use a Python script to generate anomalous sine wave data into the input data stream, the data is then processed by RCF in a Flink job, and the resultant anomaly score is delivered to the output data stream.

The following graph shows an example of our expected result, which indicates that the anomaly score peaked when the sine wave data source anomalously dropped to constant -17.

We can implement this solution in three simple steps:

Set up AWS resources via AWS CloudFormation.
Set up a data generator to produce data into the source data stream.
Run the RCF Flink Java code on Kinesis Data Analytics.

Set up AWS resources via AWS CloudFormation

The following CloudFormation stack will create all the AWS resources we need for this tutorial, including two Kinesis data streams, a Kinesis Data Analytics app, and an Amazon Simple Storage Service (Amazon S3) bucket.

Follow the steps on the AWS CloudFormation console to create the stack.

Set up a data generator

Run the following Python script to populate the input data stream with the anomalous sine wave data:

import json
import boto3
import math 

STREAM_NAME = "ExampleInputStream-RCF"


def get_data(time):
    rad = (time/100)%360
    val = math.sin(rad)*10 + 10

    if rad > 2.4 and rad < 2.6:
        val = -17

    return {'time': time, 'value': val}

def generate(stream_name, kinesis_client):
    time = 0

    while True:
        data = get_data(time)
        kinesis_client.put_record(
            StreamName=stream_name,
            Data=json.dumps(data),
            PartitionKey="partitionkey")

        time += 1


if __name__ == '__main__':
    generate(STREAM_NAME, boto3.client('kinesis', region_name='us-west-2'))

Run the RCF Flink Java code on Kinesis Data Analytics

The CloudFormation stack automatically downloaded and packaged the RCF Flink job JAR file for you. Therefore, you can simply go to the Kinesis Data Analytics console to run your application.

That’s it! We now have a running Flink job that continuously reads in data from an input Kinesis data stream and calculates the anomaly score for each new data point given the previous data points it has seen.

The following sections explain the RCF implementation and Flink job code in more detail.

RCF implementation

Numerous RCF implementations are publicly available. For this tutorial, we use the AWS implementation by wrapping it around a custom wrapper (RandomCutForestOperator) to be used in our Flink job.

RandomCutForestOperator is implemented as an Apache Flink ProcessFunction, which is a function that allows us to write custom logic to process every element in the stream. Our custom logic starts with a data transformation via inputDataMapper.apply, followed by getting the anomaly score by calling the AWS RCF library via rcf.getAnomalyScore. The code implementation of RandomCutForestOperator can be found on GitHub.

RandomCutForestOperatorBuilder requires two main types of parameters:

RandomCutForestOperator hyperparameters – We use the following:
- Dimensions – We set this to 1 because our input data is a 1-dimensional sine wave consisting of the float data type.
- ShingleSize – We set this to 1, which means our RCF algorithm will take into account the previous and current data points in anomaly score deduction. Note that this can be increased to account for seasonality in data.
- SampleSize – We set this to 628, which means a maximum of 628 data points is kept in the data sample for each tree.
DataMapper parameters for input and output processing – We use the following:
- InputDataMapper – We use RandomCutForestOperator.SIMPLE_FLOAT_INPUT_DATA_MAPPER to map input data from float to float[].
- ResultMapper – We use RandomCutForestOperator.SIMPLE_TUPLE_RESULT_DATA_MAPPER, which is a BiFunction that joins the anomaly score with the corresponding sine wave data point into a tuple.

Flink job code

The following code snippet illustrates the core streaming structure of our Apache Flink streaming Java code. It first reads in data from the source Kinesis data stream, then processes it using the RCF algorithm. The computed anomaly score is then written to an output Kinesis data stream.

DataStream<Float> sineWaveSource = createSourceFromStaticConfig(env);

sineWaveSource
        .process(
                RandomCutForestOperator.<Float, Tuple2<Float, Double>>builder()
                        .setDimensions(1)
                        .setShingleSize(1)
                        .setSampleSize(628)
                        .setInputDataMapper(RandomCutForestOperator.SIMPLE_FLOAT_INPUT_DATA_MAPPER)
                        .setResultMapper(RandomCutForestOperator.SIMPLE_TUPLE_RESULT_DATA_MAPPER)
                        .build(),
                TupleTypeInfo.getBasicTupleTypeInfo(Float.class, Double.class))
       .addSink(createSinkFromStaticConfig());

In this example, our baseline input data is a sine wave. As shown in the following screenshot, a low anomaly score is returned when the data is regular. However, when there is an anomaly in the data (when the sine wave input data drops to a constant), a high anomaly score is returned. The anomaly score is delivered into an output Kinesis data stream. You can visualize this result by creating a Kinesis Data Analytics Studio app; for instructions, refer to Interactive analysis of streaming data.

Because this is an unsupervised algorithm, you don’t need to provide any explicit rules or labeled datasets for this operator. In short, only the input data stream, data conversions, and some hyperparameters were provided. The RCF algorithm itself determined the expected baseline based on the input data and identified any unexpected behavior.

Furthermore, this means the model will continuously adapt even if the baseline changes over time. As such, minimal retraining cadence is required. This is powerful for anomaly detection on streaming data because the data will often drift slowly over time due seasonal trends, inflation, equipment calibration drift, and so on.

Clean up

To avoid incurring future charges, complete the following steps:

On the Amazon S3 console, empty the S3 bucket created by the CloudFormation stack.
On the AWS CloudFormation console, delete the CloudFormation stack.

Conclusion

This post demonstrated how to perform anomaly detection on input streaming data with RCF, an online unsupervised ML algorithm using Kinesis Data Analytics. We also showed how this algorithm learns the data baseline on its own, and can adapt to changes in the baseline over time. We hope you consider this solution for your real-time anomaly detection use cases.

About the Authors

Daren Wong is a Software Development Engineer in AWS. He works on Amazon Kinesis Data Analytics, the managed offering for running Apache Flink applications on AWS.

Aleksandr Pilipenko is a Software Development Engineer in AWS. He works on Amazon Kinesis Data Analytics, the managed offering for running Apache Flink applications on AWS.

Hong Liang Teoh is a Software Development Engineer in AWS. He works on Amazon Kinesis Data Analytics, the managed offering for running Apache Flink applications on AWS.

Creating scalable architectures with AWS IoT Greengrass stream manager

2023-03-31 Neil Mehta

Post Syndicated from Neil Mehta original https://aws.amazon.com/blogs/architecture/creating-scalable-architectures-with-aws-iot-greengrass-stream-manager/

Designing a scalable, global, real-time, distributed system to process millions of messages from a variety of critical devices can complicate architectures. Collecting large data streams or image recognition from the edge also requires scalable solutions.

AWS IoT Core is designed to handle large numbers of Internet of things (IoT) devices sending a few messages per second. However, when IoT devices send large numbers of messages per second and processing occurs at the edge, managing large data streams is challenging. Further, data that is buffered or processed at the edge can increase latency.

This post describes how to create a scalable IoT architecture with AWS IoT Greengrass stream manager that can handle thousands of critical messages per second from a variety of IoT devices.

Relaying critical messages from high-throughput IoT devices

Let’s explore an example. Consider the architecture in Figure 1, where you have two types of IoT devices sending messages to AWS IoT Core. One set of IoT devices sends thousands of messages per second while the other set of devices sends tens of messages per second.

The IoT devices that are sending thousands of messages per second contain critical data that cannot be lost and must be processed at the edge. The IoT devices sending tens of messages per second are of less importance.

Figure 1. IoT devices sending messages to AWS IoT Core

In considering this architecture, the critical IoT devices sending thousands of messages per second take the following path:

IoT devices send data to an AWS IoT Greengrass component for processing with a Quality of Service (QoS) of 0.
The AWS IoT Greengrass component processes the data and sends it to the AWS IoT Greengrass message broker.
AWS IoT Greengrass message broker relays the data to AWS IoT Core with a QoS of 1.
AWS IoT Core sends the data to Amazon Kinesis Data Streams for further processing.

In contrast, the IoT devices that send a few messages per second take the following path:

IoT devices send data to the AWS IoT Greengrass message broker.
The AWS IoT Greengrass message broker relays the data to AWS IoT Core.
AWS IoT Core sends the data to Kinesis Data Streams for further processing.

Due to the critical nature of the messages being sent, the AWS IoT Greengrass message broker is configured to send messages to the AWS IoT Core with a QoS of 1. However, when you configure QoS to 1, the AWS IoT Greengrass message broker has to wait for an acknowledgement (ACK) before sending more data.

As you add more IoT devices, many choose to batch the messages before sending them to AWS IoT Core. This can be a good strategy when you are dealing with many IoT devices that send a small number of messages per second. But when you are adding IoT devices that send thousands of messages per second, the time waiting for an ACK can add latency and cause inconsistencies when reporting the data downstream.

This is because the AWS IoT Greengrass message broker is capable of sending 100 messages to AWS IoT Core before waiting for an ACK when QoS is set to 1. As a result, scaling this architecture to handle additional IoT devices can become challenging.

For more information about the AWS IoT Greengrass message broker’s limits, refer to the AWS IoT Core message broker and protocol limits and quotas section of the AWS General Reference Guide.

AWS IoT Greengrass stream manager for speed and reliability

To scale this type of architecture, you can use a pre-built AWS IoT Greengrass component called AWS IoT Greengrass stream manager to bypass the Greengrass message broker and AWS IoT Core to send your data directly to AWS IoT Analytics, AWS IoT SiteWise, Amazon Kinesis, or Amazon Simple Storage Service (Amazon S3).

For example, consider the earlier scenario where one set of IoT devices is sending thousands of critical messages per second and another set is sending data of less importance.

Instead, you can use AWS IoT Greengrass stream manager to create an architecture that can easily and reliably send large amounts of data from the edge directly to Kinesis, as in Figure 2.

Figure 2. AWS IoT Greengrass stream manager sending data directly to Kinesis

As opposed to the Figure 1 configuration, the critical IoT devices that send thousands of messages per second can now take the following path:

Critical IoT devices send data to an AWS IoT Greengrass component for processing.
The AWS IoT Greengrass component processes the data and sends it to AWS IoT Core stream manager.
AWS IoT Greengrass stream manager sends the data directly Amazon Kinesis Data Streams.

Note that the IoT devices sending a few messages per second can also be sent to AWS IoT Greengrass stream manager at a lower priority. You are still using AWS IoT Core, but it is no longer the main data path. By continuing to use AWS IoT Core, you can benefit from its control plane features such as managing updates, certificates, and policies. However, AWS IoT Core’s data plane features—like Rules Engine—are no longer used in this architecture, which can help reduce costs. If you choose to bypass the AWS IoT Greengrass message broker and use AWS IoT Core stream manager, any components that you have built must be moved so that processing occurs at the edge.

In the architecture in Figure 2, AWS IoT Greengrass stream manager is being used to bypass the main data path away from the AWS IoT Greengrass message broker and AWS IoT Core. Bypassing these services reduces the latency in Figure 1 caused by the AWS IoT Greengrass message broker waiting for an ACK from AWS IoT Core.

AWS IoT Greengrass stream manager can handle thousands of messages per second so you can:

Reliably scale your architecture
Create multiple data paths to send both critical and non-critical data to AWS IoT Core stream manager while still leveraging AWS IoT Core control plane features
Prioritize your critical data paths to have specific IoT devices take higher priority
Create configurations to handle situations where you have IoT devices with limited or intermittent connectivity. (For example, you can create configurations to use local storage or memory to cache your data when internet connectivity is lost, and then flush data when an ACK is received from the destination.)

All of these features can help you reduce latency, costs, data inconsistencies, and the potential loss of critical data. They also provide a mechanism to scale the number of devices that your architecture can reliably handle.

Get started with stream manager by using the AWS IoT Greengrass Core SDK or the AWS IoT console.

Conclusion

In this blog post, we discussed how to create a scalable IoT architecture that can handle thousands of critical messages per second from a variety of IoT devices. Incorporating AWS IoT Greengrass stream manager into your architecture can help reduce latency, data inconsistencies, and the potential loss of critical data by providing a way to bypass AWS IoT Core and send large amounts of data efficiently and reliably.

Build a real-time GDPR-aligned Apache Iceberg data lake

2023-02-24 Dhiraj Thakur

Post Syndicated from Dhiraj Thakur original https://aws.amazon.com/blogs/big-data/build-a-real-time-gdpr-aligned-apache-iceberg-data-lake/

Data lakes are a popular choice for today’s organizations to store their data around their business activities. As a best practice of a data lake design, data should be immutable once stored. But regulations such as the General Data Protection Regulation (GDPR) have created obligations for data operators who must be able to erase or update personal data from their data lake when requested.

A data lake built on AWS uses Amazon Simple Storage Service (Amazon S3) as its primary storage environment. When a customer asks to erase or update private data, the data lake operator needs to find the required objects in Amazon S3 that contain the required data and take steps to erase or update that data. This activity can be a complex process for the following reasons:

Data lakes may contain many S3 objects (each may contain multiple rows), and often it’s difficult to find the object containing the exact data that needs to be erased or personally identifiable information (PII) to be updated as per the request
By nature, S3 objects are immutable and therefore applying direct row-based transactions like DELETE or UPDATE isn’t possible

To handle these situations, a transactional feature on S3 objects is required, and frameworks such as Apache Hudi or Apache Iceberg provide you the transactional feature for upserts in Amazon S3.

AWS contributed the Apache Iceberg integration with the AWS Glue Data Catalog, which enables you to use open-source data computation engines like Apache Spark with Iceberg on AWS Glue. In 2022, Amazon Athena announced support of Iceberg, enabling transaction queries on S3 objects.

In this post, we show you how to stream real-time data to an Iceberg table in Amazon S3 using AWS Glue streaming and perform transactions using Amazon Athena for deletes and updates. We use a serverless mechanism for this implementation, which requires minimum operational overhead to manage and fine-tune various configuration parameters, and enables you to extend your use case to ACID operations beyond the GDPR.

Solution overview

We used the Amazon Kinesis Data Generator (KDG) to produce synthetic streaming data in Amazon Kinesis Data Streams and then processed the streaming input data using AWS Glue streaming to store the data in Amazon S3 in Iceberg table format. As part of the customer’s request, we ran delete and update statements using Athena with Iceberg support.

The following diagram illustrates the solution architecture.

The solution workflow consists of the following steps:

Streaming data is generated in JSON format using the KDG template and inserted into Kinesis Data Streams.
An AWS Glue streaming job is connected to Kinesis Data Streams to process the data using the Iceberg connector.
The streaming job output is stored in Amazon S3 in Iceberg table format.
Athena uses the AWS Glue Data Catalog to store and retrieve table metadata for the Amazon S3 data in Iceberg format.
Athena interacts with the Data Catalog tables in Iceberg format for transactional queries required for GDPR.

The codebase required for this post is available in the GitHub repository.

Prerequisites

Before starting the implementation, make sure the following prerequisites are met:

You have an AWS account
The Region used for this post is us-east-1
You have access to the AWS CloudFormation template from GitHub to set up the basic services and AWS Identity and Access Management (IAM) role

Deploy resources using AWS CloudFormation

Complete the following steps to deploy your solution resources:

After you sign in to your AWS account, launch the CloudFormation template by choosing Launch Stack:
For Stack name, enter a name.
For Username, enter the user name for the KDG.
For Password, enter the password for the KDG (this must be at least six alphanumeric characters, and contain at least one number).
For IAMGlueStreamingJobRoleName, enter a name for the IAM role used for the AWS Glue streaming job.
Choose Next and create your stack.

This CloudFormation template configures the following resources in your account:

An S3 bucket named streamingicebergdemo-XX (note that the XX part is a random unique number to make the S3 bucket name unique)
An IAM policy and role
The KDG URL used for creating synthetic data

After you complete the setup, go to the Outputs tab of the CloudFormation stack to get the S3 bucket name, AWS Glue job execution role (as per your input), and KDG URL.
Before proceeding with the demo, create a folder named custdata under the created S3 bucket.

Create a Kinesis data stream

We use Kinesis Data Streams to create a serverless streaming data service that is built to handle millions of events with low latency. The following steps guide you on how to create the data stream in the us-east-1 Region:

Log in to the AWS Management Console.
Navigate to Kinesis console (make sure the Region is us-east-1).
Select Kinesis Data Streams and choose Create data stream.
For Data stream name, enter demo-data-stream.
For this post, we select On-demand as the Kinesis data stream capacity mode.

On-demand mode works to eliminate the need for provisioning and managing the capacity for streaming data. However, you can implement this solution with Kinesis Data Streams in provisioned mode as well.

Choose Create data stream.
Wait for successful creation of demo-data-stream and for it to be in Active status.

Set up the Kinesis Data Generator

To create a sample streaming dataset, we use the KDG URL generated on the CloudFormation stack Outputs tab and log in with the credentials used in the parameters for the CloudFormation template. For this post, we use the following template to generate sample data in the demo-data-stream Kinesis data stream.

Log in to the KDG URL with the user name and password you supplied during stack creation.
Change the Region to us-east-1.
Select the Kinesis data stream demo-data-stream.
For Records per second, choose Constant and enter 100 (it can be another number, depending on the rate of record creation).
On the Template 1 tab, enter the KDG data generation template:

{
"year": "{{random.number({"min":2000,"max":2022})}}",
"month": "{{random.number({"min":1,"max":12})}}",
"day": "{{random.number({"min":1,"max":30})}}",
"hour": "{{random.number({"min":0,"max":24})}}",
"minute": "{{random.number({"min":0,"max":60})}}",
"customerid": {{random.number({"min":5023,"max":59874})}},
"firstname" : "{{name.firstName}}",
"lastname" : "{{name.lastName}}",
"dateofbirth" : "{{date.past(70)}}",
"city" : "{{address.city}}",
"buildingnumber" : {{random.number({"min":63,"max":947})}},
"streetaddress" : "{{address.streetAddress}}",
"state" : "{{address.state}}",
"zipcode" : "{{address.zipCode}}",
"country" : "{{address.country}}",
"countrycode" : "{{address.countryCode}}",
"phonenumber" : "{{phone.phoneNumber}}",
"productname" : "{{commerce.productName}}",
"transactionamount": {{random.number(
{
"min":10,
"max":150
}
)}}
}

Choose Test template to test the sample records.
When the testing is correct, choose Send data.

This will start sending 100 records per second in the Kinesis data stream. (To stop sending data, choose Stop Sending Data to Kinesis.)

Integrate Iceberg with AWS Glue

To add the Apache Iceberg Connector for AWS Glue, complete the following steps. The connector is free to use and supports AWS Glue 1.0, 2.0, and 3.0.

On the AWS Glue console, choose AWS Glue Studio in the navigation pane.
In the navigation pane, navigate to AWS Marketplace.
Search for and choose Apache Iceberg Connector for AWS Glue.
Choose Accept Terms and Continue to Subscribe.
Choose Continue to Configuration.
For Fulfillment option, choose your AWS Glue version.
For Software version, choose the latest software version.
Choose Continue to Launch.
Under Usage Instructions, choose the link to activate the connector.
Enter a name for the connection, then choose Create connection and activate the connector.
Verify the new connector on the AWS Glue Studio Connectors.

Create the AWS Glue Data Catalog database

The AWS Glue Data Catalog contains references to data that is used as sources and targets of your extract, transform, and load (ETL) jobs in AWS Glue. To create your data warehouse or data lake, you must catalog this data. The AWS Glue Data Catalog is an index to the location and schema of your data. You use the information in the Data Catalog to create and monitor your ETL jobs.

For this post, we create a Data Catalog database named icebergdemodb containing the metadata information of a table named customer, which will be queried through Athena.

On the AWS Glue console, choose Databases in the navigation pane.
Choose Add database.
For Database name, enter icebergdemodb.

This creates an AWS Glue database for metadata storage.

Create a Data Catalog table in Iceberg format

In this step, we create a Data Catalog table in Iceberg table format.

On the Athena console, create an Athena workgroup named demoworkgroup for SQL queries.
Choose Athena engine version 3 for Query engine version.

For more information about Athena versions, refer to Changing Athena engine versions.

Enter the S3 bucket location for Query result configuration under Additional configurations.
Open the Athena query editor and choose demoworkgroup.
Choose the database icebergdemodb.
Enter and run the following DDL to create a table pointing to the Data Catalog database icerbergdemodb. Note that the TBLPROPERTIES section mentions ICEBERG as the table type and LOCATION points to the S3 folder (custdata) URI created in earlier steps. This DDL command is available on the GitHub repo.

CREATE TABLE icebergdemodb.customer(
year string,
month string,
day string,
hour string,
minute string,
customerid string,
firstname string,
lastname string,
dateofbirth string,
city string,
buildingnumber string,
streetaddress string,
state string,
zipcode string,
country string,
countrycode string,
phonenumber string,
productname string,
transactionamount int)
LOCATION '<S3 Location URI>'
TBLPROPERTIES (
'table_type'='ICEBERG',
'format'='parquet',
'write_target_data_file_size_bytes'='536870912',
'optimize_rewrite_delete_file_threshold'='10'
);

After you run the command successfully, you can see the table customer in the Data Catalog.

Create an AWS Glue streaming job

In this section, we create the AWS Glue streaming job, which fetches the record from the Kinesis data stream using the Spark script editor.

On the AWS Glue console, choose Jobs (new) in the navigation pane.
For Create job¸ select Spark script editor.
For Options¸ select Create a new script with boilerplate code.
Choose Create.
Enter the code available in the GitHub repo in the editor.

The sample code keeps appending data in the target location by fetching records from the Kinesis data stream.

Choose the Job details tab in the query editor.
For Name, enter Demo_Job.
For IAM role¸ choose demojobrole.
For Type, choose Spark Streaming.
For Glue Version, choose Glue 3.0.
For Language, choose Python 3.
For Worker type, choose G 0.25X.
Select Automatically scale the number of workers.
For Maximum number of workers, enter 5.
Under Advanced properties, select Use Glue Data Catalog as the Hive metastore.
For Connections, choose the connector you created.
For Job parameters, enter the following key pairs (provide your S3 bucket and account ID):

Key	Value
`--iceberg_job_catalog_warehouse`	`s3://streamingicebergdemo-XX/custdata/`
`--output_path`	`s3://streamingicebergdemo-XX`
`--kinesis_arn`	`arn:aws:kinesis:us-east-1:<AWS Account ID>:stream/demo-data-stream`
`--user-jars-first`	`True`

Choose Run to start the AWS Glue streaming job.
To monitor the job, choose Monitoring in the navigation pane.
Select Demo_Job and choose View run details to check the job run details and Amazon CloudWatch logs.

Run GDPR use cases on Athena

In this section, we demonstrate a few use cases that are relevant to GDPR alignment with the user data that’s stored in Iceberg format in the Amazon S3-based data lake as implemented in the previous steps. For this, let’s consider that the following requests are being initiated in the workflow to comply with the regulations:

Delete the records for the input customerid (for example, 59289)
Update phonenumber for the customerid (for example, 51842)

The IDs used in this example are samples only because they were created through the KDG template used earlier, which creates sample data. You can search for IDs in your implementation by querying through the Athena query editor. The steps remain the same.

Delete data by customer ID

Complete the following steps to fulfill the first use case:

On the Athena console, and make sure icebergdemodb is chosen as the database.
Open the query editor.
Enter the following query using a customer ID and choose Run:

SELECT count(*)
FROM icebergdemodb.customer
WHERE customerid = '59289';

This query gives the count of records for the input customerid before delete.

Enter the following query with the same customer ID and choose Run:

MERGE INTO icebergdemodb.customer trg
USING (SELECT customerid
FROM icebergdemodb.customer
WHERE customerid = '59289') src
ON (trg.customerid = src.customerid)
WHEN MATCHED
THEN DELETE;

This query deletes the data for the input customerid as per the workflow generated.

Test if there is data with the customer ID using a count query.

The count should be 0.

Update data by customer ID

Complete the following steps to test the second use case:

On the Athena console, make sure icebergdemodb is chosen as the database.
Open the query editor.
Enter the following query with a customer ID and choose Run.

SELECT customerid, phonenumber
FROM icebergdemodb.customer
WHERE customerid = '51936';

This query gives the value for phonenumber before update.

Run the following query to update the required columns:

MERGE INTO icebergdemodb.customer trg
USING (SELECT customerid
FROM icebergdemodb.customer
WHERE customerid = '51936') src
ON (trg.customerid = src.customerid)
WHEN MATCHED
THEN UPDATE SET phonenumber = '000';

This query updates the data to a dummy value.

Run the SELECT query to check the update.

You can see the data is updated correctly.

Vacuum table

A good practice is to run the VACUUM command periodically on the table because operations like INSERT, UPDATE, DELETE, and MERGE will take place on the Iceberg table. See the following code:

VACUUM icebergdemodb.customer;

Considerations

The following are a few considerations to keep in mind for this implementation:

Review the considerations and limitations for Athena support for Iceberg table format. For example, Athena only creates and operates on Iceberg v2 tables. To learn more about AWS integrations, refer to Iceberg AWS integrations.
Check the AWS Glue streaming ETL notes and restrictions for AWS Glue streaming jobs.

Clean up

Complete the following steps to clean up the resources you created for this post:

1. Delete the custdata folder in the S3 bucket.
2. Delete the CloudFormation stack.
3. Delete the Kinesis data stream.
4. Delete the S3 bucket storing the data.
5. Delete the AWS Glue job and Iceberg connector.
6. Delete the AWS Glue Data Catalog database and table.
7. Delete the Athena workgroup.
8. Delete the IAM roles and policies.

Conclusion

This post explained how you can use the Iceberg table format on Athena to implement GDPR use cases like data deletion and data upserts as required, when streaming data is being generated and ingested through AWS Glue streaming jobs in Amazon S3.

The operations for the Iceberg table that we demonstrated in this post aren’t all of the data operations that Iceberg supports. Refer to the Apache Iceberg documentation for details on various operations.

About the Authors

Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.

Rajdip Chaudhuri is Solutions Architect with Amazon Web Services specializing in data and analytics. He enjoys working with AWS customers and partners on data and analytics requirements. In his spare time, he enjoys soccer.