Tag Archives: Amazon Kinesis

Reference guide to analyze transactional data in near-real time on AWS

Post Syndicated from Jason Dalba original https://aws.amazon.com/blogs/big-data/reference-guide-to-analyze-transactional-data-in-near-real-time-on-aws/

Business leaders and data analysts use near-real-time transaction data to understand buyer behavior to help evolve products. The primary challenge businesses face with near-real-time analytics is getting the data prepared for analytics in a timely manner, which can often take days. Companies commonly maintain entire teams to facilitate the flow of data from ingestion to analysis.

The consequence of delays in your organization’s analytics workflow can be costly. As online transactions have gained popularity with consumers, the volume and velocity of data ingestion has led to challenges in data processing. Consumers expect more fluid changes to service and products. Organizations that can’t quickly adapt their business strategy to align with consumer behavior may experience loss of opportunity and revenue in competitive markets.

To overcome these challenges, businesses need a solution that can provide near-real-time analytics on transactional data with services that don’t lead to latent processing and bloat from managing the pipeline. With a properly deployed architecture using the latest technologies in artificial intelligence (AI), data storage, streaming ingestions, and cloud computing, data will become more accurate, timely, and actionable. With such a solution, businesses can make actionable decisions in near-real time, allowing leaders to change strategic direction as soon as the market changes.

In this post, we discuss how to architect a near-real-time analytics solution with AWS managed analytics, AI and machine learning (ML), and database services.

Solution overview

The most common workloads, agnostic of industry, involve transactional data. Transactional data volumes and velocity have continued to rapidly expand as workloads have been pushed online. Near-real-time data is data that is stored, processed, and analyzed on a continual basis, generating information that is available for use almost immediately after it is created. With the power of near-real-time analytics, business units across an organization, including sales, marketing, and operations, can make agile, strategic decisions. Without the proper architecture to support near-real-time analytics, organizations will be dependent on delayed data and won't be able to capitalize on emerging opportunities. Missed opportunities could impact operational efficiency, customer satisfaction, or product innovation.

Managed AWS analytics and database services allow each component of the solution, from ingestion to analysis, to be optimized for speed with little management overhead. Critical business solutions should follow the six pillars of the AWS Well-Architected Framework, which helps cloud architects build secure, high-performing, resilient, and efficient infrastructure for critical workloads.

The following diagram illustrates the solution architecture.

Solution architecture

By combining the appropriate AWS services, your organization can run near-real-time analytics off a transactional data store. In the following sections, we discuss the key components of the solution.

Transactional data storage

In this solution, we use Amazon DynamoDB as our transactional data store. DynamoDB is a managed NoSQL database solution that acts as a key-value store for transactional data. As a NoSQL solution, DynamoDB is optimized for compute (as opposed to storage) and therefore the data needs to be modeled and served up to the application based on how the application needs it. This makes DynamoDB good for applications with known access patterns, which is a property of many transactional workloads.

In DynamoDB, you can create, read, update, or delete items in a table through a partition key. For example, if you want to keep track of how many fitness quests a user has completed in your application, you can query the partition key of the user ID to find the item with an attribute that holds data related to completed quests, then update the relevant attribute to reflect a specific quest's completion. DynamoDB also provides benefits by design, such as the ability to scale to support massive, global, internet-scale applications while maintaining consistent single-digit millisecond latency, because the service horizontally partitions the data across the underlying storage nodes using the partition keys. Modeling your data is therefore very important: DynamoDB scales horizontally based on a partition key, which again is why it's a good fit for a transactional store. In transactional workloads, where access patterns are known up front, it's easier to optimize a data model around those patterns than to design one for ad hoc requests. That being said, DynamoDB doesn't perform scans across many items efficiently, so for this solution, we integrate DynamoDB with other services to help meet the data analysis requirements.
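As a minimal illustration of this access pattern, the following sketch increments a user's completed-quest counter with a single key-based call. The table name, key, and attribute names are hypothetical and not part of any deployed resource in this solution.

import boto3

dynamodb = boto3.client('dynamodb')

# Increment the completed-quests counter for one user (illustrative table and attribute names)
response = dynamodb.update_item(
    TableName='UserProgress',
    Key={'UserId': {'S': 'user-1234'}},
    UpdateExpression='ADD CompletedQuests :inc',
    ExpressionAttributeValues={':inc': {'N': '1'}},
    ReturnValues='UPDATED_NEW'
)
print(response['Attributes'])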

Data streaming

Now that we have stored our workload's transactional data in DynamoDB, we need to move that data to another service that is better suited for analysis. The time to insights on this data matters, so rather than sending data off in batches, we stream the data into an analytics service, which gives us the near-real-time aspect of this solution.

We use Amazon Kinesis Data Streams to stream the data from DynamoDB to Amazon Redshift for this specific solution. Kinesis Data Streams captures item-level modifications in DynamoDB tables and replicates them to a Kinesis data stream. Your applications can access this stream and view item-level changes in near-real time. You can continuously capture and store terabytes of data per hour. Additionally, with the enhanced fan-out capability, you can simultaneously reach two or more downstream applications. Kinesis Data Streams also provides durability and elasticity. The delay between the time a record is put into the stream and the time it can be retrieved (put-to-get delay) is typically less than 1 second. In other words, a Kinesis Data Streams application can start consuming the data from the stream almost immediately after the data is added. The managed service aspect of Kinesis Data Streams relieves you of the operational burden of creating and running a data intake pipeline. The elasticity of Kinesis Data Streams enables you to scale the stream up or down, so you never lose data records before they expire.
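To connect the two services, you enable Kinesis Data Streams as the streaming destination on the DynamoDB table. The following sketch shows the idea with placeholder table and stream names; your resources will differ.

import boto3

dynamodb = boto3.client('dynamodb')
kinesis = boto3.client('kinesis')

# Look up the ARN of an existing data stream (placeholder name)
stream_arn = kinesis.describe_stream(StreamName='ddb-cdc-stream')['StreamDescription']['StreamARN']

# Replicate item-level changes from the DynamoDB table into the Kinesis data stream
dynamodb.enable_kinesis_streaming_destination(
    TableName='UserProgress',
    StreamArn=stream_arn
)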

Analytical data storage

The next service in this solution is Amazon Redshift, a fully managed, petabyte-scale data warehouse service in the cloud. As opposed to DynamoDB, which is meant to update, delete, or read more specific pieces of data, Amazon Redshift is better suited for analytic queries where you are retrieving, comparing, and evaluating large amounts of data in multi-stage operations to produce a final result. Amazon Redshift achieves efficient storage and optimum query performance through a combination of massively parallel processing, columnar data storage, and very efficient, targeted data compression encoding schemes.

Beyond being built for analytical queries, Amazon Redshift natively integrates with AWS streaming services. Amazon Redshift Streaming Ingestion ingests hundreds of megabytes of data per second, so you can query data in near-real time and drive your business forward with analytics. With this zero-ETL approach, Amazon Redshift Streaming Ingestion enables you to connect to multiple Kinesis data streams or Amazon Managed Streaming for Apache Kafka (Amazon MSK) data streams and pull data directly into Amazon Redshift without staging data in Amazon Simple Storage Service (Amazon S3). You can define a schema or choose to ingest semi-structured data with the SUPER data type. With streaming ingestion, a materialized view is the landing area for the data read from the Kinesis data stream, and the data is processed as it arrives. When the view is refreshed, Redshift compute nodes allocate each data shard to a compute slice. We recommend you enable auto refresh for this materialized view so that your data is continuously updated.
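As a rough sketch of how this can be set up programmatically, the streaming ingestion objects are plain SQL DDL that you can submit, for example, through the Redshift Data API. The schema, view, IAM role, workgroup, database, and stream names below are placeholders, and the exact column expressions depend on your data, so treat this as an outline rather than copy-paste DDL.

import boto3

redshift_data = boto3.client('redshift-data')

# Placeholder DDL: map the Kinesis source, then create an auto-refreshing materialized view
statements = [
    """CREATE EXTERNAL SCHEMA kinesis_schema
       FROM KINESIS
       IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-streaming-role'""",
    """CREATE MATERIALIZED VIEW transactions_stream_mv AUTO REFRESH YES AS
       SELECT approximate_arrival_timestamp,
              JSON_PARSE(FROM_VARBYTE(kinesis_data, 'utf-8')) AS payload
       FROM kinesis_schema."my-transaction-stream" """,
]

# Run each statement separately so DDL is not wrapped in a single transaction
for sql in statements:
    response = redshift_data.execute_statement(
        WorkgroupName='analytics-wg',   # or ClusterIdentifier for a provisioned cluster
        Database='dev',
        Sql=sql
    )
    print(response['Id'])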

Data analysis and visualization

After the data pipeline is set up, the last piece is data analysis with Amazon QuickSight to visualize the changes in consumer behavior. QuickSight is a cloud-scale business intelligence (BI) service that you can use to deliver easy-to-understand insights to the people who you work with, wherever they are.

QuickSight connects to your data in the cloud and combines data from many different sources. In a single data dashboard, QuickSight can include AWS data, third-party data, big data, spreadsheet data, SaaS data, B2B data, and more. As a fully managed cloud-based service, QuickSight provides enterprise-grade security, global availability, and built-in redundancy. It also provides the user-management tools that you need to scale from 10 users to 10,000, all with no infrastructure to deploy or manage.

QuickSight gives decision-makers the opportunity to explore and interpret information in an interactive visual environment. They have secure access to dashboards from any device on your network and from mobile devices. Connecting QuickSight to the rest of our solution will complete the flow of data from being initially ingested into DynamoDB to being streamed into Amazon Redshift. QuickSight can create a visual analysis of the data in near-real time because that data is relatively up to date, so this solution can support use cases for making quick decisions on transactional data.

Using AWS for data services allows for each component of the solution, from ingestion to storage to analysis, to be optimized for speed and with little management overhead. With these AWS services, business leaders and analysts can get near-real-time insights to drive immediate change based on customer behavior, enabling organizational agility and ultimately leading to customer satisfaction.

Next steps

The next step to building a solution to analyze transactional data in near-real time on AWS would be to go through the workshop Enable near real-time analytics on data stored in Amazon DynamoDB using Amazon Redshift. In the workshop, you will get hands-on with AWS managed analytics, AI/ML, and database services to dive deep into an end-to-end solution delivering near-real-time analytics on transactional data. By the end of the workshop, you will have gone through the configuration and deployment of the critical pieces that will enable users to perform analytics on transactional workloads.

Conclusion

Developing an architecture that can serve transactional data to near-real-time analytics on AWS can help businesses become more agile in making critical decisions. By ingesting and processing transactional data delivered directly from the application on AWS, businesses can optimize their inventory levels, reduce holding costs, increase revenue, and enhance customer satisfaction.

The end-to-end solution is designed for individuals in various roles, such as business users, data engineers, data scientists, and data analysts, who are responsible for comprehending, creating, and overseeing processes related to retail inventory forecasting. Overall, being able to analyze near-real-time transactional data on AWS provides businesses with timely insights, allowing for quicker decision-making in fast-paced industries.


About the Authors

Jason D’Alba is an AWS Solutions Architect leader focused on database and enterprise applications, helping customers architect highly available and scalable database solutions.

Veerendra Nayak is a Principal Database Solutions Architect based in the Bay Area, California. He works with customers to share best practices on database migrations, resiliency, and integrating operational data with analytics and AI services.

Evan Day is a Database Solutions Architect at AWS, where he helps customers define technical solutions for business problems using the breadth of managed database services on AWS. He also focuses on building solutions that are reliable, performant, and cost efficient.

Amazon Kinesis Data Streams: celebrating a decade of real-time data innovation

Post Syndicated from Roy Wang original https://aws.amazon.com/blogs/big-data/amazon-kinesis-data-streams-celebrating-a-decade-of-real-time-data-innovation/

Data is a key strategic asset for every organization, and every company is a data business at its core. However, in many organizations, data is typically spread across a number of different systems such as software as a service (SaaS) applications, operational databases, and data warehouses. Such data silos make it difficult to get unified views of the data in an organization and act in real time to derive the most value.

Ten years ago, we launched Amazon Kinesis Data Streams, the first cloud-native serverless streaming data service, to serve as the backbone for companies to move data across system boundaries and break down data silos. With data streaming, you can power data lakes running on Amazon Simple Storage Service (Amazon S3), enrich customer experiences via personalization, improve operational efficiency with predictive maintenance of machinery in your factories, and achieve better insights with more accurate machine learning (ML) models. Amazon Kinesis Data Streams is a foundational data strategy pillar for tens of thousands of customers. As streams of raw data come together, they unlock capabilities to continuously transform, enrich, and query data in real time via seamless integration with stream processing engines such as Amazon Managed Service for Apache Flink.

As an example, the National Hockey League (NHL) reimagined the fan experience by streaming live NHL EDGE game data and stats to offer hockey fans valuable insights to keep fans at the edge of their seats. NHL EDGE technology in the puck and players’ sweaters (jerseys) generate thousands of data points every second for the NHL, which can be analyzed by AWS to predict likely outcomes for key events like face-offs. To process and analyze thousands of signals, the NHL built a real-time streaming data foundation with Kinesis Data Streams and Amazon Managed Service for Apache Flink to stream, prepare, and feed data into ML models, helping inform face-off predictions in seconds and expanding new ways to engage viewers.

Building on such streaming data foundations, many customers are currently thinking about how to deliver transformative new products and services with generative AI. Streaming allows companies to connect the data available within data stores to large language models (LLMs) securely and in real time. Although LLMs are capable of working with billions of parameters, in order to deliver an engaging experience that is tailored to a company’s customers, LLMs require personalization data for the company’s users and proprietary knowledge stores within the company’s data stores. A data strategy that incorporates streaming is necessary to deliver personalization and proprietary data that is available for querying in real time.

Customers with real-time streaming data strategy are at the cutting edge of providing innovative products with generative AI. One customer adopted Kinesis Data Streams for their data strategy, and they stream billions of events from their digital products to derive real-time insights. With a combination of low-latency data streaming and analytics, they are able to understand and personalize the user experience via a seamlessly integrated, self-reliant system for experimentation and automated feedback. Earlier this year, building on their already strong data foundation, they launched an innovative digital media generative AI product. The same data foundation built on Kinesis Data Streams is used to continuously analyze how users interact with the generated content and helps the product team fine-tune the application.

“Real-time streaming data technologies are essential for digital transformation. These services help customers bring data to their applications and models, making them smarter. Real-time data gives companies an advantage in data-driven decisions, predictions, and insights by using the data at the very moment it is generated, providing an unparalleled edge in a world where timing is the key to success. Bring the data in once, use it across your organization, and act before the value of that data diminishes.”

– Mindy Ferguson, VP of AWS Streaming and Messaging.

As we celebrate the tenth anniversary of Kinesis Data Streams, customers have shared four key reasons they continue to value this revolutionary service. They love how they can easily stream data with no underlying servers to provision or manage, operate at a massive scale with consistent performance, achieve high resiliency and durability, and benefit from broad integration with myriad sources and sinks to ingest and process data respectively.

Ease of use

Getting started with Kinesis Data Streams is straightforward: developers can create a data stream with a few clicks on the Kinesis Data Streams console or with a single API call. Changing the size or configuration is also a single API call, and each data stream comes with a default 24-hour data retention period. Developers don’t have to worry about clusters, version upgrades, or storage capacity planning. They just turn on a data stream and start ingesting data.

The needs of our customers have evolved in the past 10 years. As more events get captured and streamed, customers want their data streams to scale elastically without any operational overhead. In response, we launched On-Demand streams in 2021 to provide a simple and automatic scaling experience. With On-Demand streams, the service proactively scales up a stream's capacity, and you're only charged for the actual data ingested, retrieved, and stored. As our customers continued to ask for more capabilities, we increased the ingestion throughput limit of each On-Demand stream from 200 MB/s to 1 GB/s in March 2023, and then to 2 GB/s in October 2023, to accommodate higher-throughput workloads. To keep making Kinesis Data Streams the easiest streaming data service to use, we actively listen to our customers' use cases.
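As a reference point, the single API call can look like the following with the AWS SDK for Python (Boto3); the stream name is a placeholder, and in on-demand mode there is no shard count to choose.

import boto3

kinesis = boto3.client('kinesis')

# Create an on-demand stream with one call (placeholder name); no shards to size up front
kinesis.create_stream(
    StreamName='transactions',
    StreamModeDetails={'StreamMode': 'ON_DEMAND'}
)

# Wait until the stream is active before producing to it
kinesis.get_waiter('stream_exists').wait(StreamName='transactions')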

Canva is an online design and visual communication platform. As it has rapidly grown from 30 million to 135 million monthly users, it has built a streaming data platform at scale that is effortless to operate for driving product innovations and personalizing the user experience.

“Amazon Kinesis Data Streams and AWS Lambda are used throughout Canva’s logging platform, ingesting and processing over 60 billion log events per day. The combination of Kinesis Data Streams and Lambda has abstracted plenty of work that’s often required in managing a massive data pipeline, such as deploying and managing a fleet of servers, whilst also providing a highly scalable and reliable service. It has allowed us to focus on delivering a world-class product by building highly requested features rather than spending time on operational work.”

– Phoebe Zhou, Software Engineer at Canva.

Operate at massive scale with consistent performance

A fundamental requirement of a streaming data strategy is ingesting and processing large volumes of data with low latency. Kinesis Data Streams processes trillions of records per day across tens of thousands of customers. Customers run more than 3.5 million unique streams and process over 45 PB of data per day. Our largest customers ingest more than 15 GB per second of real-time data with individual streams. That’s equivalent to streaming multiple data points for every person on earth, every second! Even at this scale, all our customers still retrieve data within milliseconds of availability.

Customers also want to process the same data with multiple applications, with each deriving a different value, without worrying about one application impacting the read throughput of another. Enhanced Fan-out offers dedicated read throughput and low latency for each data consumer. This has enabled enterprise platform teams to provide real-time data to more teams and applications.
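A consumer opts into enhanced fan-out by registering itself on the stream, after which it receives dedicated read throughput independent of other consumers. The following is a minimal sketch with placeholder stream and consumer names.

import boto3

kinesis = boto3.client('kinesis')

# Placeholder stream name; look up its ARN
stream_arn = kinesis.describe_stream(StreamName='transactions')['StreamDescription']['StreamARN']

# Each registered consumer gets its own dedicated read throughput on the stream
consumer = kinesis.register_stream_consumer(
    StreamARN=stream_arn,
    ConsumerName='analytics-app'
)
print(consumer['Consumer']['ConsumerARN'])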

VMware Carbon Black uses Kinesis Data Streams to ingest petabytes of data every day to secure millions of customer endpoints. The team focuses on its expertise while AWS manages data streaming to meet growing customer traffic and needs in real time.

“When an individual customer’s data increases or decreases, we can use the elasticity of Amazon Kinesis Data Streams to scale compute up or down to process data reliably while effectively managing our cost. This is why Kinesis Data Streams is a good fit. The biggest advantage is the managed nature of our solution on AWS. This has shaped our architecture and helped us shift complexity elsewhere.”

– Stoyan Dimkov, Staff Engineer and Software Architect at VMware Carbon Black.

Learn more about the case study.

Provide resiliency and durability for data streaming

With burgeoning data, customers want more flexibility in processing and reprocessing data. For example, if an application that is consuming data goes offline for a period, teams want to ensure that they resume processing at a later time without data loss. Kinesis Data Streams provides a default 24-hour retention period, enabling you to select a specific timestamp from which to start processing records. With the extended retention feature, you can configure the data retention period to be up to 7 days.

Some industries like financial services and healthcare have stricter compliance requirements, so customers asked for even longer data retention periods to support these requirements. Therefore, we followed up with long-term storage that supports data retention for up to 1 year. Now, thousands of Kinesis Data Streams customers use these features to make their streaming applications more resilient and durable.
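Extending retention is a single configuration call on an existing stream. The following sketch, with a placeholder stream name, raises retention to 7 days; the same API accepts values up to 8,760 hours (365 days) for long-term storage.

import boto3

kinesis = boto3.client('kinesis')

# Raise retention from the 24-hour default to 7 days (168 hours); placeholder stream name
kinesis.increase_stream_retention_period(
    StreamName='payments',
    RetentionPeriodHours=168
)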

Mercado Libre, a leading ecommerce and payments platform in Latin America, relies on Kinesis Data Streams to power its streaming data strategy around payment processing, customer experience, and operations.

“With Amazon Kinesis Data Streams at the core, we process approximately 70 billion daily messages distributed across thousands of data producers. By leveraging Kinesis Data Streams and Amazon DynamoDB Streams, we’ve embraced an event-driven architecture and are able to swiftly respond to data changes.”

– Joaquin Fernandez, Senior Software Expert at Mercado Libre.

Access your data no matter where it lives

Our customers use a wide variety of tools and applications, and an organization’s data often resides in many places. Therefore, the ability to easily integrate data across an organization is crucial to derive timely insights. Developers use the Kinesis Producer Library, Kinesis Client Library, and AWS SDK to quickly build custom data producer and data consumer applications. Customers have expanded their data producers ranging from microservices to smart TVs and even cars. We have over 40 integrations with AWS services and third-party applications like Adobe Experience Platform and Databricks. As detailed in our whitepaper on building a modern data streaming architecture on AWS, Kinesis Data Streams serves as the backbone to serverless and real-time use cases such as personalization, real-time insights, Internet of Things (IoT), and event-driven architecture. Our recent integration with Amazon Redshift enables you to ingest hundreds of megabytes of data from Kinesis Data Streams into data warehouses in seconds. To learn more about how to use this integration to detect fraud in near-real time, refer to Near-real-time fraud detection using Amazon Redshift Streaming Ingestion with Amazon Kinesis Data Streams and Amazon Redshift ML.

Another integration launched in 2023 is with Amazon Monitron to power predictive maintenance management. You can now stream measurement data and the corresponding inference results to Kinesis Data Streams, coordinate predictive maintenance, and build an IoT data lake. For more details, refer to Generate actionable insights for predictive maintenance management with Amazon Monitron and Amazon Kinesis.

Next, let’s go back to the NHL use case where they combine IoT, data streaming, and machine learning.

The NHL Edge IQ powered by AWS is helping bring fans closer to the action with advanced analytics and new ML stats such as Face-off Probability and Opportunity Analysis.

“We use Amazon Kinesis Data Streams to process NHL EDGE data on puck and Player positions, face-off location, and the current game situation to decouple data producers from consuming applications. Amazon Managed Service for Apache Flink is used to run Flink applications and consumes data from Kinesis Data Streams to call the prediction model in Amazon SageMaker to deliver the real-time Face-off Probability metric. The probability results are also stored in Amazon S3 to continuously retrain the model in SageMaker. The success of this project led us to build the next metric, Opportunity Analysis, which delivers over 25 insights into the quality of the scoring opportunity presented by each shot on goal. Kinesis Data Streams and Amazon Managed Service for Apache Flink applications were critical to making live, in-game predictions, enabling the system to perform opportunity analysis calculations for up to 16 live NHL games simultaneously.”

– Eric Schneider, SVP, Software Engineering at National Hockey League.

Learn more about the case study.

The future of data is real time

The fusion of real-time data streaming and generative AI promises to be the cornerstone of our digitally connected world. Generative AI, empowered by a constant influx of real-time information from IoT devices, sensors, social media, and beyond, is becoming ubiquitous. From autonomous vehicles navigating dynamically changing traffic conditions to smart cities optimizing energy consumption based on real-time demand, the combination of AI and real-time data will underpin efficiency and innovation across industries. Ubiquitous, adaptive, and deeply integrated into our lives, these AI-driven applications will enhance convenience and address critical challenges such as climate change, healthcare, and disaster response by using the wealth of real-time insights at their disposal. With Kinesis Data Streams, organizations can build a solid data foundation, positioning them to quickly adopt new technologies and unlock new opportunities sooner, which we anticipate will be enormous.

Learn more about what our customers are doing with data streaming. If you would like a quick exploration of Kinesis Data Streams concepts and use cases, check out our Amazon Kinesis Data Streams 101 playlist. To get started with building your data streams, visit the Amazon Kinesis Data Streams Developer Guide.


About the author

Roy (KDS) Wang is a Senior Product Manager with Amazon Kinesis Data Streams. He is passionate about learning from and collaborating with customers to help organizations run faster and smarter. Outside of work, Roy strives to be a good dad to his new son and builds plastic model kits.

Processing large records with Amazon Kinesis Data Streams

Post Syndicated from Masudur Rahaman Sayem original https://aws.amazon.com/blogs/big-data/processing-large-records-with-amazon-kinesis-data-streams/

In today’s digital era, data is abundant and constantly flowing. Businesses across industries are seeking ways to harness this wealth of information to gain valuable insights and make real-time decisions. To meet this need, AWS offers Amazon Kinesis Data Streams, a powerful and scalable real-time data streaming service. With Kinesis Data Streams, you can effortlessly collect, process, and analyze streaming data in real time at any scale. This service seamlessly integrates into your data architecture, allowing you to tap into the full potential of your data for informed decision-making.

Data streaming technologies like Kinesis Data Streams are designed to efficiently process and manage continuous streams of data in real time at large scale. The individual pieces of data within these streams are often referred to as records. In scenarios like large file processing or performing image, audio, or video analytics, your record may exceed 1 MB. You may struggle to ingest such a large record with Kinesis Data Streams because, as of this writing, the service has a 1 MB upper limit for maximum data record size.

In this post, we show you some different options for handling large records within Kinesis Data Streams and the benefits and disadvantages of each approach. We provide some sample code for each option to help you get started with any of these approaches with your own workloads.

Understanding the default behavior of Kinesis Data Streams

You can send records to Kinesis Data Streams using the PutRecord or PutRecords API calls. These APIs include a mandatory field known as PartitionKey, where you must provide a specific value. This partition key is used by the service to map records with the same partition keys to the same shard to ensure ordering and locality for consumption. Locality means that you want the same consumer to process all records for a given partition key. This helps ensure that data with the same partition key stays together within the same shard, maintaining data order.

Each shard, which holds your data, can handle writes of up to 1 MB per second. Consider a scenario where you define a partition key and attempt to send a data record that exceeds 1 MB in size. Based on the limit described earlier, the service rejects this request because the record size is over 1 MB. To demonstrate, we tried to send a 1.5 MB record to a stream using the following code, and received the exception message shown after it:

import json
import boto3
client = boto3.client('kinesis', region_name='ap-southeast-2')

def lambda_handler(event, context):
    try:
        response = client.put_record(
            StreamName='test',
            Data=b'Sample 1 MB....',  # placeholder for a payload of roughly 1.5 MB
            PartitionKey='string'
            #StreamARN='string'
        )

    except Exception as e:
        print (e)

START RequestId: 84b3ab0c-3f30-4267-aec1-549c2d59dfdb Version: $LATEST An error occurred (ValidationException) when calling the PutRecord operation: 1 validation error detected: Value at 'data' failed to satisfy constraint: Member must have length less than or equal to 1048576 END RequestId: 84b3ab0c-3f30-4267-aec1-549c2d59dfdb

Strategies for handling large records

Now that we understand the behavior of the PutRecord and PutRecords APIs, let’s discuss strategies you can use to overcome this situation. One thing to keep in mind is that there is no single best solution; in the following sections, we discuss some of the approaches that you can evaluate based on your use case:

  • Store large records in Amazon Simple Storage Service (Amazon S3) with a reference in Kinesis Data Streams
  • Split one large record into multiple records
  • Compress your large records

Let’s discuss these points one by one.

Store large records in Amazon S3 with a reference in Kinesis Data Streams

A useful approach for storing large records involves utilizing an alternative storage solution while employing a reference within Kinesis Data Streams. In this context, Amazon S3 stands out as an excellent choice due to its exceptional durability and cost-effectiveness. The procedure involves uploading the record as an object to an S3 bucket and subsequently writing a reference entry in Kinesis Data Streams. This entry incorporates an attribute that serves as a pointer, indicating the location of the object within Amazon S3.

With this approach, you can generate a pre-signed URL associated with the S3 object’s location. This link can be shared with the requester, offering them direct access to the object without the need for intermediary server-side data transfers.

The following diagram illustrates the architecture of this solution.

The following is the sample code to write data to Kinesis Data Streams using this approach:

import json
import boto3
import random

def lambda_handler(event, context):
    try:
        s3 = boto3.client('s3', region_name='ap-southeast-2')
        kds = boto3.client('kinesis', region_name='ap-southeast-2')
        expiration=3600
        pk=str(random.randint(100,100000000))
        bucket_name = 'MY_BUCKET'
        object_key = 'air/' + pk + '.txt'
        file_content = b'LARGE OBJECT'
        response = s3.put_object(Bucket=bucket_name, Key=object_key, Body=file_content)
        presigned_url = s3.generate_presigned_url(
            'get_object',
            Params={'Bucket': bucket_name, 'Key': object_key},
            ExpiresIn=expiration
        )
        
        kdata = {'message': presigned_url}
        response = kds.put_record(
            StreamName='test',
            Data=json.dumps(kdata),
            PartitionKey=pk
        )
        print (response)
    except Exception as e:
        print (e)

If you are using an AWS Lambda consumer to process this data, you can now decode the record to get the S3 pre-signed URL to efficiently retrieve the object from Amazon S3. Then you can implement your business logic to effectively process the data. The following is sample code for reference:

import base64
import json

def lambda_handler(event, context):
    item = None
    decoded_record_data = [base64.b64decode(record['kinesis']['data']).decode().replace('\n','') for record in event['Records']]
    deserialized_data = [json.loads(decoded_record) for decoded_record in decoded_record_data]
    
    
    for item in deserialized_data:
        LOB=(item['message'])
        #process LOB implementing your business logic

An inherent benefit of this technique is that Amazon S3 accommodates an extensive range of sizes per individual object. It also helps reduce the cost of using Kinesis Data Streams, because storing just the URL in the stream uses less storage space and requires less read and write throughput for item access. However, accessing the sizable object requires an additional call to Amazon S3, which introduces higher latency for clients as they manage the additional request.

Split one large record into multiple records

Splitting large records into smaller ones in Kinesis Data Streams brings advantages like faster processing, improved throughput, efficient resource use, and more straightforward error handling. Let’s say you have a large record that you want to split into smaller chunks before sending them to a Kinesis data stream. First, you need to set up a Kinesis producer. Suppose you have a large record as a string. You can split it into smaller chunks of a predefined size. For this example, let’s say you’re splitting the record into chunks of 100 characters each. After you split that, loop through the record chunks and send each chunk as a separate message to a Kinesis data stream. The following is the sample code:

import boto3
kinesis = boto3.client('kinesis', region_name='ap-southeast-2')  

def split_record(record, chunk_size):
    chunks = [record[i:i + chunk_size] for i in range(0, len(record), chunk_size)]
    return chunks

def send_to_kinesis(stream_name, record):
    response = kinesis.put_record(
        StreamName=stream_name,
        Data=record,
        PartitionKey= '100'
    )
    return response

def main():
    stream_name = 'test'  
    large_record = 'Your large record'  # Replace with your actual record
    chunk_size = 100  

    record_chunks = split_record(large_record, chunk_size)

    for chunk in record_chunks:
        response = send_to_kinesis(stream_name, chunk)
        print(f"Record sent: {response['SequenceNumber']}")

if __name__ == "__main__":
    main()

Ensure that all chunks of a given message are directed to a single partition, thereby guaranteeing the preservation of their order. When producing the final chunk, include metadata indicating that it is the last chunk of the message; this enables consumers to identify the final chunk and reassemble the message seamlessly. The drawback of this method is that it adds complexity to the client-side tasks of splitting and reassembling the record, so these functions need thorough testing to prevent any loss of data.
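The following sketch illustrates one way to do this; the chunk envelope format (record_id, sequence, is_last) is an illustrative assumption, not a Kinesis convention, and the producer reuses the put_record call shown above.

import json

def send_chunks(kinesis, stream_name, record_id, record, chunk_size=100):
    # Split the record and send each chunk with ordering metadata; the last chunk is flagged
    chunks = [record[i:i + chunk_size] for i in range(0, len(record), chunk_size)]
    for seq, chunk in enumerate(chunks):
        payload = {
            'record_id': record_id,               # groups chunks of the same logical record
            'sequence': seq,
            'is_last': seq == len(chunks) - 1,    # marks the final chunk
            'data': chunk
        }
        kinesis.put_record(
            StreamName=stream_name,
            Data=json.dumps(payload),
            PartitionKey=record_id                # same key keeps all chunks on one shard, in order
        )

def reassemble(buffer, payload):
    # Collect chunks per record_id; return the full record once the last chunk arrives
    parts = buffer.setdefault(payload['record_id'], {})
    parts[payload['sequence']] = payload['data']
    if payload['is_last']:
        return ''.join(parts[i] for i in sorted(parts))
    return None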

Compress your large records

Applying data compression prior to transmitting it to Kinesis Data Streams has numerous advantages. This approach not only reduces the data’s size, enabling swifter travel and more efficient utilization of network resources, but also leads to cost savings in terms of storage expenses while optimizing overall resource consumption. Additionally, this practice simplifies storage and data retention. By using compression algorithms such as GZIP, Snappy, or LZ4, you can achieve substantial reduction in the size of large records. Compression brings the benefit of simplicity because it’s implemented seamlessly without requiring the caller to make changes to the item or use extra AWS services to support storage. However, compression introduces additional CPU overhead and latency on the producer side, and its impact on the compression ratio and efficiency can vary depending on the data type and format. Also, compression can enhance consumer throughput at the expense of some decompression overhead.
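The following is a minimal sketch of this approach using GZIP: the producer compresses the serialized payload before calling PutRecord, and a consumer (a Lambda function in this sketch) reverses the steps. The stream name and payload shape are placeholders.

import base64
import gzip
import json
import boto3

kinesis = boto3.client('kinesis', region_name='ap-southeast-2')

def put_compressed(stream_name, partition_key, payload):
    # Compress the serialized payload; the compressed record must still be under 1 MB
    compressed = gzip.compress(json.dumps(payload).encode('utf-8'))
    return kinesis.put_record(
        StreamName=stream_name,
        Data=compressed,
        PartitionKey=partition_key
    )

def read_compressed(kinesis_record):
    # Lambda delivers Kinesis data base64-encoded: decode, decompress, then deserialize
    raw = base64.b64decode(kinesis_record['kinesis']['data'])
    return json.loads(gzip.decompress(raw).decode('utf-8'))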

Conclusion

For real-time data streaming use cases, it’s essential to carefully consider the handling of large records when using Kinesis Data Streams. In this post, we discussed the challenges associated with managing large records and explored strategies such as utilizing Amazon S3 references, record splitting, and compression. Each approach has its own set of benefits and drawbacks, so it’s crucial to evaluate the nature of your data and the tasks you need to perform. Select the most suitable approach based on your data’s characteristics and your processing task requirements.

We encourage you to try out the approaches discussed in this post and share your thoughts in the comments section.


About the author

Masudur Rahaman Sayem is a Streaming Data Architect at AWS. He works with AWS customers globally to design and build data streaming architectures to solve real-world business problems. He specializes in optimizing solutions that use streaming data services and NoSQL. Sayem is very passionate about distributed computing.

Modernize a legacy real-time analytics application with Amazon Managed Service for Apache Flink

Post Syndicated from Bhupesh Sharma original https://aws.amazon.com/blogs/big-data/modernize-a-legacy-real-time-analytics-application-with-amazon-managed-service-for-apache-flink/

Organizations with legacy, on-premises, near-real-time analytics solutions typically rely on self-managed relational databases as their data store for analytics workloads. To reap the benefits of cloud computing, like increased agility and just-in-time provisioning of resources, organizations are migrating their legacy analytics applications to AWS. The lift-and-shift migration approach is limited in its ability to transform businesses because it relies on outdated, legacy technologies and architectures that limit flexibility and slow down productivity. In this post, we discuss ways to modernize your legacy, on-premises, real-time analytics architecture to build serverless data analytics solutions on AWS using Amazon Managed Service for Apache Flink.

Near-real-time streaming analytics captures the value of operational data and metrics to provide new insights to create business opportunities. In this post, we discuss challenges with relational databases when used for real-time analytics and ways to mitigate them by modernizing the architecture with serverless AWS solutions. We introduce you to Amazon Managed Service for Apache Flink Studio and get started querying streaming data interactively using Amazon Kinesis Data Streams.

Solution overview

In this post, we walk through a call center analytics solution that provides insights into the call center’s performance in near-real time through metrics that determine agent efficiency in handling calls in the queue. Key performance indicators (KPIs) of interest for a call center from a near-real-time platform could be calls waiting in the queue, highlighted in a performance dashboard within a few seconds of data ingestion from call center streams. These metrics help agents improve their call handle time and also reallocate agents across organizations to handle pending calls in the queue.

Traditionally, such a legacy call center analytics platform would be built on a relational database that stores data from streaming sources. Data transformations through stored procedures and use of materialized views to curate datasets and generate insights is a known pattern with relational databases. However, as data loses its relevance with time, transformations in a near-real-time analytics platform need only the latest data from the streams to generate insights. This may require frequent truncation in certain tables to retain only the latest stream of events. Also, the need to derive near-real-time insights within seconds requires frequent materialized view refreshes in this traditional relational database approach. Frequent materialized view refreshes on top of constantly changing base tables due to streamed data can lead to snapshot isolation errors. Also, a data model that allows table truncations at a regular frequency (for example, every 15 seconds) to store only relevant data in tables can cause locking and performance issues.

The following diagram provides the high-level architecture of a legacy call center analytics platform. In this traditional architecture, a relational database is used to store data from streaming data sources. Datasets used for generating insights are curated using materialized views inside the database and published for business intelligence (BI) reporting.

Modernizing this traditional database-driven architecture in the AWS Cloud allows you to use sophisticated streaming technologies like Amazon Managed Service for Apache Flink, which are built to transform and analyze streaming data in real time. With Amazon Managed Service for Apache Flink, you can gain actionable insights from streaming data with serverless, fully managed Apache Flink. You can use Amazon Managed Service for Apache Flink to quickly build end-to-end stream processing applications and process data continuously, getting insights in seconds or minutes. With Amazon Managed Service for Apache Flink, you can use Apache Flink code or Flink SQL to continuously generate time-series analytics over time windows and perform sophisticated joins across streams.

The following architecture diagram illustrates how a legacy call center analytics platform running on databases can be modernized to run on the AWS Cloud using Amazon Managed Service for Apache Flink. It shows a call center streaming data source that sends the latest call center feed every 15 seconds. The second streaming data source constitutes metadata information about the call center organization and agents that gets refreshed throughout the day. You can perform sophisticated joins over these streaming datasets and create views on top of them using Amazon Managed Service for Apache Flink, then publish the KPIs required by the business to Amazon OpenSearch Service. You can analyze streaming data interactively in near-real time using managed Apache Zeppelin notebooks with Amazon Managed Service for Apache Flink Studio. The near-real-time insights can then be visualized as a performance dashboard using OpenSearch Dashboards.

In this post, you perform the following high-level implementation steps:

  1. Ingest data from streaming data sources to Kinesis Data Streams.
  2. Use managed Apache Zeppelin notebooks with Amazon Managed Service for Apache Flink Studio to transform the stream data within seconds of data ingestion.
  3. Visualize KPIs of call center performance in near-real time through OpenSearch Dashboards.

Prerequisites

This post requires you to set up the Amazon Kinesis Data Generator (KDG) to send data to a Kinesis data stream using an AWS CloudFormation template. For the template and setup information, refer to Test Your Streaming Data Solution with the New Amazon Kinesis Data Generator.

We use two datasets in this post. The first dataset is fact data, which contains call center organization data. The KDG generates a fact data feed every 15 seconds that contains the following information:

  • AgentId – Agents work in a call center setting surrounded by other call center employees answering customers’ questions and referring them to the necessary resources to solve their problems.
  • OrgId – A call center contains different organizations and departments, such as Care Hub, IT Hub, Allied Hub, Prem Hub, Help Hub, and more.
  • QueueId – Call queues provide an effective way to route calls using simple or sophisticated patterns to ensure that all calls are getting into the correct hands quickly.
  • WorkMode – An agent work mode determines your current state and availability to receive incoming calls from the Automatic Call Distribution (ACD) and Direct Agent Call (DAC) queues. Call Center Elite does not route ACD and DAC calls to your phone when you are in an Aux mode or ACW mode.
  • WorkSkill – Working as a call center agent requires several soft skills to see the best results, like problem-solving, bilingualism, channel experience, aptitude with data, and more.
  • HandleTime – This customer service metric measures the length of a customer’s call.
  • ServiceLevel – The call center service level is defined as the percentage of calls answered within a predefined amount of time—the target time threshold. It can be measured over any period of time (such as 30 minutes, 1 hour, 1 day, or 1 week) and for each agent, team, department, or the company as a whole.
  • WorkStates – This specifies what state an agent is in. For example, an agent in an available state is available to handle calls from an ACD queue. An agent can have several states with respect to different ACD devices, or they can use a single state to describe their relationship to all ACD devices. Agent states are reported in agent-state events.
  • WaitingTime – This is the average time an inbound call spends waiting in the queue or waiting for a callback if that feature is active in your IVR system.
  • EventTime – This is the time when the call center stream is sent (via the KDG in this post).

The following fact payload is used in the KDG to generate sample fact data:

{
"AgentId" : {{random.number(
{
"min":2001,
"max":2005
}
)}},
"OrgID" : {{random.number(
{
"min":101,
"max":105
}
)}},
"QueueId" : {{random.number(
{
"min":1,
"max":5
}
)}},
"WorkMode" : "{{random.arrayElement(
["TACW","ACW","AUX"]
)}}",
"WorkSkill": "{{random.arrayElement(
["Problem Solving","Bilingualism","Channel experience","Aptitude with data"]
)}}",
"HandleTime": {{random.number(
{
"min":1,
"max":150
}
)}},
"ServiceLevel":"{{random.arrayElement(
["Sev1","Sev2","Sev3"]
)}}",
"WorkSates": "{{random.arrayElement(
["Unavailable","Available","On a call"]
)}}",
"WaitingTime": {{random.number(
{
"min":10,
"max":150
}
)}},
"EventTime":"{{date.utc("YYYY-MM-DDTHH:mm:ss")}}"
}

The following screenshot shows the output of the sample fact data in an Amazon Managed Service for Apache Flink notebook.

The second dataset is dimension data. This data contains metadata information like organization names for their respective organization IDs, agent names, and more. The dimension dataset is refreshed twice a day, whereas the fact dataset gets loaded every 15 seconds. In this post, we use Amazon Simple Storage Service (Amazon S3) as a data storage layer to store metadata information (Amazon DynamoDB can be used to store metadata information as well). We use AWS Lambda to load metadata from Amazon S3 to another Kinesis data stream that stores metadata information. The following JSON file stored in Amazon S3 has metadata mappings to be loaded into the Kinesis data stream:

[{"OrgID": 101,"OrgName" : "Care Hub","Region" : "APAC"},
{"OrgID" : 102,"OrgName" : "Prem Hub","Region" : "AMER"},
{"OrgID" : 103,"OrgName" : "IT Hub","Region" : "EMEA"},
{"OrgID" : 104,"OrgName" : "Help Hub","Region" : "EMEA"},
{"OrgID" : 105,"OrgName" : "Allied Hub","Region" : "LATAM"}]

Ingest data from streaming data sources to Kinesis Data Streams

To start ingesting your data, complete the following steps:

  1. Create two Kinesis data streams for the fact and dimension datasets, as shown in the following screenshot. For instructions, refer to Creating a Stream via the AWS Management Console.

  1. Create a Lambda function on the Lambda console to load metadata files from Amazon S3 to Kinesis Data Streams. Use the following code:
    import boto3
    import json

    # Create S3 and Kinesis clients
    s3_client = boto3.client("s3")
    S3_BUCKET = '<S3 Bucket Name>'
    kinesis_client = boto3.client("kinesis")
    stream_name = '<Kinesis Stream Name>'

    def lambda_handler(event, context):

      # Read the metadata file from Amazon S3
      object_key = "<S3 File Name.json>"
      file_content = s3_client.get_object(
          Bucket=S3_BUCKET, Key=object_key)["Body"].read()

      # Decode the S3 object and parse the JSON array of metadata records
      decoded_data = file_content.decode("utf-8").replace("'", '"')
      json_data = json.loads(decoded_data)

      # Upload each metadata record to the Kinesis data stream,
      # using the organization ID as the partition key
      for record in json_data:
        response = kinesis_client.put_record(
          StreamName=stream_name,
          Data=json.dumps(record),
          PartitionKey=str(record['OrgID']))

Use managed Apache Zeppelin notebooks with Amazon Managed Service for Apache Flink Studio to transform the streaming data

The next step is to create tables in Amazon Managed Service for Apache Flink Studio for further transformations (joins, aggregations, and so on). To set up and query Kinesis Data Streams using Amazon Managed Service for Apache Flink Studio, refer to Query your data streams interactively using Amazon Managed Service for Apache Flink Studio and Python and create an Amazon Managed Service for Apache Flink notebook. Then complete the following steps:

  1. In the Amazon Managed Service for Apache Flink Studio notebook, create a fact table from the facts data stream you created earlier, using the following query.

The event time attribute is defined using a WATERMARK statement in the CREATE table DDL. A WATERMARK statement defines a watermark generation expression on an existing event time field, which marks the event time field as the event time attribute.

The event time refers to the processing of streaming data based on timestamps that are attached to each row. The timestamps can encode when an event happened. Processing time (PROCTime) refers to the machine’s system time that is running the respective operation.

%flink.ssql
CREATE TABLE <Fact Table Name> (
AgentId INT,
OrgID INT,
QueueId BIGINT,
WorkMode VARCHAR,
WorkSkill VARCHAR,
HandleTime INT,
ServiceLevel VARCHAR,
WorkStates VARCHAR,
WaitingTime INT,
EventTime TIMESTAMP(3),
WATERMARK FOR EventTime AS EventTime - INTERVAL '4' SECOND
)
WITH (
'connector' = 'kinesis',
'stream' = '<fact stream name>',
'aws.region' = '<AWS region ex. us-east-1>',
'scan.stream.initpos' = 'LATEST',
'format' = 'json',
'json.timestamp-format.standard' = 'ISO-8601'
);

  1. Create a dimension table in the Amazon Managed Service for Apache Flink Studio notebook that uses the metadata Kinesis data stream:
    %flink.ssql
    CREATE TABLE <Metadata Table Name> (
    AgentId INT,
    AgentName VARCHAR,
    OrgID INT,
    OrgName VARCHAR,
    update_time as CURRENT_TIMESTAMP,
    WATERMARK FOR update_time AS update_time
    )
    WITH (
    'connector' = 'kinesis',
    'stream' = '<Metadata Stream Name>',
    'aws.region' = '<AWS region ex. us-east-1> ',
    'scan.stream.initpos' = 'LATEST',
    'format' = 'json',
    'json.timestamp-format.standard' = 'ISO-8601'
    );

  1. Create a versioned view to extract the latest version of metadata table values to be joined with the facts table:
    %flink.ssql(type=update)
    CREATE VIEW versioned_metadata AS 
    SELECT OrgID,OrgName
      FROM (
          SELECT *,
          ROW_NUMBER() OVER (PARTITION BY OrgID
             ORDER BY update_time DESC) AS rownum 
          FROM <Metadata Table>)
    WHERE rownum = 1;

  1. Join the facts and versioned metadata tables on OrgID and create a view that provides the total calls in the queue for each organization over 5-second intervals, for further reporting. Use a tumble window of 5 seconds so that the final output is emitted every 5 seconds. See the following code:
    %flink.ssql(type=update)
    
    CREATE VIEW joined_view AS
    SELECT streamtable.window_start, streamtable.window_end,metadata.OrgName,streamtable.CallsInQueue
    FROM
    (SELECT window_start, window_end, OrgID, count(QueueId) as CallsInQueue
      FROM TABLE(
        TUMBLE( TABLE <Fact Table Name>, DESCRIPTOR(EventTime), INTERVAL '5' SECOND))
      GROUP BY window_start, window_end, OrgID) as streamtable
    JOIN
        <Metadata table name> metadata
        ON metadata.OrgID = streamtable.OrgID

  1. Now you can run the following query from the view you created and see the results in the notebook:
%flink.ssql(type=update)

SELECT jv.window_end, jv.CallsInQueue, jv.window_start, jv.OrgName
FROM joined_view jv WHERE jv.OrgName = 'Prem Hub';

Visualize KPIs of call center performance in near-real time through OpenSearch Dashboards

You can publish the metrics generated within Amazon Managed Service for Apache Flink Studio to OpenSearch Service and visualize metrics in near-real time by creating a call center performance dashboard, as shown in the following example. Refer to Stream the data and validate output to configure OpenSearch Dashboards with Amazon Managed Service for Apache Flink. After you configure the connector, you can run the following command from the notebook to create an index in an OpenSearch Service cluster.

%flink.ssql(type=update)

drop table if exists active_call_queue;
CREATE TABLE active_call_queue (
window_start TIMESTAMP,
window_end TIMESTAMP,
OrgID int,
OrgName varchar,
CallsInQueue BIGINT,
Agent_cnt bigint,
max_handle_time bigint,
min_handle_time bigint,
max_wait_time bigint,
min_wait_time bigint
) WITH (
'connector' = 'elasticsearch-7',
'hosts' = '<Amazon OpenSearch host name>',
'index' = 'active_call_queue',
'username' = '<username>',
'password' = '<password>'
);

The active_call_queue index is created in OpenSearch Service. The following screenshot shows an index pattern from OpenSearch Dashboards.

Now you can create visualizations in OpenSearch Dashboards. The following screenshot shows an example.

Conclusion

In this post, we discussed ways to modernize a legacy, on-premises, real-time analytics architecture and build a serverless data analytics solution on AWS using Amazon Managed Service for Apache Flink. We also discussed challenges with relational databases when used for real-time analytics and ways to mitigate them by modernizing the architecture with serverless AWS solutions.

If you have any questions or suggestions, please leave us a comment.


About the Authors

Bhupesh Sharma is a Senior Data Engineer with AWS. His role is helping customers architect highly available, high-performance, and cost-effective data analytics solutions that empower them with data-driven decision-making. In his free time, he enjoys playing musical instruments, road biking, and swimming.

Devika Singh is a Senior Data Engineer at Amazon, with a deep understanding of AWS services, architecture, and cloud-based best practices. With her expertise in data analytics and AWS cloud migrations, she provides technical guidance to the community on architecting and building cost-effective, secure, and scalable solutions driven by business needs. Beyond her professional pursuits, she is passionate about classical music and travel.

Perform Amazon Kinesis load testing with Locust

Post Syndicated from Luis Morales original https://aws.amazon.com/blogs/big-data/perform-amazon-kinesis-load-testing-with-locust/

Building a streaming data solution requires thorough testing at the scale it will operate at in a production environment. Streaming applications operating at scale often handle large data volumes, up to gigabytes per second, and it's challenging for developers to easily generate such load when simulating high-traffic Amazon Kinesis-based applications.

Amazon Kinesis Data Streams and Amazon Kinesis Data Firehose are capable of capturing and storing terabytes of data per hour from numerous sources. Creating Kinesis data streams or Firehose delivery streams is straightforward through the AWS Management Console, AWS Command Line Interface (AWS CLI), or Kinesis API. However, generating a continuous stream of test data requires a custom process or script to run continuously. Although the Amazon Kinesis Data Generator (KDG) provides a user-friendly UI for this purpose, it has some limitations, such as bandwidth constraints and increased round trip latency. (For more information on the KDG, refer to Test Your Streaming Data Solution with the New Amazon Kinesis Data Generator.)

To overcome these limitations, this post describes how to use Locust, a modern load testing framework, to conduct large-scale load testing for a more comprehensive evaluation of the streaming data solution.

Overview

This project emits temperature sensor readings via Locust to Kinesis. We set up the Amazon Elastic Compute Cloud (Amazon EC2) Locust instance via the AWS Cloud Development Kit (AWS CDK) to load test Kinesis-based applications. You can access the Locust dashboard to perform and observe the load test and connect via Session Manager, a capability of AWS Systems Manager, for configuration changes. The following diagram illustrates this architecture.

Architecture overview

In our testing with the largest recommended instance (c7g.16xlarge), the setup was capable of emitting over 1 million events per second to Kinesis data streams in on-demand capacity mode, with a batch size (simulated users per Locust user) of 500. You can find more details on what this means and how to configure the load test later in this post.

Locust overview

Locust is an open-source, scriptable, and scalable performance testing tool that allows you to define user behavior using Python code. It offers an easy-to-use interface, making it developer-friendly and highly expandable. With its distributed and scalable design, Locust can simulate millions of simultaneous users to mimic real user behavior during a performance test.

Each Locust user represents a scenario or a specific set of actions that a real user might perform on your system. When you run a performance test with Locust, you can specify the number of concurrent Locust users you want to simulate, and Locust will create an instance for each user, allowing you to assess the performance and behavior of your system under different user loads.

For more information on Locust, refer to the Locust documentation.

Prerequisites

To get started, clone or download the code from the GitHub repository.

Test locally

To test Locust out locally first before deploying it to the cloud, you have to install the necessary Python dependencies. If you're new to Python, refer to the README for more information on getting started.

Navigate to the load-test directory and run the following code:

pip install -r requirements.txt

To send events to a Kinesis data stream from your local machine, you will need to have AWS credentials. For more information, refer to Configuration and credential file settings.

To perform the test locally, stay in the load-test directory and run the following code:

locust -f locust-load-test.py

You can now access the Locust dashboard via http://0.0.0.0:8089/. Enter the number of Locust users, the spawn rate (users added per second), and the target Amazon Kinesis data stream name for Host. By default, it deploys the Kinesis data stream DemoStream that you can use for testing.

Locust Dashboard - Enter details

To see the generated events logged, run the following command, which filters only Locust and root logs (for example, no Botocore logs):

locust -f locust-load-test.py --loglevel DEBUG 2>&1 | grep -E "(locust|root)"
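
Before running the repository's script, it can help to see how such a load test is wired together. The following is a minimal sketch of a Locust user that publishes one event per task to Kinesis; the class name, payload, and event reporting shown here are illustrative and not the actual contents of locust-load-test.py:

import json
import time

import boto3
from locust import User, constant, task


class KinesisSensorUser(User):
    # Hypothetical Locust user: each task sends one sensor reading to Kinesis
    wait_time = constant(1)

    def on_start(self):
        # The Host field entered in the Locust UI is used as the stream name
        self.stream_name = self.host or "DemoStream"
        self.kinesis = boto3.client("kinesis")

    @task
    def send_reading(self):
        payload = {
            "sensorId": "sensor-1",
            "temperature": 42.0,
            "status": "OK",
            "timestamp": round(time.time() * 1000),
        }
        start = time.time()
        self.kinesis.put_record(
            StreamName=self.stream_name,
            Data=json.dumps(payload),
            PartitionKey=payload["sensorId"],
        )
        # Report the call to Locust so it appears in the dashboard statistics
        self.environment.events.request.fire(
            request_type="kinesis",
            name="put_record",
            response_time=(time.time() - start) * 1000,
            response_length=0,
            exception=None,
        )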

Set up resources with the AWS CDK

The GitHub repository contains the AWS CDK code to create all the necessary resources for the load test. This removes opportunities for manual error, increases efficiency, and ensures consistent configurations over time. To deploy the resources, complete the following steps:

  1. If not already downloaded, clone the GitHub repository to your local computer using the following command:
git clone https://github.com/aws-samples/amazon-kinesis-load-testing-with-locust
  2. Download and install the latest Node.js.
  3. Navigate to the root folder of the project and run the following command to install the latest version of AWS CDK:
npm install -g aws-cdk
  4. Install the necessary dependencies:
npm install
  5. Run cdk bootstrap to initialize the AWS CDK environment in your AWS account. Replace your AWS account ID and Region before running the following command:
cdk bootstrap

To learn more about the bootstrapping process, refer to Bootstrapping.

  6. After the dependencies are installed, you can run the following command to deploy the stack of the AWS CDK template, which sets up the infrastructure within 5 minutes:
cdk deploy

The template sets up the Locust EC2 test instance, by default a c7g.xlarge, which at the time of publishing costs approximately $0.145 per hour in us-east-1. To find the most accurate pricing information, see Amazon EC2 On-Demand Pricing. You can find more details on how to change your instance size according to your scale of load testing later in this post.

It's important to consider that the expenses incurred during load testing are not solely EC2 instance costs; they are also heavily influenced by data transfer costs.

Accessing the Locust dashboard

You can access the dashboard by using the AWS CDK output KinesisLocustLoadTestingStack.locustdashboardurl, for example http://1.2.3.4:8089.

The Locust dashboard is password protected. By default, it’s set to user name locust-user and password locust-dashboard-pwd.

With the default configuration, you can achieve up to 15,000 emitted events per second. Enter the number of Locust users (times the batch size), the spawn rate (users added per second), and the target Kinesis data stream name for Host.

Locust Dashboard - Enter details

After you have started the load test, you can look at the load test on the Charts tab.

Locust Dashboard - Charts

You can also monitor the load test on the Kinesis Data Streams console by navigating to the stream that you are load testing. If you used the default settings, navigate to DemoStream. On the detail page, choose the Monitoring tab to see the ingested load.

Kinesis Data Streams - Monitoring

Adapt workloads

By default, this project generates random temperature sensor readings for every sensor with the following format:

{
    "sensorId": "bfbae19c-2f0f-41c2-952b-5d5bc6e001f1_1",
    "temperature": 147.24,
    "status": "OK",
    "timestamp": 1675686126310
}

The project comes packaged with Faker, which you can use to adapt the payload to your needs. You just have to update the generate_sensor_reading function in the locust-load-test.py file:

import random
import time


class SensorAPIUser(KinesisBotoUser):
    # ...

    def generate_sensor_reading(self, sensor_id, sensor_reading):
        # Random temperature between 10 and 180 degrees
        current_temperature = round(10 + random.random() * 170, 2)

        # Readings above 140, plus a random share of others, get WARNING or ERROR status
        if current_temperature > 160:
            status = "ERROR"
        elif current_temperature > 140 or random.randrange(1, 100) > 80:
            status = random.choice(["WARNING", "ERROR"])
        else:
            status = "OK"

        return {
            'sensorId': f"{sensor_id}_{sensor_reading}",
            'temperature': current_temperature,
            'status': status,
            'timestamp': round(time.time()*1000)
        }

    # ...
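
For example, if you wanted clickstream-style events instead of temperature readings, a Faker-based variant of the method could look like the following; the field names and Faker providers shown here are just one possible choice, not part of the project:

import time

from faker import Faker

fake = Faker()


class SensorAPIUser(KinesisBotoUser):
    # ...

    def generate_sensor_reading(self, sensor_id, sensor_reading):
        # Same signature as before, but emits a clickstream-style payload
        return {
            'eventId': fake.uuid4(),
            'userName': fake.user_name(),
            'page': fake.uri_path(),
            'userAgent': fake.user_agent(),
            'country': fake.country_code(),
            'timestamp': round(time.time() * 1000)
        }

    # ...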

Change configurations

After the initial deployment of the load testing tool, you can change configuration in two ways:

  1. Connect to the EC2 instance, make any configuration and code changes, and restart the Locust process
  2. Change the configuration and load testing code locally and redeploy it via cdk deploy

The first option helps you iterate more quickly on the remote instance without a need to redeploy. The latter uses the infrastructure as code (IaC) approach and makes sure that your configuration changes can be committed to your source control system. For a fast development cycle, it’s recommended to test your load test configuration locally first, connect to your instance to apply the changes, and after successful implementation, codify it as part of your IaC repository and then redeploy.

Locust is created on the EC2 instance as a systemd service and can therefore be controlled with systemctl. If you want to change the configuration of Locust as needed without redeploying the stack, you can connect to the instance via Systems Manager, navigate to the project directory at /usr/local/load-test, change the locust.env file, and restart the service by running sudo systemctl restart locust.

Large-scale load testing

This setup is capable of emitting over 1 million events per second to a Kinesis data stream, with a batch size of 500 and 64 secondaries on a c7g.16xlarge instance.

To achieve peak performance with Locust and Kinesis, keep the following in mind:

  • Instance size – Your performance is bound by the underlying EC2 instance, so refer to EC2 instance type for more information about scaling. To set the correct instance size, you can configure the instance size in the file kinesis-locust-load-testing.ts.
  • Number of secondaries – Locust benefits from a distributed setup. Therefore, the setup spins up a primary, which does the coordination, and multiple secondaries, which do the actual work. To fully take advantage of the cores, you should specify one secondary per core. You can configure the number in the locust.env file.
  • Batch size – The number of Kinesis data stream events you can send per Locust user is limited due to the resource overhead of switching Locust users and threads. To overcome this, you can configure a batch size to define how many simulated users are sent per Locust user. These are batched into a single Kinesis Data Streams put_records call (see the sketch after this list). You can configure the number in the locust.env file.
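
The following sketch illustrates that batching pattern; the stream name, payload, and error handling are illustrative and may differ from the repository's implementation:

import json
import time

import boto3

kinesis = boto3.client("kinesis")
BATCH_SIZE = 500  # simulated users per Locust user, as configured in locust.env


def send_batch(stream_name, sensor_id):
    # Build BATCH_SIZE records and ship them in a single put_records call
    records = [
        {
            "Data": json.dumps({
                "sensorId": f"{sensor_id}_{i}",
                "temperature": 20.0,
                "status": "OK",
                "timestamp": round(time.time() * 1000),
            }),
            "PartitionKey": f"{sensor_id}_{i}",
        }
        for i in range(BATCH_SIZE)
    ]
    response = kinesis.put_records(StreamName=stream_name, Records=records)
    # Throttled records are reported here and would need to be retried
    if response["FailedRecordCount"]:
        print(f"{response['FailedRecordCount']} records failed")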

With this configuration, the setup sustains over 1 million events per second to the Kinesis data stream, as shown in the following Locust dashboard charts.

Locust Dashboard - Large Scale Load Test Charts

You can observe this on the Monitoring tab for the Kinesis data stream as well.

Kinesis Data Stream - Large Scale Load Test Monitoring

Clean up

In order to not incur any unnecessary costs, delete the stack by running the following code:

cdk destroy

Summary

Kinesis is already popular for its ease of use among users building streaming applications. With this load testing capability using Locust, you can now test your workloads in a more straightforward and faster way. Visit the GitHub repo to embark on your testing journey.

The project is licensed under the Apache 2.0 license, providing the freedom to clone and modify it according to your needs. Furthermore, you can contribute to the project by submitting issues or pull requests via GitHub, fostering collaboration and improvement in the testing ecosystem.


About the author

Luis Morales works as a Senior Solutions Architect with digital native businesses to support them in constantly reinventing themselves in the cloud. He is passionate about software engineering, cloud-native distributed systems, test-driven development, and all things code and security.

Create an Apache Hudi-based near-real-time transactional data lake using AWS DMS, Amazon Kinesis, AWS Glue streaming ETL, and data visualization using Amazon QuickSight

Post Syndicated from Raj Ramasubbu original https://aws.amazon.com/blogs/big-data/create-an-apache-hudi-based-near-real-time-transactional-data-lake-using-aws-dms-amazon-kinesis-aws-glue-streaming-etl-and-data-visualization-using-amazon-quicksight/

With the rapid growth of technology, more and more data volume is coming in many different formats—structured, semi-structured, and unstructured. Data analytics on operational data at near-real time is becoming a common need. Due to the exponential growth of data volume, it has become common practice to replace read replicas with data lakes to have better scalability and performance. In most real-world use cases, it’s important to replicate the data from the relational database source to the target in real time. Change data capture (CDC) is one of the most common design patterns to capture the changes made in the source database and reflect them to other data stores.

We recently announced support for streaming extract, transform, and load (ETL) jobs in AWS Glue version 4.0, a new version of AWS Glue that accelerates data integration workloads in AWS. AWS Glue streaming ETL jobs continuously consume data from streaming sources, clean and transform the data in-flight, and make it available for analysis in seconds. AWS also offers a broad selection of services to support your needs. A database replication service such as AWS Database Migration Service (AWS DMS) can replicate the data from your source systems to Amazon Simple Storage Service (Amazon S3), which commonly hosts the storage layer of the data lake. Although it’s straightforward to apply updates on a relational database management system (RDBMS) that backs an online source application, it’s difficult to apply this CDC process on your data lakes. Apache Hudi, an open-source data management framework used to simplify incremental data processing and data pipeline development, is a good option to solve this problem.

This post demonstrates how to apply CDC changes from Amazon Relational Database Service (Amazon RDS) or other relational databases to an S3 data lake, with flexibility to denormalize, transform, and enrich the data in near-real time.

Solution overview

We use an AWS DMS task to capture near-real-time changes in the source RDS instance, and use Amazon Kinesis Data Streams as a destination of the AWS DMS task CDC replication. An AWS Glue streaming job reads and enriches changed records from Kinesis Data Streams and performs an upsert into the S3 data lake in Apache Hudi format. Then we can query the data with Amazon Athena and visualize it in Amazon QuickSight. AWS Glue natively supports continuous write operations for streaming data to Apache Hudi-based tables.

The following diagram illustrates the architecture used for this post, which is deployed through an AWS CloudFormation template.

Prerequisites

Before you get started, make sure you have the following prerequisites:

Source data overview

To illustrate our use case, we assume a data analyst persona who is interested in analyzing near-real-time data for sport events using the table ticket_activity. An example of this table is shown in the following screenshot.

Apache Hudi connector for AWS Glue

For this post, we use AWS Glue 4.0, which already has native support for the Hudi framework. Hudi, an open-source data lake framework, simplifies incremental data processing in data lakes built on Amazon S3. It enables capabilities including time travel queries, ACID (Atomicity, Consistency, Isolation, Durability) transactions, streaming ingestion, CDC, upserts, and deletes.

Set up resources with AWS CloudFormation

This post includes a CloudFormation template for a quick setup. You can review and customize it to suit your needs.

The CloudFormation template generates the following resources:

  • An RDS database instance (source).
  • An AWS DMS replication instance, used to replicate the data from the source table to Kinesis Data Streams.
  • A Kinesis data stream.
  • Four AWS Glue Python shell jobs:
    • rds-ingest-rds-setup-<CloudFormation Stack name> – Creates one source table called ticket_activity on Amazon RDS.
    • rds-ingest-data-initial-<CloudFormation Stack name> – Sample data is automatically generated at random by the Faker library and loaded to the ticket_activity table.
    • rds-ingest-data-incremental-<CloudFormation Stack name> – Ingests new ticket activity data into the source table ticket_activity continuously. This job simulates customer activity.
    • rds-upsert-data-<CloudFormation Stack name> – Upserts specific records in the source table ticket_activity. This job simulates administrator activity.
  • AWS Identity and Access Management (IAM) users and policies.
  • An Amazon VPC, a public subnet, two private subnets, internet gateway, NAT gateway, and route tables.
    • We use private subnets for the RDS database instance and AWS DMS replication instance.
    • We use the NAT gateway to have reachability to pypi.org to use the MySQL connector for Python from the AWS Glue Python shell jobs. It also provides reachability to Kinesis Data Streams and an Amazon S3 API endpoint.

To set up these resources, you must have the following prerequisites:

The following diagram illustrates the architecture of our provisioned resources.

To launch the CloudFormation stack, complete the following steps:

  1. Sign in to the AWS CloudFormation console.
  2. Choose Launch Stack
  3. Choose Next.
  4. For S3BucketName, enter the name of your new S3 bucket.
  5. For VPCCIDR, enter a CIDR IP address range that doesn’t conflict with your existing networks.
  6. For PublicSubnetCIDR, enter the CIDR IP address range within the CIDR you gave for VPCCIDR.
  7. For PrivateSubnetACIDR and PrivateSubnetBCIDR, enter the CIDR IP address range within the CIDR you gave for VPCCIDR.
  8. For SubnetAzA and SubnetAzB, choose the subnets you want to use.
  9. For DatabaseUserName, enter your database user name.
  10. For DatabaseUserPassword, enter your database user password.
  11. Choose Next.
  12. On the next page, choose Next.
  13. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  14. Choose Create stack.

Stack creation can take about 20 minutes.

Set up an initial source table

The AWS Glue job rds-ingest-rds-setup-<CloudFormation stack name> creates a source table called ticket_activity on the RDS database instance. To set up the initial source table in Amazon RDS, complete the following steps:

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Choose rds-ingest-rds-setup-<CloudFormation stack name> to open the job.
  3. Choose Run.
  4. Navigate to the Runs tab and wait for Run status to show as SUCCEEDED.

This job only creates the table ticket_activity in the MySQL instance (DDL only). See the following code:

CREATE TABLE ticket_activity (
ticketactivity_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
sport_type VARCHAR(256) NOT NULL,
start_date DATETIME NOT NULL,
location VARCHAR(256) NOT NULL,
seat_level VARCHAR(256) NOT NULL,
seat_location VARCHAR(256) NOT NULL,
ticket_price INT NOT NULL,
customer_name VARCHAR(256) NOT NULL,
email_address VARCHAR(256) NOT NULL,
created_at DATETIME NOT NULL,
updated_at DATETIME NOT NULL )
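
The rds-ingest-data-initial job described earlier uses the Faker library to generate sample rows and load them into this table. The following is a rough sketch of that pattern; the connection parameters, value ranges, and row count are assumptions for illustration, not the repository's actual code:

import pymysql
from faker import Faker

fake = Faker()

# Connection details come from the CloudFormation parameters in practice
conn = pymysql.connect(host="<rds-endpoint>", user="<user>",
                       password="<password>", database="<database>")

insert_sql = """
INSERT INTO ticket_activity
  (sport_type, start_date, location, seat_level, seat_location,
   ticket_price, customer_name, email_address, created_at, updated_at)
VALUES (%s, %s, %s, %s, %s, %s, %s, %s, NOW(), NOW())
"""

with conn.cursor() as cursor:
    for _ in range(100):
        cursor.execute(insert_sql, (
            fake.random_element(("Football", "Baseball", "Basketball")),
            fake.date_time_this_year(),
            fake.city(),
            fake.random_element(("Standard", "Premium")),
            fake.bothify("Section ## Row ##"),
            fake.random_int(50, 500),
            fake.name(),
            fake.email(),
        ))
conn.commit()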

Ingest new records

In this section, we detail the steps to ingest new records. Complete the following steps to start running the jobs.

Start data ingestion to Kinesis Data Streams using AWS DMS

To start data ingestion from Amazon RDS to Kinesis Data Streams, complete the following steps:

  1. On the AWS DMS console, choose Database migration tasks in the navigation pane.
  2. Select the task rds-to-kinesis-<CloudFormation stack name>.
  3. On the Actions menu, choose Restart/Resume.
  4. Wait for the status to show as Load complete and Replication ongoing.

The AWS DMS replication task ingests data from Amazon RDS to Kinesis Data Streams continuously.

Start data ingestion to Amazon S3

Next, to start data ingestion from Kinesis Data Streams to Amazon S3, complete the following steps:

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Choose streaming-cdc-kinesis2hudi-<CloudFormation stack name> to open the job.
  3. Choose Run.

Do not stop this job; you can check the run status on the Runs tab and wait for it to show as Running.

Start the data load to the source table on Amazon RDS

To start data ingestion to the source table on Amazon RDS, complete the following steps:

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Choose rds-ingest-data-initial-<CloudFormation stack name> to open the job.
  3. Choose Run.
  4. Navigate to the Runs tab and wait for Run status to show as SUCCEEDED.

Validate the ingested data

About 2 minutes after starting the job, the data should be ingested into Amazon S3. To validate the ingested data in Athena, complete the following steps:

  1. On the Athena console, complete the following steps if you’re running an Athena query for the first time:
    • On the Settings tab, choose Manage.
    • Specify the stage directory and the S3 path where Athena saves the query results.
    • Choose Save.

  2. On the Editor tab, run the following query against the table to check the data:
SELECT * FROM "database_<account_number>_hudi_cdc_demo"."ticket_activity" limit 10;

Note that AWS CloudFormation creates the database with your account number in its name, as database_<your-account-number>_hudi_cdc_demo.

Update existing records

Before you update the existing records, note down the ticketactivity_id value of a record from the ticket_activity table. Run the following SQL using Athena. For this post, we use ticketactivity_id = 46 as an example:

SELECT * FROM "database_<account_number>_hudi_cdc_demo"."ticket_activity" limit 10;

To simulate a real-time use case, update the data in the source table ticket_activity on the RDS database instance to see that the updated records are replicated to Amazon S3. Complete the following steps:

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Choose rds-ingest-data-incremental-<CloudFormation stack name> to open the job.
  3. Choose Run.
  4. Choose the Runs tab and wait for Run status to show as SUCCEEDED.

To upsert the records in the source table, complete the following steps:

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Choose the job rds-upsert-data-<CloudFormation stack name>.
  3. On the Job details tab, under Advanced properties, for Job parameters, update the following parameters:
    • For Key, enter --ticketactivity_id.
    • For Value, replace 1 with one of the ticket IDs you noted above (for this post, 46).

  4. Choose Save.
  5. Choose Run and wait for the Run status to show as SUCCEEDED.

This AWS Glue Python shell job simulates a customer buying a ticket. It updates a record in the source table ticket_activity on the RDS database instance using the ticket ID passed in the job argument --ticketactivity_id. It sets ticket_price=500 and updates updated_at with the current timestamp.
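
Conceptually, the update issued by this job looks like the following sketch using PyMySQL; in the actual job, the connection details and the ticket ID come from the job parameters:

import pymysql


def upsert_ticket(conn, ticketactivity_id):
    # Mirrors what rds-upsert-data-<CloudFormation stack name> does for one ticket ID
    with conn.cursor() as cursor:
        cursor.execute(
            "UPDATE ticket_activity "
            "SET ticket_price = 500, updated_at = NOW() "
            "WHERE ticketactivity_id = %s",
            (ticketactivity_id,),
        )
    conn.commit()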

To validate the ingested data in Amazon S3, run the same query from Athena for the ticketactivity_id you noted earlier and check the ticket_price and updated_at fields:

SELECT * FROM "database_<account_number>_hudi_cdc_demo"."ticket_activity" where ticketactivity_id = 46 ;

Visualize the data in QuickSight

After you have the output file generated by the AWS Glue streaming job in the S3 bucket, you can use QuickSight to visualize the Hudi data files. QuickSight is a scalable, serverless, embeddable, ML-powered business intelligence (BI) service built for the cloud. QuickSight lets you easily create and publish interactive BI dashboards that include ML-powered insights. QuickSight dashboards can be accessed from any device and seamlessly embedded into your applications, portals, and websites.

Build a QuickSight dashboard

To build a QuickSight dashboard, complete the following steps:

  1. Open the QuickSight console.

You’re presented with the QuickSight welcome page. If you haven’t signed up for QuickSight, you may have to complete the signup wizard. For more information, refer to Signing up for an Amazon QuickSight subscription.

After you have signed up, QuickSight presents a “Welcome wizard.” You can view the short tutorial, or you can close it.

  2. On the QuickSight console, choose your user name and choose Manage QuickSight.
  3. Choose Security & permissions, then choose Manage.
  4. Select Amazon S3 and select the buckets that you created earlier with AWS CloudFormation.
  5. Select Amazon Athena.
  6. Choose Save.
  7. If you changed your Region during the first step of this process, change it back to the Region that you used earlier during the AWS Glue jobs.

Create a dataset

Now that you have QuickSight up and running, you can create your dataset. Complete the following steps:

  1. On the QuickSight console, choose Datasets in the navigation pane.
  2. Choose New dataset.
  3. Choose Athena.
  4. For Data source name, enter a name (for example, hudi-blog).
  5. Choose Validate.
  6. After the validation is successful, choose Create data source.
  7. For Database, choose database_<your-account-number>_hudi_cdc_demo.
  8. For Tables, select ticket_activity.
  9. Choose Select.
  10. Choose Visualize.
  11. Choose hour and then ticketactivity_id to get the count of ticketactivity_id by hour.

Clean up

To clean up your resources, complete the following steps:

  1. Stop the AWS DMS replication task rds-to-kinesis-<CloudFormation stack name>.
  2. Navigate to the RDS database and choose Modify.
  3. Deselect Enable deletion protection, then choose Continue.
  4. Stop the AWS Glue streaming job streaming-cdc-kinesis2hudi-<CloudFormation stack name>.
  5. Delete the CloudFormation stack.
  6. On the QuickSight dashboard, choose your user name, then choose Manage QuickSight.
  7. Choose Account settings, then choose Delete account.
  8. Choose Delete account to confirm.
  9. Enter confirm and choose Delete account.

Conclusion

In this post, we demonstrated how you can stream data—not only new records, but also updated records from relational databases—to Amazon S3 using an AWS Glue streaming job to create an Apache Hudi-based near-real-time transactional data lake. With this approach, you can easily achieve upsert use cases on Amazon S3. We also showcased how to visualize the Apache Hudi table using QuickSight and Athena. As a next step, refer to the Apache Hudi performance tuning guide for a high-volume dataset. To learn more about authoring dashboards in QuickSight, check out the QuickSight Author Workshop.


About the Authors

Raj Ramasubbu is a Sr. Analytics Specialist Solutions Architect focused on big data and analytics and AI/ML with Amazon Web Services. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS. Raj provided technical expertise and leadership in building data engineering, big data analytics, business intelligence, and data science solutions for over 18 years prior to joining AWS. He helped customers in various industry verticals like healthcare, medical devices, life science, retail, asset management, car insurance, residential REIT, agriculture, title insurance, supply chain, document management, and real estate.

Rahul Sonawane is a Principal Analytics Solutions Architect at AWS with AI/ML and Analytics as his area of specialty.

Sundeep Kumar is a Sr. Data Architect, Data Lake at AWS, helping customers build data lake and analytics platforms and solutions. When not building and designing data lakes, Sundeep enjoys listening to music and playing guitar.

A side-by-side comparison of Apache Spark and Apache Flink for common streaming use cases

Post Syndicated from Deepthi Mohan original https://aws.amazon.com/blogs/big-data/a-side-by-side-comparison-of-apache-spark-and-apache-flink-for-common-streaming-use-cases/

Apache Flink and Apache Spark are both open-source, distributed data processing frameworks used widely for big data processing and analytics. Spark is known for its ease of use, high-level APIs, and the ability to process large amounts of data. Flink shines in its ability to handle processing of data streams in real-time and low-latency stateful computations. Both support a variety of programming languages, scalable solutions for handling large amounts of data, and a wide range of connectors. Historically, Spark started out as a batch-first framework and Flink began as a streaming-first framework.

In this post, we share a comparative study of streaming patterns that are commonly used to build stream processing applications, how they can be solved using Spark (primarily Spark Structured Streaming) and Flink, and the minor variations in their approach. Examples cover code snippets in Python and SQL for both frameworks across three major themes: data preparation, data processing, and data enrichment. If you are a Spark user looking to solve your stream processing use cases using Flink, this post is for you. We do not intend to cover the choice of technology between Spark and Flink because it’s important to evaluate both frameworks for your specific workload and how the choice fits in your architecture; rather, this post highlights key differences for use cases that both these technologies are commonly considered for.

Apache Flink offers layered APIs that offer different levels of expressiveness and control and are designed to target different types of use cases. The three layers of API are Process Functions (also known as the Stateful Stream Processing API), DataStream, and Table and SQL. The Stateful Stream Processing API requires writing verbose code but offers the most control over time and state, which are core concepts in stateful stream processing. The DataStream API supports Java, Scala, and Python and offers primitives for many common stream processing operations, as well as a balance between code verbosity or expressiveness and control. The Table and SQL APIs are relational APIs that offer support for Java, Scala, Python, and SQL. They offer the highest abstraction and intuitive, SQL-like declarative control over data streams. Flink also allows seamless transition and switching across these APIs. To learn more about Flink’s layered APIs, refer to layered APIs.

Apache Spark Structured Streaming offers the Dataset and DataFrames APIs, which provide high-level declarative streaming APIs to represent static, bounded data as well as streaming, unbounded data. Operations are supported in Scala, Java, Python, and R. Spark has a rich function set and syntax with simple constructs for selection, aggregation, windowing, joins, and more. You can also use the Streaming Table API to read tables as streaming DataFrames as an extension to the DataFrames API. Although it’s hard to draw direct parallels between Flink and Spark across all stream processing constructs, at a very high level, we could say Spark Structured Streaming APIs are equivalent to Flink’s Table and SQL APIs. Spark Structured Streaming, however, does not yet (at the time of this writing) offer an equivalent to the lower-level APIs in Flink that offer granular control of time and state.

Both Flink and Spark Structured Streaming (referenced as Spark henceforth) are evolving projects. The following table provides a simple comparison of Flink and Spark capabilities for common streaming primitives (as of this writing).

Capability | Flink | Spark
Row-based processing | Yes | Yes
User-defined functions | Yes | Yes
Fine-grained access to state | Yes, via DataStream and low-level APIs | No
Control when state eviction occurs | Yes, via DataStream and low-level APIs | No
Flexible data structures for state storage and querying | Yes, via DataStream and low-level APIs | No
Timers for processing and stateful operations | Yes, via low-level APIs | No

In the following sections, we cover the greatest common factors so that we can showcase how Spark users can relate to Flink and vice versa. To learn more about Flink’s low-level APIs, refer to Process Function. For the sake of simplicity, we cover the four use cases in this post using the Flink Table API. We use a combination of Python and SQL for an apples-to-apples comparison with Spark.

Data preparation

In this section, we compare data preparation methods for Spark and Flink.

Reading data

We first look at the simplest ways to read data from a data stream. The following sections assume the following schema for messages:

symbol: string,
price: int,
timestamp: timestamp,
company_info:
{
    name: string,
    employees_count: int
}

Reading data from a source in Spark Structured Streaming

In Spark Structured Streaming, we use a streaming DataFrame in Python that directly reads the data in JSON format:

from pyspark.sql.functions import from_json, col

spark = ...  # spark session

# specify schema
stock_ticker_schema = ...

# Create a streaming DataFrame
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "mybroker1:port") \
    .option("subscribe", "stock_ticker") \
    .load() \
    .select(from_json(col("value").cast("string"), stock_ticker_schema).alias("ticker_data")) \
    .select(col("ticker_data.*"))

Note that we have to supply a schema object that captures our stock ticker schema (stock_ticker_schema). Compare this to the approach for Flink in the next section.
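
For completeness, the schema object for the message format shown earlier could be defined as follows; this is a sketch, and field nullability is an assumption:

from pyspark.sql.types import (IntegerType, StringType, StructField,
                               StructType, TimestampType)

stock_ticker_schema = StructType([
    StructField("symbol", StringType()),
    StructField("price", IntegerType()),
    StructField("timestamp", TimestampType()),
    StructField("company_info", StructType([
        StructField("name", StringType()),
        StructField("employees_count", IntegerType()),
    ])),
])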

Reading data from a source using Flink Table API

For Flink, we use the SQL DDL statement CREATE TABLE. You can specify the schema of the stream just like you would any SQL table. The WITH clause allows us to specify the connector to the data stream (Kafka in this case), the associated properties for the connector, and data format specifications. See the following code:

# Create table using DDL

CREATE TABLE stock_ticker (
  symbol string,
  price INT,
  timestamp TIMESTAMP(3),
  company_info STRING,
  WATERMARK FOR timestamp AS timestamp - INTERVAL '3' MINUTE
) WITH (
 'connector' = 'kafka',
 'topic' = 'stock_ticker',
 'properties.bootstrap.servers' = 'mybroker1:port',
 'properties.group.id' = 'testGroup',
 'format' = 'json',
 'json.fail-on-missing-field' = 'false',
 'json.ignore-parse-errors' = 'true'
)
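
If you are working from PyFlink rather than a SQL client or notebook, the same DDL can be submitted through a TableEnvironment. The following is a minimal sketch; it assumes the Kafka SQL connector JAR is available on the classpath, and it quotes the timestamp column with backticks because timestamp is a reserved keyword in Flink SQL:

from pyflink.table import EnvironmentSettings, TableEnvironment

# Create a streaming TableEnvironment and register the Kafka-backed table
table_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
table_env.execute_sql("""
CREATE TABLE stock_ticker (
  symbol STRING,
  price INT,
  `timestamp` TIMESTAMP(3),
  company_info STRING,
  WATERMARK FOR `timestamp` AS `timestamp` - INTERVAL '3' MINUTE
) WITH (
  'connector' = 'kafka',
  'topic' = 'stock_ticker',
  'properties.bootstrap.servers' = 'mybroker1:port',
  'properties.group.id' = 'testGroup',
  'format' = 'json'
)
""")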

JSON flattening

JSON flattening is the process of converting a nested or hierarchical JSON object into a flat, single-level structure. This converts multiple levels of nesting into an object where all the keys and values are at the same level. Keys are combined using a delimiter such as a period (.) or underscore (_) to denote the original hierarchy. JSON flattening is useful when you need to work with a more simplified format. In both Spark and Flink, nested JSONs can be complicated to work with and may need additional processing or user-defined functions to manipulate. Flattened JSONs can simplify processing and improve performance due to reduced computational overhead, especially with operations like complex joins, aggregations, and windowing. In addition, flattened JSONs can help in easier debugging and troubleshooting data processing pipelines because there are fewer levels of nesting to navigate.

JSON flattening in Spark Structured Streaming

JSON flattening in Spark Structured Streaming requires you to use the select method and specify the fields that you need flattened. It involves specifying the nested field name that you'd like surfaced to the top-level list of fields. In the following example, company_info is a nested field and within company_info, there's a field called name. With the following query, we're flattening company_info.name to company_name:

from pyspark.sql.functions import col

stock_ticker_df = ...  # Streaming DataFrame w/ schema shown above

stock_ticker_df.select("symbol", "timestamp", "price", col("company_info.name").alias("company_name"))

JSON flattening in Flink

In Flink SQL, you can use the JSON_VALUE function. Note that you can use this function only in Flink versions equal to or greater than 1.14. See the following code:

SELECT
   symbol,
   timestamp,
   price,
   JSON_VALUE(company_info, 'lax $.name' DEFAULT NULL ON EMPTY) AS company_name
FROM
   stock_ticker

The term lax in the preceding query has to do with JSON path expression handling in Flink SQL. For more information, refer to System (Built-in) Functions.

Data processing

Now that you have read the data, we can look at a few common data processing patterns.

Deduplication

Data deduplication in stream processing is crucial for maintaining data quality and ensuring consistency. It enhances efficiency by reducing the strain on the processing from duplicate data and helps with cost savings on storage and processing.

Spark Streaming deduplication query

The following code snippet is related to a Spark Streaming DataFrame named stock_ticker. The code performs an operation to drop duplicate rows based on the symbol column. The dropDuplicates method is used to eliminate duplicate rows in a DataFrame based on one or more columns.

stock_ticker = ...  # Streaming DataFrame w/ schema shown above

stock_ticker.dropDuplicates(["symbol"])
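
Because streaming deduplication keeps state for every key it has seen, it's common in Spark to pair dropDuplicates with a watermark so that old state can eventually be dropped. A minimal sketch, assuming event time is carried in the timestamp column:

# Keep at most 3 minutes of deduplication state per key
deduped_stock_ticker = (stock_ticker
    .withWatermark("timestamp", "3 minutes")
    .dropDuplicates(["symbol", "timestamp"]))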

Flink deduplication query

The following code shows the Flink SQL equivalent to deduplicate data based on the symbol column. The query retrieves the first row for each distinct value in the symbol column from the stock_ticker stream, based on the ascending order of proctime:

SELECT symbol, timestamp, price
FROM (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY symbol ORDER BY proctime ASC) AS row_num
  FROM stock_ticker)
WHERE row_num = 1

Windowing

Windowing in streaming data is a fundamental construct to process data within specifications. Windows commonly have time bounds, number of records, or other criteria. These time bounds bucketize continuous unbounded data streams into manageable chunks called windows for processing. Windows help in analyzing data and gaining insights in real time while maintaining processing efficiency. Analyses or operations are performed on constantly updating streaming data within a window.

There are two common time-based windows used both in Spark Streaming and Flink that we will detail in this post: tumbling and sliding windows. A tumbling window is a time-based window that is a fixed size and doesn’t have any overlapping intervals. A sliding window is a time-based window that is a fixed size and moves forward in fixed intervals that can be overlapping.

Spark Streaming tumbling window query

The following is a Spark Streaming tumbling window query with a window size of 10 minutes:

stock_ticker = ...  # Streaming DataFrame w/ schema shown above

# Get max stock price in tumbling window
# of size 10 minutes
max_price_by_window = stock_ticker \
   .withWatermark("timestamp", "3 minutes") \
   .groupBy(
      window(stock_ticker.timestamp, "10 minutes"),
      stock_ticker.symbol) \
   .max("price")

Flink Streaming tumbling window query

The following is an equivalent tumbling window query in Flink with a window size of 10 minutes:

SELECT window_start, window_end, symbol, MAX(price)
  FROM TABLE(
    TUMBLE(TABLE stock_ticker, DESCRIPTOR(timestamp), INTERVAL '10' MINUTES))
  GROUP BY window_start, window_end, symbol;

Spark Streaming sliding window query

The following is a Spark Streaming sliding window query with a window size of 10 minutes and slide interval of 5 minutes:

stock_ticker = ...  # Streaming DataFrame w/ schema shown above

# Get max stock price in sliding window
# of size 10 minutes and slide interval of size
# 5 minutes

max_price_by_window = stock_ticker \
   .withWatermark("timestamp", "3 minutes") \
   .groupBy(
      window(stock_ticker.timestamp, "10 minutes", "5 minutes"),
      stock_ticker.symbol) \
   .max("price")

Flink Streaming sliding window query

The following is a Flink sliding window query with a window size of 10 minutes and slide interval of 5 minutes:

SELECT window_start, window_end, symbol, MAX(price)
  FROM TABLE(
    HOP(TABLE stock_ticker, DESCRIPTOR(timestamp), INTERVAL '5' MINUTES, INTERVAL '10' MINUTES))
  GROUP BY window_start, window_end, symbol;

Handling late data

Both Spark Structured Streaming and Flink support event time processing, where a field within the payload can be used for defining time windows as distinct from the wall clock time of the machines doing the processing. Both Flink and Spark use watermarking for this purpose.

Watermarking is used in stream processing engines to handle late-arriving data. A watermark is like a timer that defines how long the system waits for late events. If an event arrives within the allowed lateness set by the watermark, the system uses it to update the corresponding result. If it arrives after the watermark has passed, the system ignores it.

In the preceding windowing queries, you specify the lateness threshold in Spark using the following code:

.withWatermark("timestamp", "3 minutes")

This means that any records arriving more than 3 minutes late, as tracked by the event time clock, will be discarded.

In contrast, with the Flink Table API, you can specify an analogous lateness threshold directly in the DDL:

WATERMARK FOR timestamp AS timestamp - INTERVAL '3' MINUTE

Note that Flink provides additional constructs for specifying lateness across its various APIs.

Data enrichment

In this section, we compare data enrichment methods with Spark and Flink.

Calling an external API

Calling external APIs from user-defined functions (UDFs) is similar in Spark and Flink. Note that your UDF will be called for every record processed, which can result in the API getting called at a very high request rate. In addition, in production scenarios, your UDF code often gets run in parallel across multiple nodes, further amplifying the request rate.

For the following code snippets, let’s assume that the external API call entails calling the function:

response = my_external_api(request)

External API call in Spark UDF

The following code uses Spark:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def call_external_api_udf(symbol):
    # Called once for every record processed
    response = my_external_api(symbol)
    return response

enriched_df = stock_ticker_df.withColumn(
    "enriched_symbol", call_external_api_udf(stock_ticker_df.symbol))

External API call in Flink UDF

For Flink, assume we define the UDF callExternalAPIUDF, which takes as input the ticker symbol symbol and returns enriched information about the symbol via a REST endpoint. We can then register and call the UDF as follows:

call_external_api_udf = udf(lambda symbol: my_external_api(symbol), result_type=DataTypes.STRING())
# Register under the name used in the SQL query below (table_env is the TableEnvironment)
table_env.create_temporary_system_function("callExternalAPIUDF", call_external_api_udf)

SELECT
    symbol, 
    callExternalAPIUDF(symbol) as enriched_symbol
FROM stock_ticker;

Flink UDFs implemented as a ScalarFunction also provide an open() initialization method that runs one time (as opposed to one time per record processed), which is useful for one-time setup such as creating an HTTP client.

Note that you should use UDFs judiciously as an improperly implemented UDF can cause your job to slow down, cause backpressure, and eventually stall your stream processing application. It’s advisable to use UDFs asynchronously to maintain high throughput, especially for I/O-bound use cases or when dealing with external resources like databases or REST APIs. To learn more about how you can use asynchronous I/O with Apache Flink, refer to Enrich your data stream asynchronously using Amazon Kinesis Data Analytics for Apache Flink.

Conclusion

Apache Flink and Apache Spark are both rapidly evolving projects and provide a fast and efficient way to process big data. This post focused on the top use cases we commonly encountered when customers wanted to see parallels between the two technologies for building real-time stream processing applications. We’ve included samples that were most frequently requested at the time of this writing. Let us know if you’d like more examples in the comments section.


About the author

Deepthi Mohan is a Principal Product Manager on the Amazon Kinesis Data Analytics team.

Karthi Thyagarajan was a Principal Solutions Architect on the Amazon Kinesis team.

Temporal data lake architecture for benchmark and indices analytics

Post Syndicated from Krishna Gogineni original https://aws.amazon.com/blogs/architecture/temporal-data-lake-architecture-for-benchmark-and-indices-analytics/

Financial trading houses and stock exchanges generate enormous volumes of data in near real-time, making it difficult to perform bi-temporal calculations that yield accurate results. Achieving this requires a processing architecture that can handle large volumes of data during peak bursts, meet strict latency requirements, and scale according to incoming volumes.

In this post, we’ll describe a scenario for an industry leader in the financial services sector and explain how AWS services are used for bi-temporal processing with state management and scale based on variable workloads during the day, all while meeting strict service-level agreement (SLA) requirements.

Problem statement

To design and implement a fully temporal transactional data lake with the repeatable read isolation level for queries is a challenge, particularly with burst events that need the overall architecture to scale accordingly. The data store in the overall architecture needs to record the value history of data at different times, which is especially important for financial data. Financial data can include corporate actions, annual or quarterly reports, or fixed-income securities, like bonds that have variable rates. It’s crucial to be able to correct data inaccuracies during the reporting period.

The example customer seeks a data processing platform architecture to dynamically scale based on the workloads with a capacity of processing 150 million records under 5 minutes. Their platform should be capable of meeting the end-to-end SLA of 15 minutes, from ingestion to reporting, with lowest total cost of ownership. Additionally, managing bi-temporal data requires a database that has critical features, such as ACID (atomicity, consistency, isolation, durability) compliance, time-travel capability, full-schema evolution, partition layout and evolution, rollback to prior versions, and SQL-like query experience.

Solution overview

The solution architecture key building blocks are Amazon Kinesis Data Streams for streaming data, Amazon Kinesis Data Analytics with Apache Flink as processing engine, Flink’s RocksDB for state management, and Apache Iceberg on Amazon Simple Storage Service (Amazon S3) as the storage engine (Figure 1).

End-to-end data-processing architecture

Figure 1. End-to-end data-processing architecture

Data processing

Here’s how it works:

  • A publisher application receives the data from the source systems and publishes data into Kinesis Data Streams using a well-defined JSON format structure.
  • Kinesis Data Streams holds the data for a duration that is configurable so data is not lost and can auto scale based on the data volume ingested.
  • Kinesis Data Analytics runs an Apache Flink application, with state management (RocksDB), to handle bi-temporal calculations. The Apache Flink application consumes data from Kinesis Data Streams and performs the following computations:
    • Transforms the JSON stream into a row-type record, compatible with a SQL table-like structure, resolving nesting and parent–child relationships present within the stream
    • Checks whether the record already has an existing state in in-memory RocksDB or on disk attached to the Kinesis Data Analytics computational node, to avoid read latency from the database, which is critical for meeting the performance requirements
    • Performs bi-temporal calculations and creates the resultant records in an in-memory data structure before invoking the Apache Iceberg sink operator
    • The Apache Flink application sink operator appends the temporal states, expressed as records, into the existing Apache Iceberg data store. This preserves key principles of time series data, namely immutability and the ability to time travel, along with ACID compliance, schema evolution, and partition evolution.
  • Kinesis Data Analytics is resilient and provides a no-data-loss capability, with features like periodic checkpoints and savepoints. They are used to store the state management in a secure Amazon S3 location that can be accessed outside of Kinesis Data Analytics. This savepoint mechanism can be used programmatically to scale the cluster size based on the workloads using time-driven scheduling and AWS Lambda functions.
  • If the time-to-live feature of RocksDB is implemented, old records are stored in Apache Iceberg on Amazon S3. When performing temporal calculations, if the state is not found in memory, data is read from Apache Iceberg into RocksDB and the processing is completed. However, this step is optional and can be circumvented if the Kinesis Data Analytics cluster is initialized with the right number of Kinesis processing units to hold the historical information, as per requirements.
  • Because the data is stored in an Apache Iceberg table format in Amazon S3, data is queried using Trino, which supports Apache Iceberg table format.
  • The end user queries data using any SQL tool that supports the Trino query engine.

Apache Iceberg maintenance jobs, such as data compaction, expire snapshot, delete orphan files, can be launched using Amazon Athena to optimize performance out of Apache Iceberg data store. Details of each processing step performed in Apache Flink application are captured using Amazon CloudWatch, which logs all the events.
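
These maintenance statements can also be scheduled programmatically. The following sketch submits an Iceberg OPTIMIZE and VACUUM statement through Athena using boto3; the table, database, and output location names are placeholders:

import boto3

athena = boto3.client("athena")


def run_maintenance(statement):
    # Each call returns a query execution ID that can be polled for completion
    response = athena.start_query_execution(
        QueryString=statement,
        QueryExecutionContext={"Database": "benchmark_db"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    return response["QueryExecutionId"]


# Compact small files, then remove expired snapshots and orphan files
run_maintenance("OPTIMIZE benchmark_table REWRITE DATA USING BIN_PACK")
run_maintenance("VACUUM benchmark_table")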

Scalability

Amazon EventBridge scheduler invokes a Lambda function to scale the Kinesis Data Analytics. Kinesis Data Analytics has a short outage during rescaling that is proportional to the amount of data stored in RocksDB, which is why a state management strategy is necessary for the proper operation of the system.

Figure 2 shows the scaling process, which depicts:

  • Before peak load: The Kinesis Data Analytics cluster is processing off-peak records with minimum configuration before the peak load. A scheduled event is launched from EventBridge that invokes a Lambda function, which shuts down the cluster using the savepoint mechanism and scales up the Kinesis Data Analytics cluster to the required number of Kinesis processing units.
  • During peak load: When the peak data burst happens, the Kinesis Data Analytics cluster is ready to handle the volume of data from Kinesis Data Stream, and processes it within the SLA of 5 minutes.
  • After peak load: A scheduled event from EventBridge invokes a Lambda function to scale down the Kinesis Data Analytics cluster to the minimum configuration that holds the required state for the entire volume of records.
Cluster scaling before, during, and after peak data volume processing

Figure 2. Cluster scaling before, during, and after peak data volume processing
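
A Lambda function invoked by the EventBridge schedule can drive this scale-up roughly as follows. This is a sketch using the kinesisanalyticsv2 API; the application name, target parallelism, and the assumption that snapshots are enabled (so the stop takes a savepoint) are illustrative:

import boto3

kda = boto3.client("kinesisanalyticsv2")
APP_NAME = "bitemporal-flink-app"  # placeholder application name


def scale_to(target_parallelism):
    # Graceful stop; with Force=False a snapshot (savepoint) is taken first
    kda.stop_application(ApplicationName=APP_NAME, Force=False)
    # ... wait until the application reaches the READY state (omitted) ...

    version = kda.describe_application(ApplicationName=APP_NAME)[
        "ApplicationDetail"]["ApplicationVersionId"]
    kda.update_application(
        ApplicationName=APP_NAME,
        CurrentApplicationVersionId=version,
        ApplicationConfigurationUpdate={
            "FlinkApplicationConfigurationUpdate": {
                "ParallelismConfigurationUpdate": {
                    "ConfigurationTypeUpdate": "CUSTOM",
                    "ParallelismUpdate": target_parallelism,
                }
            }
        },
    )
    # Restart from the latest snapshot so no state is lost
    kda.start_application(
        ApplicationName=APP_NAME,
        RunConfiguration={
            "ApplicationRestoreConfiguration": {
                "ApplicationRestoreType": "RESTORE_FROM_LATEST_SNAPSHOT"
            }
        },
    )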

Performance insights

With the discussed architecture, we want to demonstrate that we are able to meet the SLAs in terms of performance and processing times. We have taken a subset of benchmarks and indices data and processed it with the end-to-end architecture. During the process, we observed some very interesting findings, which we would like to share.

Processing time for Apache Iceberg Upsert vs. Append operations: During our tests, we expected the Upsert operation to be faster than Append. On the contrary, we noticed that Append operations were faster than Upsert, even though more computations are performed in the Apache Flink application. In our test with 3,500,000 records, the Append operation took 1,556 seconds while Upsert took 1,675 seconds to process the data (Figure 3).

Processing times for Upsert vs. Append

Figure 3. Processing times for Upsert vs. Append

Compute consumption for Apache Iceberg Upsert vs. Append operations: Comparing the compute consumption for 10,000,000 records, we noticed that the Append operation was able to process the data in the same amount of time as the Upsert operation but with fewer compute resources. In our tests, we noted that the Append operation consumed only 64 Kinesis processing units, whereas Upsert consumed 78 Kinesis processing units (Figure 4).

Comparing consumption for Upsert vs. Append

Figure 4. Comparing consumption for Upsert vs. Append

Scalability vs. performance: To achieve the desired data processing performance, we need a specific configuration of Kinesis processing units, Kinesis Data Streams, and Iceberg parallelism. In our test with the data that we chose, we started with four Kinesis processing units and four Kinesis data streams for data processing. We observed an 80% performance improvement in data processing with 16 Kinesis processing units. An additional 6% performance improvement was demonstrated when we scaled to 32 Kinesis processing units. When we increased the Kinesis data streams to 16, we observed an additional 2% performance improvement (Figure 5).

Scalability vs. performance

Figure 5. Scalability vs. performance

Data volume processing times for Upsert vs. Append: For this test, we started with 350,000 records of data. When we increased the data volume to 3.5 million records, we observed that Append performed better than Upsert, demonstrating a five-fold increase in processing time (Figure 6).

Data volume processing times for Upsert vs. Append

Figure 6. Data volume processing times for Upsert vs. Append

Conclusion

The architecture we explored today scales based on the data-volume requirements of the customer and is capable of meeting the end-to-end SLA of 15 minutes, with a potential lowered total cost of ownership. Additionally, the solution is capable of handling high-volume, bi-temporal computations with ACID compliance, time travel, full-schema evolution, partition layout evolution, rollback to prior versions and SQL-like query experience.


New Solution – Clickstream Analytics on AWS for Mobile and Web Applications

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/new-solution-clickstream-analytics-on-aws-for-mobile-and-web-applications/

Starting today, you can deploy on your AWS account an end-to-end solution to capture, ingest, store, analyze, and visualize your customers’ clickstreams inside your web and mobile applications (both for Android and iOS). The solution is built on top of standard AWS services.

This new solution Clickstream Analytics on AWS allows you to keep your data in the security and compliance perimeter of your AWS account and customize the processing and analytics as you require, giving you the full flexibility to extract value for your business. For example, many business line owners want to combine clickstream analytics data with business system data to gain more comprehensive insights. Storing clickstream analysis data in your AWS account allows you to cross reference the data with your existing business system, which is complex to implement when you use a third-party analytics solution that creates an artificial data silo.

Clickstream Analytics on AWS is available from the AWS Solutions Library at no cost, except for the services it deploys on your account.

Why Analyze Your Applications Clickstreams?
Organizations today are in search of vetted solutions and architectural guidance to rapidly solve business challenges. Whether you prefer off-the-shelf deployments or customizable architectures, the AWS Solutions Library carries solutions built by AWS and AWS Partners for a broad range of industry and technology use cases.

When I talk with mobile and web application developers or product owners, you often tell me that you want to use a clickstream analysis solution to understand your customers' behavior inside your application. Clickstream analysis solutions help you identify popular and frequently visited screens, analyze navigation patterns, identify bottlenecks and drop-off points, and perform A/B testing of functionality such as a paywall. However, you face two challenges when adopting or building a clickstream analysis solution.

Either you use a third-party library and analytics solution that sends all your application and customer data to an external provider, which causes security and compliance risks and makes it more difficult to reference your existing business data to enrich the analysis. Or you dedicate time and resources to build your own solution based on AWS services, such as Amazon Kinesis (for data ingestion), Amazon EMR (for processing), Amazon Redshift (for storage), and Amazon QuickSight (for visualization). Doing so ensures your application and customer data stay in the security perimeter of your AWS account, which is already approved and vetted by your information and security team. Often, building such a solution is an undifferentiated task that drives resources and budget away from developing the core business of your application.

Introducing Clickstream Analytics on AWS
The new solution Clickstream Analytics on AWS provides you with a backend for data ingestion, processing, and visualization of click stream data. It’s shipped as an AWS CloudFormation template that you can easily deploy into the AWS account of your choice.

In addition to the backend component, the solution provides you with purpose-built Java and Swift SDKs to integrate into your mobile applications (for both Android and iOS). The SDKs automatically collect data and provide developers with an easy-to-use API to collect application-specific data. They manage the low-level tasks of buffering the data locally, sending it to the backend, managing retries in case of communication errors, and more.

The following diagram shows you the high-level architecture of the solution.

Clickstream analysis - architecture

The solution comes with an easy-to-use console for configuration. For example, it allows you to choose among three AWS services to ingest the application clickstream data: Amazon Managed Streaming for Apache Kafka, Amazon Kinesis Data Streams, or Amazon Simple Storage Service (Amazon S3). You can create multiple data pipelines for multiple applications or teams, each using a different configuration. This allows you to adjust the backend to each application's user base and requirements.

You can use plugins to transform the data during the processing phase. The solution comes with two plugins preinstalled: User-Agent enrichment and IP address enrichment, which add data derived from the User-Agent header and from the geolocation of the IP address used by the client applications.
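To make the idea of an enrichment plugin concrete, here is a minimal, standalone Python sketch of a User-Agent enrichment step. This is not the solution's plugin interface or its actual field names; the event shape, function name, and regex are illustrative assumptions only.

import re

def enrich_user_agent(event: dict) -> dict:
    """Derive coarse browser information from the raw User-Agent string."""
    ua = event.get("user_agent", "")
    match = re.search(r"(Chrome|Firefox|Safari)/([\d.]+)", ua)
    event["browser"] = match.group(1) if match else "unknown"
    event["browser_version"] = match.group(2) if match else "unknown"
    return event

print(enrich_user_agent({
    "event_name": "screen_view",
    "user_agent": "Mozilla/5.0 (Windows NT 10.0) Chrome/120.0.0.0 Safari/537.36",
}))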

By default, it provides an Amazon Redshift Serverless configuration to minimize costs, but you can select a provisioned Amazon Redshift configuration to meet your performance and budget requirements.

Finally, the solution provides you with a set of pre-assembled visualization dashboards to report on user acquisition, user activity, and user engagement. The dashboards consume the data available in Amazon Redshift. You're free to develop other analytics and other dashboards using the tools and services of your choice.

Let’s See It in Action
The best way to learn how to deploy and configure Clickstream Analytics on AWS is to follow the tutorial steps provided by the Clickstream Analytics on AWS workshop.

The workshop goes into great detail about each step. Here are the main steps I followed to deploy the solution:

1. I create the control plane (the management console) of the solution using this CloudFormation template. The output of the template contains the URL to the management console. I later receive an email with a temporary password for the initial connection.

2. On the Clickstream Analytics console, I create my first project and define various network parameters such as the VPC, subnets, and security groups. I also select the service to use for data ingestion and my choice of configuration for Amazon Redshift.

Clickstream analysis - Create project

Clickstream analysis - data sink

3. After I enter all the configuration data, the console creates the data plane for my application.

AWS services and solutions are usually built around a control plane and one or multiple data planes. In the context of Clickstream Analytics, the control plane is the console that I use to define my data acquisition and analysis project. The data plane is the infrastructure that receives, analyzes, and visualizes my application data. Now that I have defined my project, the console generates and launches another CloudFormation template to create and manage the data plane.

4. The Clickstream Analytics console generates a JSON configuration file to include in my application, and it shares the Java or Swift code to add to my Android or iOS application. The console provides instructions for adding clickstream analysis as a dependency to my application. I update my application code to insert the suggested code and deploy.

Clickstream analysis - code for your applications

5. After my customers start to use the mobile app, I access the Clickstream Analytics dashboard to visualize the data collected.

The Dashboards
Clickstream Analytics dashboards are designed to provide a holistic view of the user lifecycle: acquisition, engagement, activity, and retention. In addition, they add visibility into user devices and geographies. The solution automatically generates visualizations in these six categories: Acquisition, Engagement, Activity, Retention, Devices, and Navigation path. Here are a couple of examples.

The Acquisition dashboard reports the total number of users, the registered number of users (the ones that signed in), and the number of users by traffic source. It also charts trends for new users and registered users.

Clickstream analysis - acquisition dashboard

The Engagement dashboard reports the user engagement level (the number of user sessions versus the time users spent on my application). Specifically, I have access to the number of engaged sessions (sessions that last more than 10 seconds or have at least two screen views), the engagement rate (the percentage of engaged sessions from the total number of sessions), and the average engagement time.

Clickstream analysis - engagement dashboard

The Activity dashboard shows the events and actions taken by my customers in my application. It reports data such as the number of events and the number of views (or screens), along with the top events and views over a given period of time.

Clickstream analysis - activity dashboard

The Retention tab shows user retention over time: the user stickiness for your daily, weekly, and monthly active users. It also shows the rate of returning users versus new users.

Clickstream analysis - retention

The Device tab shows data about your customers' devices: operating systems, versions, screen sizes, and languages.

Clickstream analysis - devices dashboard

And finally, the Path explorer dashboard shows your customers’ navigation path into the screens of your applications.

Clickstream analysis - path explorer dashboard

As I mentioned earlier, all the data are available in Amazon Redshift, so you’re free to build other analytics and dashboards.
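For example, you could query the clickstream tables directly with the Amazon Redshift Data API. The following boto3 sketch assumes a Redshift Serverless workgroup in us-east-1; the workgroup, database, table, and column names are placeholders, not the schema the solution actually creates.

import boto3

redshift_data = boto3.client("redshift-data", region_name="us-east-1")

# Count events per day over the last week (hypothetical table and columns).
response = redshift_data.execute_statement(
    WorkgroupName="clickstream-workgroup",   # placeholder
    Database="clickstream",                  # placeholder
    Sql="""
        SELECT event_date, COUNT(*) AS events
        FROM clickstream_events
        WHERE event_date >= DATEADD(day, -7, CURRENT_DATE)
        GROUP BY event_date
        ORDER BY event_date;
    """,
)

# Poll describe_statement with this ID, then fetch rows with get_statement_result.
print(response["Id"])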

Pricing and Availability
The Clickstream Analytics solution is available free of charge. You pay for the AWS services provisioned for you, including Kinesis or Amazon Redshift. Cost estimates depend on the configuration that you select. For example, the size of the Kinesis streams and the Amazon Redshift configuration you select for your data ingestion and analytics needs, as well as the volume of data your applications send to the pipeline, all affect the monthly cost of the solution.

To learn how to get started with this solution, take the Clickstream Analytics workshop today and stop sharing your customer and application clickstream data with third-party solutions.

— seb

Stream VPC Flow Logs to Datadog via Amazon Kinesis Data Firehose

Post Syndicated from Chaitanya Shah original https://aws.amazon.com/blogs/big-data/stream-vpc-flow-logs-to-datadog-via-amazon-kinesis-data-firehose/

It's common to store the logs generated by your applications and services in various tools. These logs are important for compliance, audits, troubleshooting, security incident responses, meeting security policies, and many other purposes. You can perform log analysis on these logs to understand users' application behavior and patterns to make informed decisions.

When running workloads on Amazon Web Services (AWS), you need to analyze Amazon Virtual Private Cloud (Amazon VPC) Flow Logs to track the IP traffic going to and from the network interfaces for the workloads in your VPC. Analyzing VPC flow logs helps you understand how your applications are communicating over the VPC network and serves as a primary source of information about the network traffic in your VPC.

You can easily deliver data to supported destinations using the Amazon Kinesis Data Firehose integration with VPC flow logs. Kinesis Data Firehose is a fully managed service for delivering near-real-time streaming data to various destinations for storage and performing near-real-time analytics. With its extensible data transformation capabilities, you can also streamline log processing and log delivery pipelines into a single Kinesis Data Firehose delivery stream. You can perform analytics on VPC flow logs delivered from your VPC using the Kinesis Data Firehose integration with Datadog as a destination.

Datadog is a monitoring and security platform and AWS Partner Network (APN) Advanced Technology Partner with AWS Competencies in AWS Cloud Operations, DevOps, Migration, Security, Networking, Containers, and Microsoft Workloads, along with many others.

Datadog enables you to easily explore and analyze logs to gain deeper insights into the state of your applications and AWS infrastructure. You can analyze all your AWS service logs while storing only the ones you need, and generate metrics from aggregated logs to uncover and alert on trends in your AWS services.

In this post, you learn how to integrate VPC flow logs with Kinesis Data Firehose and deliver them to Datadog.

Solution overview

This solution uses native integration of VPC flow logs streaming to Kinesis Data Firehose. We use a Kinesis Data Firehose delivery stream to buffer the streamed VPC flow logs to a Datadog destination endpoint in your Datadog account. You can use these logs with Datadog Log Management and Datadog Cloud SIEM to analyze the health, performance, and security of your cloud resources.

The following diagram illustrates the solution architecture.

We walk you through the following high-level steps:

  1. Link your AWS account with your Datadog account.
  2. Create the Kinesis Data Firehose stream where VPC service streams the flow logs.
  3. Create the VPC flow log subscription to Kinesis Data Firehose.
  4. Visualize VPC flow logs in the Datadog dashboard.

The account ID 123456781234 used in this post is a dummy account. It is used only for demonstration purposes.

Prerequisites

You should have the following prerequisites:

Link your AWS account with your Datadog account for AWS integration

Follow the instructions provided on the Datadog website for AWS Integration. To configure log archiving and enrich the log data sent from your AWS account with useful context, link the accounts. When you complete the linking setup, proceed to the following step.

Create a Kinesis Data Firehose stream

Now that your Datadog integration with AWS is complete, you can create a Kinesis Data Firehose delivery stream to which VPC flow logs are streamed by following these steps (a scripted boto3 alternative is sketched after the steps):

  1. On the Amazon Kinesis console, choose Kinesis Data Firehose in the navigation pane.
  2. Choose Create delivery stream.
  3. Choose Direct PUT as the source.
  4. Set Destination as Datadog.
    Create delivery stream
  5. For Delivery stream name, enter PUT-DATADOG-DEMO.
  6. Keep Data transformation set to Disabled under Transform records.
  7. In Destination settings, for HTTP endpoint URL, choose the desired log's HTTP endpoint based on your Region and Datadog account configuration.
    Kinesis delivery stream configuration
  8. For API key, enter your Datadog API key.

This allows your delivery stream to publish VPC flow logs to the Datadog endpoint. API keys are unique to your organization. An API key is required by the Datadog Agent to submit metrics and events to Datadog.

  9. Set Content encoding to GZIP to reduce the size of data transferred.
  10. Set the Retry duration to 60. You can change the Retry duration value if you need to; this depends on the request handling capacity of the Datadog endpoint.
    Kinesis destination settings
    Under Buffer hints, Buffer size and Buffer interval are set with default values for the Datadog integration.
    Kinesis buffer settings
  11. Under Backup settings, as mentioned in the prerequisites, choose the S3 bucket that you created to store failed logs and backups, and specify a prefix.
  12. In the S3 buffer hints section, set Buffer size to 5 and Buffer interval to 300.

You can change the S3 buffer size and interval based on your requirements.

  13. Under S3 compression and encryption, select GZIP for Compression for data records, or another compression method of your choice.

Compressing data reduces the required storage space.

  14. Select Disabled for Encryption of the data records. You can enable encryption of the data records to secure access to your logs.
    Kinesis stream backup settings
  15. Optionally, in Advanced settings, select Enable server-side encryption for source records in the delivery stream.
    You can use AWS managed keys or a CMK managed by you for the encryption type.
  16. Enable CloudWatch error logging.
  17. Choose Create or update IAM role, which is created by Kinesis Data Firehose as part of this stream.
    Kinesis stream Advanced settings
  18. Choose Next.
  19. Review your settings.
  20. Choose Create delivery stream.
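If you prefer to script these settings rather than click through the console, the following boto3 sketch mirrors the steps above. Treat it as a starting point: the IAM role ARN, backup bucket, and Datadog endpoint URL are placeholders you must supply (pick the endpoint that matches your Datadog site), and you should verify the parameters against the current Firehose API.

import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

role_arn = "arn:aws:iam::123456781234:role/firehose-datadog-delivery-role"  # placeholder

firehose.create_delivery_stream(
    DeliveryStreamName="PUT-DATADOG-DEMO",
    DeliveryStreamType="DirectPut",
    HttpEndpointDestinationConfiguration={
        "EndpointConfiguration": {
            "Url": "<DATADOG_LOGS_HTTP_ENDPOINT_URL>",  # depends on your Datadog site
            "Name": "Datadog logs",
            "AccessKey": "<YOUR_DATADOG_API_KEY>",
        },
        "RequestConfiguration": {"ContentEncoding": "GZIP"},
        "RetryOptions": {"DurationInSeconds": 60},
        "RoleARN": role_arn,
        "S3BackupMode": "FailedDataOnly",
        "S3Configuration": {
            "RoleARN": role_arn,
            "BucketARN": "arn:aws:s3:::<YOUR_FAILED_LOGS_BUCKET>",  # placeholder
            "Prefix": "datadog-failed/",
            "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 300},
            "CompressionFormat": "GZIP",
        },
    },
)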

Create a VPC flow logs subscription

Create a VPC flow logs subscription for the Kinesis Data Firehose delivery stream you created in the previous step:

  1. On the Amazon VPC console, choose Your VPCs.
  2. Select the VPC that you want to create the flow log for.
  3. On the Actions menu, choose Create flow log.
    AWS VPCs
  4. Select All to send all flow log records to the Firehose destination.

If you want to filter the flow logs, you could alternatively select Accept or Reject.

  5. For Maximum aggregation interval, select 10 minutes, or the minimum setting of 1 minute if you need the flow log data to be available for near-real-time analysis in Datadog.
  6. For Destination, select Send to Kinesis Data Firehose in the same account if the delivery stream is set up in the same account where you create the VPC flow logs.

If you want to send the data to a different account, refer to Publish flow logs to Kinesis Data Firehose.

  7. Choose an option for Log record format:
    • If you leave Log record format as the AWS default format, the flow logs are sent in version 2 format.
    • Alternatively, you can specify the custom fields for the flow logs to capture and send to Datadog.

For more information on log format and available fields, refer to Flow log records.

  8. Choose Create flow log.
    Create VPC Flow log
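The same flow log can also be created programmatically. The boto3 sketch below assumes us-east-1, the dummy account ID used in this post, and the delivery stream created in the previous section; the VPC ID is a placeholder, and you should double-check the parameters against the current EC2 API.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.create_flow_logs(
    ResourceIds=["vpc-0123456789abcdef0"],  # the VPC you selected above (placeholder)
    ResourceType="VPC",
    TrafficType="ALL",                      # use ACCEPT or REJECT to filter instead
    LogDestinationType="kinesis-data-firehose",
    LogDestination="arn:aws:firehose:us-east-1:123456781234:deliverystream/PUT-DATADOG-DEMO",
    MaxAggregationInterval=60,              # 1 minute for near-real-time analysis
)
print(response["FlowLogIds"])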

Now let’s explore the VPC flow logs in Datadog.

Visualize VPC flow logs in the Datadog dashboard

In the Logs Search option in the navigation pane, filter to source:vpc. The VPC flow logs from your VPC are in the Datadog Log Explorer and are automatically parsed so you can analyze your logs by source, destination, action, or other attributes.

Datadog Logs Dashboard

Clean up

After you test this solution, delete all the resources you created to avoid incurring future charges. Refer to the following links for instructions for deleting the resources:

Conclusion

In this post, we walked through a solution showing how to integrate VPC flow logs with a Kinesis Data Firehose delivery stream, deliver them to a Datadog destination with no code, and visualize them in a Datadog dashboard. With Datadog, you can easily explore and analyze logs to gain deeper insights into the state of your applications and AWS infrastructure.

Try this new, quick, and hassle-free way of sending your VPC flow logs to a Datadog destination using Kinesis Data Firehose.


About the Author

Chaitanya Shah is a Sr. Technical Account Manager (TAM) with AWS, based out of New York. He has over 22 years of experience working with enterprise customers. He loves to code and actively contributes to the AWS solutions labs to help customers solve complex problems. He provides guidance to AWS customers on best practices for their AWS Cloud migrations. He also specializes in AWS data transfer and the data and analytics domain.

Real-time inference using deep learning within Amazon Kinesis Data Analytics for Apache Flink

Post Syndicated from Jeremy Ber original https://aws.amazon.com/blogs/big-data/real-time-inference-using-deep-learning-within-amazon-kinesis-data-analytics-for-apache-flink/

Apache Flink is a framework and distributed processing engine for stateful computations over data streams. Amazon Kinesis Data Analytics for Apache Flink is a fully managed service that enables you to use an Apache Flink application to process streaming data. The Deep Java Library (DJL) is an open-source, high-level, engine-agnostic Java framework for deep learning.

In this blog post, we demonstrate how you can use DJL within Kinesis Data Analytics for Apache Flink for real-time machine learning inference. Real-time inference can be valuable in a variety of applications and industries where it is essential to make predictions or take actions based on new data as quickly as possible with low latencies. We show how to load a pre-trained deep learning model from the DJL model zoo into a Flink job and apply the model to classify data objects in a continuous data stream. The DJL model zoo includes a wide variety of pre-trained models for image classification, semantic segmentation, speech recognition, text embedding generation, question answering, and more. It supports Hugging Face, PyTorch, MXNet, and TensorFlow model frameworks and also helps developers create and publish their own models. We will focus on image classification and use a popular classifier called ResNet-18 to produce predictions in real time. The model has been pre-trained on ImageNet with 1.2 million images belonging to 1,000 class labels.

We provide sample code, architecture diagrams, and an AWS CloudFormation template so you can follow along and employ ResNet-18 as your classifier to make real-time predictions. The solution we provide here is a powerful design pattern for continuously producing ML-based predictions on streaming data within Kinesis Data Analytics for Apache Flink. You can adapt the provided solution for your use case and choose an alternative model from the model zoo or even provide your own model.

Image classification is a classic problem that takes an image and selects the best-fitting class, such as whether the image from an autonomous driving system is that of a bicycle or a pedestrian. Common use cases for real-time inference on streams of images include classifying images from vehicle cameras and license plate recognition systems, and classifying images uploaded to social media and ecommerce websites. The use cases typically need low latency while handling high throughput and potentially bursty streams. For example, in ecommerce websites, real-time classification of uploaded images can help in marking pictures of banned goods or hazardous materials that have been supplied by sellers. Immediate determination through streaming inference is needed to trigger alerts and follow-up actions to prevent these images from being part of the catalog. This enables faster decision-making compared to batch jobs that run on a periodic basis. The data stream pipeline can involve multiple models for different purposes, such as classifying uploaded images into ecommerce categories of electronics, toys, fashion, and so on.

Solution overview

The following diagram illustrates the workflow of our solution.

architecture showcasing a kinesis data analytics for apache flink application reading from Images in an Amazon S3 bucket, classifying those images and then writing out to another S3 bucket called "classifications"

The application performs the following steps:

  1. Read in images from Amazon Simple Storage Service (Amazon S3) using the Apache Flink Filesystem File Source connector.
  2. Window the images into a collection of records.
  3. Classify the batches of images using the DJL image classification class.
  4. Write inference results to Amazon S3 at the path specified.

We recommend using reasonably sized images so that they fit into a Kinesis Processing Unit. Images larger than 50 MB may result in latency during processing and classification.

The main class for this Apache Flink job is located at src/main/java/com.amazon.embeddedmodelinference/EMI.java. Here you can find the main() method and entry point to our Flink job.

Prerequisites

To get started, configure the following prerequisites on your local machine:

Once this is set up, you can clone the code base to access the source code for this solution. The Java application code for this example is available on GitHub. To download the application code, clone the remote repository using the following command:

git clone https://github.com/aws-samples/amazon-kinesis-data-analytics-examples

Find and navigate to the folder of the image classification example, called image-classification.

An example set of images to stream and test the code is available in the imagenet-sample-images folder.

Let’s walk through the code step by step.

Test on your local machine

If you would like to test this application locally, ensure you have AWS credentials set up on your machine. Additionally, download the Flink S3 Filesystem Hadoop JAR to use with your Apache Flink installation and place it in a folder named plugins/s3 in the root of your project. Then configure the following environment variables either in your IDE or in your machine's local variable scope:

IS_LOCAL=true;
plugins.dir=<<path-to-flink-s3-fs-hadoop jar>>
s3.access.key=<<aws access key>>
s3.secret.key=<<aws secret key>>

Replace these values with your own.

showcasing the environment properties to replace on IntelliJ

After configuring the environment variables and downloading the necessary plugin JAR, let’s look at the code.

In the main method, after setting up our StreamExecutionEnvironment, we define our FileSource to read files from Amazon S3. By default, this source operator reads from a sample bucket. You can replace this bucket name with your own by changing the variable called bucket, or setting the application property on Kinesis Data Analytics for Apache Flink once deployed.

final FileSource<StreamedImage> source =
        FileSource.forRecordStreamFormat(new ImageReaderFormat(), new Path(s3SourcePath))
                .monitorContinuously(Duration.ofSeconds(10))
                .build();

The FileSource is configured to read files in the ImageReaderFormat, and will check Amazon S3 for new images every 10 seconds. This polling interval is also configurable.

After we have read in our images, we convert our FileSource into a stream that can be processed:

DataStream<StreamedImage> stream =
env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source");

Next, we create a tumbling window with a configurable duration, defaulting to 60 seconds. Every time a window closes, it creates a batch (list) of images to be classified using a ProcessWindowFunction.

This ProcessWindowFunction calls the classifier's predict function on the list of images and returns the best classification probability for each image. The result is then sent downstream to the sink, where it's written out to the S3 bucket path specified in your configuration.

.process(new ProcessWindowFunction<StreamedImage, String, String, TimeWindow>() {
                    @Override
                    public void process(String s,
                                        ProcessWindowFunction<StreamedImage, String, String, TimeWindow>.Context context,
                                        Iterable<StreamedImage> iterableImages,
                                        Collector<String> out) throws Exception {


                            List<Image> listOfImages = new ArrayList<Image>();
                            iterableImages.forEach(x -> {
                                listOfImages.add(x.getImage());
                            });
                        try
                        {
                            // batch classify images
                            List<Classifications> list = classifier.predict(listOfImages);
                            for (Classifications classifications : list) {
                                Classifications.Classification cl = classifications.best();
                                String ret = cl.getClassName() + ": " + cl.getProbability();
                                out.collect(ret);
                            }
                        } catch (ModelException | IOException | TranslateException e) {
                            logger.error("Failed predict", e);
                        }
                        }
                    });

In Classifier.java, we read the image and apply crop, transpose, and reshape operations, and finally convert it to an N-dimensional array that the deep learning model can process. Then we feed the array to the model and run a forward pass. During the forward pass, the model computes the neural network layer by layer. Finally, the output object contains the probabilities for each class that the model has been trained on. We map the probabilities to the class names and return them to the map function.

Deploy the solution with AWS CloudFormation

To run this code base on Kinesis Data Analytics for Apache Flink, we provide a CloudFormation template that spins up the necessary resources. Open AWS CloudShell or your local machine's terminal and complete the following steps to deploy the solution:

  1. If you don’t have the AWS Cloud Development Kit (AWS CDK) bootstrapped in your account, run the following command, providing your account number and current Region:
cdk bootstrap aws://ACCOUNT-NUMBER/REGION

The script in the next step clones a GitHub repo of images to classify and uploads them to your source S3 bucket. Then it launches the CloudFormation stack given your input parameters.

  2. Enter the following code and replace the BUCKET variables with your own source bucket and sink bucket, which will contain the source images and the classifications, respectively:
export SOURCE_BUCKET=s3://SAMPLE-BUCKET/PATH;
export SINK_BUCKET=s3://SAMPLE_BUCKET/PATH;
git clone https://github.com/EliSchwartz/imagenet-sample-images; cd imagenet-sample-images;
aws s3 cp . $SOURCE_BUCKET --recursive --exclude "*/";
aws cloudformation create-stack --stack-name KDAImageClassification --template-url https://aws-blogs-artifacts-public.s3.amazonaws.com/artifacts/BDB-3098/BlogStack.template.json --parameters ParameterKey=inputBucketPath,ParameterValue=$SOURCE_BUCKET ParameterKey=outputBucketPath,ParameterValue=$SINK_BUCKET --capabilities CAPABILITY_IAM;

This CloudFormation stack creates the following resources:

    • A Kinesis Data Analytics application with 1 Kinesis Processing Unit (KPU) preconfigured with some application properties
    • An S3 bucket for your output results
  3. When the stack is complete, navigate to the Kinesis Data Analytics for Apache Flink console.
  4. Find the application called blog-DJL-flink-ImageClassification-application and choose Run.
  5. On the Amazon S3 console, navigate to the bucket you specified in the outputBucketPath variable.

If you have readable images in the source bucket listed, you should see classifications of those images within the checkpoint interval of the running application.

Deploy the solution manually

If you prefer to use your own code base, you can follow the manual steps in this section:

  1. After you clone the application locally, create your application JAR by navigating to the directory that contains your pom.xml and running the following command:
mvn clean package

This builds your application JAR, embedded-model-inference-1.0-SNAPSHOT.jar, in the target/ directory.

application properties on KDA console

  2. Upload this application JAR to an S3 bucket, either the one created from the CloudFormation template, or another one to store code artifacts.
  3. You can then configure your Kinesis Data Analytics application to point to this newly uploaded S3 JAR file.
  4. This is also a great opportunity to configure your runtime properties, as shown in the following screenshot.
  5. Choose Run to start your application.

You can open the Apache Flink Dashboard to check for application exceptions or to see data flowing through the tasks defined in the code.

image of flink dashboard showing successful running of the application

Validate the results

To validate our results, let’s check the results in Amazon S3 by navigating to the Amazon S3 console and finding our S3 bucket. We can find the output in a folder called output-kda.

image showing folders within amazon s3 partitioned by datetime

When we choose one of the data-partitioned folders, we can see partition files. Ensure that there is no underscore in front of your part file, because this indicates that the results are still being finalized according to the rollover interval defined in Apache Flink’s FileSink connector. After the underscores have disappeared, we can use Amazon S3 Select to view our data.

partition files as they land in Amazon S3

We now have a solution that continuously performs classification on incoming images in real time using Kinesis Data Analytics for Apache Flink. It extracts a pre-trained classification model (ResNet-18) from the DJL model zoo, applies some preprocessing, loads the model into a Flink operator’s memory, and continuously applies the model for online predictions on streaming images.

Although we used ResNet-18 in this post, you can choose another model by modifying the classifier. The DJL model zoo provides many other models, both for classification and other applications, that can be used out of the box. You can also load your custom model by providing an S3 link or URL to the criteria. DJL supports models in a large number of engines such as PyTorch, ONNX, TensorFlow, and MXNet. Using a model in the solution is relatively simple. All of the preprocessing and postprocessing code is encapsulated in the (built-in) translator, so all we have to do is load the model, create a predictor, and call predict(). This is done within the data source operator, which processes the stream of input data and sends references to the data to the inference operator, where the model you selected produces the prediction. Then the sink operator writes the results.

The CloudFormation template in this example focused on a simple 1 KPU application. You could extend the solution to further scale out to large models and high-throughput streams, and support multiple models within the pipeline.

Clean up

To clean up the CloudFormation script you launched, complete the following steps:

  1. Empty the source bucket you specified in the bash script.
  2. On the AWS CloudFormation console, locate the CloudFormation template called KDAImageClassification.
  3. Delete the stack, which will remove all of the remaining resources created for this post.
  4. You may optionally delete the bootstrapping CloudFormation template, CDKToolkit, which you launched earlier as well.

Conclusion

In this post, we presented a solution for real-time classification using the Deep Java Library within Kinesis Data Analytics for Apache Flink. We shared a working example and discussed how you can adapt the solution for your specific ML use case. If you have any feedback or questions, please leave them in the comments section.


About the Authors

Jeremy Ber has been working in the telemetry data space for the past 9 years as a Software Engineer, Machine Learning Engineer, and most recently a Data Engineer. At AWS, he is a Streaming Specialist Solutions Architect, supporting both Amazon Managed Streaming for Apache Kafka (Amazon MSK) and AWS’s managed offering for Apache Flink.

Deepthi Mohan is a Principal Product Manager for Amazon Kinesis Data Analytics, AWS’s managed offering for Apache Flink.

Gaurav Rele is a Data Scientist at the Amazon ML Solution Lab, where he works with AWS customers across different verticals to accelerate their use of machine learning and AWS Cloud services to solve their business challenges.

Real-time time series anomaly detection for streaming applications on Amazon Kinesis Data Analytics

Post Syndicated from Antonio Vespoli original https://aws.amazon.com/blogs/big-data/real-time-time-series-anomaly-detection-for-streaming-applications-on-amazon-kinesis-data-analytics/

Detecting anomalies in real time from high-throughput streams is key for informing on timely decisions in order to adapt and respond to unexpected scenarios. Stream processing frameworks such as Apache Flink empower users to design systems that can ingest and process continuous flows of data at scale. In this post, we present a streaming time series anomaly detection algorithm implemented with Apache Flink, based on matrix profiles and left-discords and inspired by Lu et al., 2022, and provide a working example that will help you get started on a managed Apache Flink solution using Amazon Kinesis Data Analytics.

Challenges of anomaly detection

Anomaly detection plays a key role in a variety of real-world applications, such as fraud detection, sales analysis, cybersecurity, predictive maintenance, and fault detection, among others. The majority of these use cases require actions to be taken in near real-time. For instance, card payment networks must be able to identify and reject potentially fraudulent transactions before processing them. This raises the challenge of designing near-real-time anomaly detection systems that can scale to ultra-fast-arriving data streams.

Another key challenge that anomaly detection systems face is concept drift. The ever-changing nature of some use cases requires models to dynamically adapt to new scenarios. For instance, in a predictive maintenance scenario, you could use several Internet of Things (IoT) devices to monitor the vibrations produced by an electric motor with the objective of detecting anomalies and preventing unrecoverable damage. Sounds emitted by the vibrations of the motor can vary significantly over time due to different environmental conditions such as temperature variations, and this shift in pattern can invalidate the model. This class of scenarios creates the necessity for online learning—the ability of the model to continuously learn from new data.

Time series anomaly detection

Time series are a particular class of data that incorporates time in their structuring. The data points that characterize a time series are recorded in an orderly fashion and are chronological in nature. This class of data is present in every industry and is common at the core of many business requirements or key performance indicators (KPIs). Natural sources of time series data include credit card transactions, sales, sensor measurements, machine logs, and user analytics.

In the time series domain, an anomaly can be defined as a deviation from the expected patterns that characterize the time series. For instance, a time series can be characterized by its expected ranges, trends, seasonal, or cyclic patterns. Any significant alteration of this normal flow of data points is considered an anomaly.

Detecting anomalies can be more or less challenging depending on the domain. For instance, a threshold-based approach might be suitable for time series that are informed of their expected ranges, such as the working temperature of a machine or CPU utilization. On the other hand, applications such as fraud detection, cybersecurity, and predictive maintenance can’t be classified via simple rule-based approaches and require a more fine-grained mechanism to capture unexpected observations. Thanks to their parallelizable and event-driven setup, streaming engines such as Apache Flink provide an excellent environment for scaling real-time anomaly detection to fast-arriving data streams.

Solution overview

Apache Flink is a distributed processing engine for stateful computations over streams. A Flink program can be implemented in Java, Scala, or Python. It supports ingestion, manipulation, and delivery of data to the desired destinations. Kinesis Data Analytics allows you to run Flink applications in a fully managed environment on AWS.

Distance-based anomaly detection is a popular approach where a model is characterized by a number of internally stored data points that are used for comparison against the new incoming data points. At inference time, these methods compute distances and classify new data points according to how dissimilar they are from the past observations. In spite of the plethora of algorithms in literature, there is increasing evidence that distance-based anomaly detection algorithms are still competitive with the state of the art (Nakamura et al., 2020).

In this post, we present a streaming version of a distance-based unsupervised anomaly detection algorithm called time series discords, and explore some of the optimizations introduced by the Discord Aware Matrix Profile (DAMP) algorithm (Lu et al., 2022), which further develops the discords method to scale to trillions of data points.

Understanding the algorithm

A left-discord is a subsequence that is significantly dissimilar from all the subsequences that precede it. In this post, we demonstrate how to use the concept of left-discords to identify time series anomalies in streams using Kinesis Data Analytics for Apache Flink.

Let’s consider an unbounded stream and all its subsequences of length n. The m most recent subsequences will be stored and used for inference. When a new data point arrives, a new subsequence that includes the new event is formed. The algorithm compares this latest subsequence (query) to the m subsequences retained from the model, with the exclusion of the latest n subsequences because they overlap with the query and would therefore characterize a self-match. After computing these distances, the algorithm classifies the query as an anomaly if its distance from its closest non-self-matching subsequence is above a certain moving threshold.
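To make the comparison concrete, the following NumPy sketch shows the core distance test described above. It uses plain Euclidean distance, omits z-normalization, and assumes the subsequences have already been extracted; the actual Flink implementation lives in MPProcessFunction.

import numpy as np

def is_left_discord(history: np.ndarray, query: np.ndarray, threshold: float) -> bool:
    """history holds the m retained subsequences (one per row, oldest first);
    query is the latest subsequence of length n. The last n rows of history
    overlap the query (self-matches) and are excluded from the comparison."""
    n = len(query)
    candidates = history[:-n]
    distances = np.linalg.norm(candidates - query, axis=1)  # distance to each predecessor
    return distances.min() > threshold  # anomaly if even the closest predecessor is far

# Toy usage: against a flat history, a spiky query is flagged as an anomaly.
history = np.tile(np.zeros(4), (100, 1))
query = np.array([0.0, 5.0, 5.0, 0.0])
print(is_left_discord(history, query, threshold=3.0))  # True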

For this post, we use a Kinesis data stream to ingest the input data, a Kinesis Data Analytics application to run the Flink anomaly detection program, and another Kinesis data stream to ingest the output produced by your application. For visualization purposes, we consume from the output stream using Kinesis Data Analytics Studio, which provides an Apache Zeppelin Notebook that we use to visualize and interact with the data in real time.

Implementation details

The Java application code for this example is available on GitHub. To download the application code, complete the following steps:

  1. Clone the remote repository using the following command:
    git clone https://github.com/aws-samples/amazon-kinesis-data-analytics-java-examples

  2. Navigate to the amazon-kinesis-data-analytics-java-examples/AnomalyDetection/LeftDiscords directory.

Let’s walk through the code step by step.

The MPStreamingJob class defines the data flow of the application, and the MPProcessFunction class defines the logic of the function that detects anomalies.

The implementation is best described by three core components:

  • The Kinesis data stream source, used to read from the input stream
  • The anomaly detection process function
  • The Kinesis data stream sink, used to deliver the output into the output stream

The anomaly detection function is implemented as a ProcessFunction<String, OutputWithLabel>. Its method MPProcessFunction#processElement is called for every data point:

@Override
public void processElement(String dataPoint, ProcessFunction<String, OutputWithLabel>.Context context,
                            Collector<OutputWithLabel> collector) {

   Double record = Double.parseDouble(dataPoint);

   int currentIndex = timeSeriesData.add(record);

   Double minDistance = 0.0;
   String anomalyTag = "INITIALISING";

   if (timeSeriesData.readyToCompute()) {
       minDistance = timeSeriesData.computeNearestNeighbourDistance();
       threshold.update(minDistance);
   }

   /*
   * Algorithm will wait for initializationPeriods * sequenceLength data points until starting
   * to compute the Matrix Profile (MP).
   */
   if (timeSeriesData.readyToInfer()) {
       anomalyTag = minDistance > threshold.getThreshold() ? "IS_ANOMALY" : "IS_NOT_ANOMALY";
   }

   OutputWithLabel output = new OutputWithLabel(currentIndex, record, minDistance, anomalyTag);

   collector.collect(output);
}

For every incoming data point, the anomaly detection algorithm takes the following actions:

  1. Adds the record to the timeSeriesData.
  2. If it has observed at least 2 * sequenceLength data points, starts computing the matrix profile.
  3. If it has observed at least initializationPeriods * sequenceLength data points, starts outputting anomaly labels.

Following these actions, the MPProcessFunction function outputs an OutputWithLabel object with four attributes:

  • index – The index of the data point in the time series
  • input – The input data without any transformation (identity function)
  • mp – The distance to the closest non-self-matching subsequence for the subsequence ending in index
  • anomalyTag – A binary label that indicates whether the subsequence is an anomaly

In the provided implementation, the threshold is learned online by fitting a normal distribution to the matrix profile data:

/*
 * Computes the threshold as two standard deviations away from the mean (p = 0.02)
 *
 * @return an estimated threshold
 */
public Double getThreshold() {
   Double mean = sum/counter;

   return mean + 2 * Math.sqrt(squaredSum/counter - mean*mean);
}

In this example, the algorithm classifies as anomalies those subsequences whose distance from their nearest neighbor deviates significantly from the average minimum distance (more than two standard deviations away from the mean).

The TimeSeries class implements the data structure that retains the context window, namely, the internally stored records that are used for comparison against the new incoming records. In the provided implementation, the n most recent records are retained, and when the TimeSeries object is at capacity, the oldest records are overwritten.
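A fixed-capacity context window like this can be sketched in a few lines of Python using collections.deque. The class and method names below loosely mirror the Java implementation but are illustrative only, not the actual members of TimeSeries.

from collections import deque

class ContextWindow:
    """Keeps only the most recent `capacity` records; appending when full
    silently drops the oldest record, mirroring the overwrite behavior."""

    def __init__(self, capacity: int):
        self.records = deque(maxlen=capacity)

    def add(self, value: float) -> int:
        self.records.append(value)
        return len(self.records) - 1  # index of the newly added record

    def ready_to_compute(self, min_points: int) -> bool:
        return len(self.records) >= min_points

window = ContextWindow(capacity=5)
for v in range(8):
    window.add(float(v))
print(list(window.records))  # [3.0, 4.0, 5.0, 6.0, 7.0]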

Prerequisites

Before you create a Kinesis Data Analytics application for this exercise, create two Kinesis data streams: InputStream and OutputStream in us-east-1. The Flink application will use these streams as its respective source and destination streams. To create these resources, launch the following AWS CloudFormation stack:

Launch Stack

Alternatively, follow the instructions in Creating and Updating Data Streams.
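If you would rather create the streams with the AWS SDK than with CloudFormation or the console, a minimal boto3 sketch looks like the following. One shard per stream is an assumption for this demo; size the streams for your expected throughput.

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Source and destination streams used by the Flink application.
for name in ("InputStream", "OutputStream"):
    kinesis.create_stream(StreamName=name, ShardCount=1)

# Wait until both streams are ACTIVE before starting the application.
waiter = kinesis.get_waiter("stream_exists")
for name in ("InputStream", "OutputStream"):
    waiter.wait(StreamName=name)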

Create the application

To create your application, complete the following steps:

  1. Clone the remote repository using the following command:
    git clone https://github.com/aws-samples/amazon-kinesis-data-analytics-java-examples 

  2. Navigate to the amazon-kinesis-data-analytics-java-examples/AnomalyDetection/LeftDiscords/core directory.
    cd amazon-kinesis-data-analytics-java-examples/AnomalyDetection/LeftDiscords/core

  3. Create your JAR file by running the following Maven command in the core directory, which contains the pom.xml file:
    mvn package -Dflink.version=1.15.4
  4. Create an Amazon Simple Storage Service (Amazon S3) bucket and upload the file target/left-discords-1.0.0.jar.
  5. Create and run a Kinesis Data Analytics application as described in Create and Run the Kinesis Data Analytics Application:
    1. Use the target/left-discords-1.0.0.jar.
    2. Note that the input and output streams are called InputStream and OutputStream, respectively.
    3. The provided example is set up to run in us-east-1.

Populate the input stream

You can populate InputStream by running the script.py file from the cloned repository, using the command python script.py. By editing the last two lines, you can populate the stream with synthetic data or with real data from a CSV dataset.

Visualize data on Kinesis Data Analytics Studio

Kinesis Data Analytics Studio provides the perfect setup for observing data in real time. The following screenshot shows sample visualizations. The first plot shows the incoming time series data, the second plot shows the matrix profile, and the third plot shows which data points have been classified as anomalies.

To visualize the data, complete the following steps:

  1. Create a notebook.
  2. Add the following paragraphs to the Zeppelin note:

Create a table and define the shape of the records generated by the application:

%flink.ssql

CREATE TABLE data (
index INT,
input VARCHAR(6),
mp VARCHAR(6),
anomalyTag VARCHAR(20)
)
PARTITIONED BY (index)
WITH (
'connector' = 'kinesis',
'stream' = 'OutputStream',
'aws.region' = 'us-east-1',
'scan.stream.initpos' = 'LATEST',
'format' = 'json',
'json.timestamp-format.standard' = 'ISO-8601'
)

Visualize the input data (choose Line Chart from the visualization options):

%flink.ssql(type=update)

SELECT index, input FROM data;

Visualize the output matrix profile data (choose Scatter Chart from the visualization options):

%flink.ssql(type=update)

SELECT index, mp FROM data;

Visualize the labeled data (choose Scatter Chart from the visualization options):

%flink.ssql(type=update)

SELECT index, anomalyTag FROM data;

Clean up

To delete all the resources that you created, follow the instructions in Clean Up AWS Resources.

Future developments

In this section, we discuss future developments for this solution.

Optimize for speed

The online time series discords algorithm is further developed and optimized for speed in Lu et al., 2022. The proposed optimizations include:

  • Early stopping – If the algorithm finds a subsequence that is similar enough (below the threshold), it stops searching and marks the query as non-anomaly.
  • Look-ahead windowing – Look at some amount of data in the future and compare it to the current query to cheaply discover and prune future subsequences that could not be left-discords. Note that this introduces some delay. The reason why this pruning improves performance is that data points that are close in time are more likely to be similar than data points that are distant in time.
  • Use of MASS – The MASS (Mueen’s Algorithm for Similarity Search) search algorithm is designed for efficiently discovering the most similar subsequence in the past.

Parallelization

The algorithm above operates with parallelism 1, which means that when a single worker is enough to handle the data stream throughput, the algorithm can be used directly. This design can be enhanced with further distribution logic for handling high-throughput scenarios. To parallelize this algorithm, you may need to design a partitioner operator that ensures the anomaly detection operators have the relevant past data points at their disposal. The algorithm can maintain a set of the most recent records to which it compares the query. Efficiency and accuracy trade-offs of approximate solutions are interesting to explore. Because the best solution for parallelizing the algorithm depends largely on the nature of the data, we recommend experimenting with various approaches using your domain-specific knowledge.

Conclusion

In this post, we presented a streaming version of an anomaly detection algorithm based on left-discords. By implementing this solution, you learned how to deploy an Apache Flink-based anomaly detection solution on Kinesis Data Analytics, and you explored the potential of Kinesis Data Analytics Studio for visualizing and interacting with streaming data in real time. For more details on how to implement anomaly detection solutions in Apache Flink, refer to the GitHub repository that accompanies this post. To learn more about Kinesis Data Analytics and Apache Flink, explore the Amazon Kinesis Data Analytics Developer Guide.

Give it a try and share your feedback in the comments section.


About the Authors

Antonio Vespoli is a Software Development Engineer in AWS. He works on Amazon Kinesis Data Analytics, the managed offering for running Apache Flink applications on AWS.

Samuel Siebenmann is a Software Development Engineer in AWS. He works on Amazon Kinesis Data Analytics, the managed offering for running Apache Flink applications on AWS.

Nuno Afonso is a Software Development Engineer in AWS. He works on Amazon Kinesis Data Analytics, the managed offering for running Apache Flink applications on AWS.

Real-time anomaly detection via Random Cut Forest in Amazon Kinesis Data Analytics

Post Syndicated from Daren Wong original https://aws.amazon.com/blogs/big-data/real-time-anomaly-detection-via-random-cut-forest-in-amazon-kinesis-data-analytics/

Real-time anomaly detection describes a use case to detect and flag unexpected behavior in streaming data as it occurs. Online machine learning (ML) algorithms are popular for this use case because they don’t require any explicit rules and are able to adapt to a changing baseline, which is particularly useful for continuous streams of data where incoming data changes continuously over time.

Random Cut Forest (RCF) is one such algorithm widely used for anomaly detection use cases. In typical setups, we want to be able to run the RCF algorithm on input data with large throughput, and streaming data processing frameworks can help with that. We are excited to share that running RCF is possible with Amazon Kinesis Data Analytics for Apache Flink. Apache Flink is a popular open-source framework for real-time, stateful computations over data streams, and can be used to run RCF on input streams with large throughput.

This post demonstrates how we can use Kinesis Data Analytics for Apache Flink to run an online RCF algorithm for anomaly detection.

Solution overview

The following diagram illustrates our architecture, which consists of three components: an input data stream using Amazon Kinesis Data Streams, a Flink job, and an output Kinesis data stream. In terms of data flow, we use a Python script to generate anomalous sine wave data into the input data stream, the data is then processed by RCF in a Flink job, and the resultant anomaly score is delivered to the output data stream.

The following graph shows an example of our expected result, which indicates that the anomaly score peaked when the sine wave data source anomalously dropped to constant -17.

We can implement this solution in three simple steps:

  1. Set up AWS resources via AWS CloudFormation.
  2. Set up a data generator to produce data into the source data stream.
  3. Run the RCF Flink Java code on Kinesis Data Analytics.

Set up AWS resources via AWS CloudFormation

The following CloudFormation stack will create all the AWS resources we need for this tutorial, including two Kinesis data streams, a Kinesis Data Analytics app, and an Amazon Simple Storage Service (Amazon S3) bucket.

Sign in to your AWS account, then choose Launch Stack:

BDB-2063-launch-cloudformation-stack

Follow the steps on the AWS CloudFormation console to create the stack.

Set up a data generator

Run the following Python script to populate the input data stream with the anomalous sine wave data:

import json
import boto3
import math 

STREAM_NAME = "ExampleInputStream-RCF"


def get_data(time):
    rad = (time/100)%360
    val = math.sin(rad)*10 + 10

    if rad > 2.4 and rad < 2.6:
        val = -17

    return {'time': time, 'value': val}

def generate(stream_name, kinesis_client):
    time = 0

    while True:
        data = get_data(time)
        kinesis_client.put_record(
            StreamName=stream_name,
            Data=json.dumps(data),
            PartitionKey="partitionkey")

        time += 1


if __name__ == '__main__':
    generate(STREAM_NAME, boto3.client('kinesis', region_name='us-west-2'))

Run the RCF Flink Java code on Kinesis Data Analytics

The CloudFormation stack automatically downloaded and packaged the RCF Flink job JAR file for you. Therefore, you can simply go to the Kinesis Data Analytics console to run your application.

That’s it! We now have a running Flink job that continuously reads in data from an input Kinesis data stream and calculates the anomaly score for each new data point given the previous data points it has seen.

The following sections explain the RCF implementation and Flink job code in more detail.

RCF implementation

Numerous RCF implementations are publicly available. For this tutorial, we use the AWS implementation, wrapped in a custom operator (RandomCutForestOperator) to be used in our Flink job.

RandomCutForestOperator is implemented as an Apache Flink ProcessFunction, which is a function that allows us to write custom logic to process every element in the stream. Our custom logic starts with a data transformation via inputDataMapper.apply, followed by getting the anomaly score by calling the AWS RCF library via rcf.getAnomalyScore. The code implementation of RandomCutForestOperator can be found on GitHub.
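Conceptually, the operator's per-element logic follows a map input, score, map output shape. The Python sketch below is not the Java implementation or the AWS RCF library API; the scorer is a trivial stand-in used purely to illustrate that shape.

class RandomCutForestLikeOperator:
    """Toy stand-in: maps each input, scores it, and maps the result.
    The real operator delegates scoring to the AWS Random Cut Forest library."""

    def __init__(self, input_mapper, result_mapper, scorer):
        self.input_mapper = input_mapper
        self.result_mapper = result_mapper
        self.scorer = scorer

    def process_element(self, value):
        point = self.input_mapper(value)         # e.g., float -> [float]
        score = self.scorer(point)               # anomaly score for this point
        return self.result_mapper(value, score)  # e.g., (value, score) tuple

# Trivial scorer: absolute distance from a running mean, purely for illustration.
def make_running_mean_scorer():
    state = {"count": 0, "mean": 0.0}
    def score(point):
        x = point[0]
        state["count"] += 1
        state["mean"] += (x - state["mean"]) / state["count"]
        return abs(x - state["mean"])
    return score

op = RandomCutForestLikeOperator(
    input_mapper=lambda v: [v],
    result_mapper=lambda v, s: (v, s),
    scorer=make_running_mean_scorer(),
)
print([op.process_element(v) for v in (1.0, 1.1, 9.0)])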

RandomCutForestOperatorBuilder requires two main types of parameters:

  • RandomCutForestOperator hyperparameters – We use the following:
    • Dimensions – We set this to 1 because our input data is a 1-dimensional sine wave consisting of the float data type.
    • ShingleSize – We set this to 1, which means our RCF algorithm will take into account the previous and current data points when computing the anomaly score. Note that this can be increased to account for seasonality in data.
    • SampleSize – We set this to 628, which means a maximum of 628 data points is kept in the data sample for each tree.
  • DataMapper parameters for input and output processing – We use the following:
    • InputDataMapper – We use RandomCutForestOperator.SIMPLE_FLOAT_INPUT_DATA_MAPPER to map input data from float to float[].
    • ResultMapper – We use RandomCutForestOperator.SIMPLE_TUPLE_RESULT_DATA_MAPPER, which is a BiFunction that joins the anomaly score with the corresponding sine wave data point into a tuple.

Flink job code

The following code snippet illustrates the core streaming structure of our Apache Flink streaming Java code. It first reads in data from the source Kinesis data stream, then processes it using the RCF algorithm. The computed anomaly score is then written to an output Kinesis data stream.

DataStream<Float> sineWaveSource = createSourceFromStaticConfig(env);

sineWaveSource
        .process(
                RandomCutForestOperator.<Float, Tuple2<Float, Double>>builder()
                        .setDimensions(1)
                        .setShingleSize(1)
                        .setSampleSize(628)
                        .setInputDataMapper(RandomCutForestOperator.SIMPLE_FLOAT_INPUT_DATA_MAPPER)
                        .setResultMapper(RandomCutForestOperator.SIMPLE_TUPLE_RESULT_DATA_MAPPER)
                        .build(),
                TupleTypeInfo.getBasicTupleTypeInfo(Float.class, Double.class))
       .addSink(createSinkFromStaticConfig());

In this example, our baseline input data is a sine wave. As shown in the following screenshot, a low anomaly score is returned when the data is regular. However, when there is an anomaly in the data (when the sine wave input data drops to a constant), a high anomaly score is returned. The anomaly score is delivered into an output Kinesis data stream. You can visualize this result by creating a Kinesis Data Analytics Studio app; for instructions, refer to Interactive analysis of streaming data.

Because this is an unsupervised algorithm, you don’t need to provide any explicit rules or labeled datasets for this operator. In short, only the input data stream, data conversions, and some hyperparameters were provided. The RCF algorithm itself determined the expected baseline based on the input data and identified any unexpected behavior.

Furthermore, this means the model will continuously adapt even if the baseline changes over time. As such, minimal retraining cadence is required. This is powerful for anomaly detection on streaming data because the data will often drift slowly over time due to seasonal trends, inflation, equipment calibration drift, and so on.

Clean up

To avoid incurring future charges, complete the following steps:

  1. On the Amazon S3 console, empty the S3 bucket created by the CloudFormation stack.
  2. On the AWS CloudFormation console, delete the CloudFormation stack.

Conclusion

This post demonstrated how to perform anomaly detection on input streaming data with RCF, an online unsupervised ML algorithm using Kinesis Data Analytics. We also showed how this algorithm learns the data baseline on its own, and can adapt to changes in the baseline over time. We hope you consider this solution for your real-time anomaly detection use cases.


About the Authors

Daren Wong is a Software Development Engineer in AWS. He works on Amazon Kinesis Data Analytics, the managed offering for running Apache Flink applications on AWS.

Aleksandr Pilipenko is a Software Development Engineer in AWS. He works on Amazon Kinesis Data Analytics, the managed offering for running Apache Flink applications on AWS.

Hong Liang Teoh is a Software Development Engineer in AWS. He works on Amazon Kinesis Data Analytics, the managed offering for running Apache Flink applications on AWS.

Accelerate data insights with Elastic and Amazon Kinesis Data Firehose

Post Syndicated from Udayasimha Theepireddy original https://aws.amazon.com/blogs/big-data/accelerate-data-insights-with-elastic-and-amazon-kinesis-data-firehose/

This is a guest post co-written with Udayasimha Theepireddy from Elastic.

Processing and analyzing log and Internet of Things (IoT) data can be challenging, especially when dealing with large volumes of real-time data. Elastic and Amazon Kinesis Data Firehose are two powerful tools that can help make this process easier. For example, by using Kinesis Data Firehose to ingest data from IoT devices, you can stream data directly into Elastic for real-time analysis. This can help you identify patterns and anomalies in the data as they happen, allowing you to take action in real time. Additionally, by using Elastic to store and analyze log data, you can quickly search and filter through large volumes of log data to identify issues and troubleshoot problems.

In this post, we explore how to integrate Elastic and Kinesis Data Firehose to streamline log and IoT data processing and analysis. We walk you through a step-by-step example of how to send VPC flow logs to Elastic through Kinesis Data Firehose.

Solution overview

Elastic is an AWS ISV Partner that helps you find information, gain insights, and protect your data when you run on AWS. Elastic offers enterprise search, observability, and security features that are built on a single, flexible technology stack that can be deployed anywhere.

Kinesis Data Firehose is a popular service that delivers streaming data from over 20 AWS services such as AWS IoT Core and Amazon CloudWatch Logs to over 15 analytical and observability tools such as Elastic. Kinesis Data Firehose provides a fast and easy way to send your VPC flow logs data to Elastic in minutes without a single line of code and without building or managing your own data ingestion and delivery infrastructure.

VPC flow logs capture the traffic information going to and from your network interfaces in your VPC. With the launch of Kinesis Data Firehose support to Elastic, you can analyze your VPC flow logs with just a few clicks. Kinesis Data Firehose provides a true end-to-end serverless mechanism to deliver your flow logs to Elastic, where you can use Elastic Dashboards to search through those logs, create dashboards, detect anomalies, and send alerts. VPC flow logs help you to answer questions like what percentage of your traffic is getting dropped, and how much traffic is getting generated for specific sources and destinations.

Integrating Elastic and Kinesis Data Firehose is a straightforward process: no agents or Beats are required. Simply configure your Firehose delivery stream to send its data to Elastic’s endpoint.

The following diagram depicts this specific configuration of how to ingest VPC flow logs via Kinesis Data Firehose into Elastic.

In the past, users would have to either use an AWS Lambda function to transform the incoming VPC flow logs data landing in an Amazon Simple Storage Service (Amazon S3) bucket before loading it into Kinesis Data Firehose, or create a CloudWatch Logs subscription filter that sends any incoming log events matching defined filters to the Firehose delivery stream.

With this new integration, you can set up this configuration directly from your VPC flow logs to Kinesis Data Firehose and into Elastic Cloud. (Note that Elastic Cloud must be deployed on AWS.)

Let’s walk through the details of configuring Kinesis Data Firehose and Elastic, and demonstrate ingesting data.

Prerequisites

To set up this demonstration, make sure you have the following prerequisites:

We walk through installing general AWS integration components into the Elastic Cloud deployment to ensure Kinesis Data Firehose connectivity. Refer to the full list of services supported by the Elastic/AWS integration for more information.

Deploy Elastic on AWS

Follow the instructions on the Elastic registration page to get started on Elastic Cloud.

Once logged in to Elastic Cloud, create a deployment. It’s important to ensure that the deployment is on AWS, because the Firehose delivery stream connects specifically to an endpoint that must be on AWS.

After you create your deployment, copy the Elasticsearch endpoint to use in a later step.

The endpoint should be an AWS endpoint, such as https://thevaa-cluster-01.es.us-east-1.aws.found.io.

Enable Elastic’s AWS integration

In your deployment’s Elastic Integration section, navigate to the AWS integration and choose Install AWS assets.

Configure a Firehose delivery stream

Create a new delivery stream on the Kinesis Data Firehose console. This is where you provide the endpoint you saved earlier. Refer to the following screenshot for the destination settings, and for more details, refer to Choose Elastic for Your Destination.

In this example, we are pulling in VPC flow logs via the data stream parameter we added (logs-aws.vpcflow-default). The parameter es_datastream_name can be configured with one of the following types of logs:

  • logs-aws.cloudfront_logs-default – Amazon CloudFront logs
  • logs-aws.ec2_logs-default – Amazon Elastic Compute Cloud (Amazon EC2) logs in CloudWatch
  • logs-aws.elb_logs-default – Elastic Load Balancing logs
  • logs-aws.firewall_logs-default – AWS Network Firewall logs
  • logs-aws.route53_public_logs-default – Amazon Route 53 public DNS queries logs
  • logs-aws.route53_resolver_logs-default – Route 53 DNS queries and responses logs
  • logs-aws.s3access-default – Amazon S3 server access logs
  • logs-aws.vpcflow-default – VPC flow logs
  • logs-aws.waf-default – AWS WAF logs
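
If you prefer to script this configuration, the following boto3 sketch shows one way to create a delivery stream with an HTTP endpoint destination that points at your Elastic deployment and passes es_datastream_name as a common attribute. This is a minimal sketch and assumes the Elastic destination can be expressed as an HTTP endpoint destination; the endpoint URL, API key, role ARN, and backup bucket are placeholders you must replace with your own values, and the console flow above is the documented path.

import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

# All ARNs, the endpoint URL, and the API key below are placeholders
firehose.create_delivery_stream(
    DeliveryStreamName="elastic-vpc-flow-logs",
    DeliveryStreamType="DirectPut",
    HttpEndpointDestinationConfiguration={
        "EndpointConfiguration": {
            "Url": "https://<your-deployment>.es.us-east-1.aws.found.io",
            "Name": "Elastic",
            "AccessKey": "<elastic-api-key>",
        },
        "RequestConfiguration": {
            "ContentEncoding": "GZIP",
            # es_datastream_name routes records to the Elastic data stream (VPC flow logs here)
            "CommonAttributes": [
                {"AttributeName": "es_datastream_name",
                 "AttributeValue": "logs-aws.vpcflow-default"}
            ],
        },
        "S3BackupMode": "FailedDataOnly",
        "RoleARN": "arn:aws:iam::<account-id>:role/<firehose-delivery-role>",
        "S3Configuration": {
            "RoleARN": "arn:aws:iam::<account-id>:role/<firehose-delivery-role>",
            "BucketARN": "arn:aws:s3:::<backup-bucket>",
        },
    },
)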

Deploy your application

Follow the instructions on the GitHub repo and instructions in the AWS Three Tier Web Architecture workshop to deploy your application.

After you install the app, get your credentials from AWS to use with Elastic’s AWS integration.

There are several options for credentials:

  • Use access keys directly
  • Use temporary security credentials
  • Use a shared credentials file
  • Use an AWS Identity and Access Management (IAM) role Amazon Resource Name (ARN)

For more details, refer to AWS Credentials and AWS Permissions.

Configure VPC flow logs to send to Kinesis Data Firehose

In the VPC for the application you deployed, you need to configure your VPC flow logs and point them to the Firehose delivery stream.
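
The following boto3 sketch shows one way to do this from code, assuming same-account delivery to the Firehose delivery stream; the VPC ID and delivery stream ARN are placeholders.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Send all VPC traffic (accepted and rejected) to the Firehose delivery stream
ec2.create_flow_logs(
    ResourceType="VPC",
    ResourceIds=["vpc-0123456789abcdef0"],   # placeholder VPC ID
    TrafficType="ALL",
    LogDestinationType="kinesis-data-firehose",
    LogDestination="arn:aws:firehose:us-east-1:123456789012:deliverystream/elastic-vpc-flow-logs",
)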

Validate the VPC flow logs

In the Elastic Observability view of the log streams, you should see the VPC flow logs coming in after a few minutes, as shown in the following screenshot.

Analyze VPC flow logs in Elastic

Now that you have VPC flow logs in Elastic Cloud, how can you analyze them? There are several analyses you can perform on the VPC flow log data:

  • Use Elastic’s Analytics Discover capabilities to manually analyze the data
  • Use Elastic Observability’s anomaly feature to identify anomalies in the logs
  • Use an out-of-the-box dashboard to further analyze the data

Use Elastic’s Analytics Discover to manually analyze data

In Elastic Analytics, you can search and filter your data, get information about the structure of the fields, and display your findings in a visualization. You can also customize and save your searches and place them on a dashboard.

For a complete understanding of Discover and all of Elastic’s Analytics capabilities, refer to Discover.

For VPC flow logs, it’s important to understand the following:

  • How many logs were accepted or rejected
  • Where potential security violations occur (source IPs from outside the VPC)
  • What port is generally being queried

For our example, we filter the logs on the following:

  • Delivery stream name – AWS-3-TIER-APP-VPC-LOGS
  • VPC flow log action – REJECT
  • Time frame – 5 hours
  • VPC network interface – Webserver 1 and Webserver 2 interfaces

We want to see which IP addresses are trying to hit our web servers and, from those, which ones we’re getting the most REJECT actions from. We simply select the source.ip field and quickly get a breakdown showing that 185.156.73.54 has been rejected the most over the 3 or more hours since we turned on VPC flow logs.

Additionally, we can create a visualization by choosing Visualize. We get the following donut chart, which we can add to a dashboard.

In addition to IP addresses, we also want to see which port is being hit on our web servers.

We select the destination port field, and the pop-up shows us that port 8081 is being targeted. This port is generally used for the administration of Apache Tomcat. This is a potential security issue; however, port 8081 is closed to outside traffic, hence the REJECT actions.

Detect anomalies in Elastic Observability logs

In addition to Discover, Elastic Observability provides the ability to detect anomalies on logs using machine learning (ML). The feature has the following options:

  • Log rate – Automatically detects anomalous log entry rates
  • Categorization – Automatically categorizes log messages

For our VPC flow log, we enabled both features. When we look at what was detected for anomalous log entry rates, we get the following results.

Elastic immediately detected a spike in logs when we turned on VPC flow logs for our application. The rate change is detected because we had already been ingesting VPC flow logs from another application for a couple of days prior to adding the application in this post.

We can drill down into this anomaly with ML and analyze further.

To learn more about the ML analysis you can utilize with your logs, refer to Machine learning.

Because we know that a spike exists, we can also use the Elastic AIOps Labs Explain Log Rate Spikes capability. Additionally, we’ve grouped the results to see what is causing some of the spikes.

In the preceding screenshot, we can observe that a specific network interface is sending more VPC log flows than others. We can drill down into this further in Discover.

Use the VPC flow log dashboard

Finally, Elastic also provides an out-of-the-box dashboard to show the top IP addresses hitting your VPC, geographically where they are coming from, the time series of the flows, and a summary of VPC flow log rejects within the time frame.

You can enhance this baseline dashboard with the visualizations you find in Discover, as we discussed earlier.

Conclusion

This post demonstrated how to configure an integration with Kinesis Data Firehose and Elastic for efficient infrastructure monitoring of VPC flow logs in Elastic Kibana dashboards. Elastic offers flexible deployment options on AWS, supporting software as a service (SaaS), AWS Marketplace, and bring your own license (BYOL) deployments. Elastic also provides AWS Marketplace private offers. You have the option to deploy and run the Elastic Stack yourself within your AWS account, either free or with a paid subscription from Elastic. To get started, visit the Kinesis Data Firehose console and specify Elastic as the destination. To learn more, explore the Amazon Kinesis Data Firehose Developer Guide.


About the Authors

Udayasimha Theepireddy is an Elastic Principal Solution Architect, where he works with customers to solve real world technology problems using Elastic and AWS services. He has a strong background in technology, business, and analytics.

Antony Prasad Thevaraj is a Sr. Partner Solutions Architect in Data and Analytics at AWS. He has over 12 years of experience as a Big Data Engineer, and has worked on building complex ETL and ELT pipelines for various business units.

Mostafa Mansour is a Principal Product Manager – Tech at Amazon Web Services where he works on Amazon Kinesis Data Firehose. He specializes in developing intuitive product experiences that solve complex challenges for customers at scale. When he’s not hard at work on Amazon Kinesis Data Firehose, you’ll likely find Mostafa on the squash court, where he loves to take on challengers and perfect his dropshots.

Build a real-time GDPR-aligned Apache Iceberg data lake

Post Syndicated from Dhiraj Thakur original https://aws.amazon.com/blogs/big-data/build-a-real-time-gdpr-aligned-apache-iceberg-data-lake/

Data lakes are a popular choice for today’s organizations to store data about their business activities. As a best practice of data lake design, data should be immutable once stored. But regulations such as the General Data Protection Regulation (GDPR) have created obligations for data operators who must be able to erase or update personal data in their data lake when requested.

A data lake built on AWS uses Amazon Simple Storage Service (Amazon S3) as its primary storage environment. When a customer asks to erase or update private data, the data lake operator needs to find the required objects in Amazon S3 that contain the required data and take steps to erase or update that data. This activity can be a complex process for the following reasons:

  • Data lakes may contain many S3 objects (each of which may contain multiple rows), and it’s often difficult to find the object containing the exact data that needs to be erased or the personally identifiable information (PII) that needs to be updated as per the request
  • By nature, S3 objects are immutable and therefore applying direct row-based transactions like DELETE or UPDATE isn’t possible

To handle these situations, a transactional feature on S3 objects is required, and frameworks such as Apache Hudi or Apache Iceberg provide you the transactional feature for upserts in Amazon S3.

AWS contributed the Apache Iceberg integration with the AWS Glue Data Catalog, which enables you to use open-source data computation engines like Apache Spark with Iceberg on AWS Glue. In 2022, Amazon Athena announced support for Iceberg, enabling transactional queries on S3 objects.

In this post, we show you how to stream real-time data to an Iceberg table in Amazon S3 using AWS Glue streaming and perform transactions using Amazon Athena for deletes and updates. We use a serverless mechanism for this implementation, which requires minimum operational overhead to manage and fine-tune various configuration parameters, and enables you to extend your use case to ACID operations beyond the GDPR.

Solution overview

We used the Amazon Kinesis Data Generator (KDG) to produce synthetic streaming data in Amazon Kinesis Data Streams and then processed the streaming input data using AWS Glue streaming to store the data in Amazon S3 in Iceberg table format. As part of the customer’s request, we ran delete and update statements using Athena with Iceberg support.

The following diagram illustrates the solution architecture.

The solution workflow consists of the following steps:

  1. Streaming data is generated in JSON format using the KDG template and inserted into Kinesis Data Streams.
  2. An AWS Glue streaming job is connected to Kinesis Data Streams to process the data using the Iceberg connector.
  3. The streaming job output is stored in Amazon S3 in Iceberg table format.
  4. Athena uses the AWS Glue Data Catalog to store and retrieve table metadata for the Amazon S3 data in Iceberg format.
  5. Athena interacts with the Data Catalog tables in Iceberg format for transactional queries required for GDPR.

The codebase required for this post is available in the GitHub repository.

Prerequisites

Before starting the implementation, make sure the following prerequisites are met:

Deploy resources using AWS CloudFormation

Complete the following steps to deploy your solution resources:

  1. After you sign in to your AWS account, launch the CloudFormation template by choosing Launch Stack:
  2. For Stack name, enter a name.
  3. For Username, enter the user name for the KDG.
  4. For Password, enter the password for the KDG (this must be at least six alphanumeric characters, and contain at least one number).
  5. For IAMGlueStreamingJobRoleName, enter a name for the IAM role used for the AWS Glue streaming job.
  6. Choose Next and create your stack.

This CloudFormation template configures the following resources in your account:

  • An S3 bucket named streamingicebergdemo-XX (note that the XX part is a random unique number to make the S3 bucket name unique)
  • An IAM policy and role
  • The KDG URL used for creating synthetic data
  1. After you complete the setup, go to the Outputs tab of the CloudFormation stack to get the S3 bucket name, AWS Glue job execution role (as per your input), and KDG URL.
  2. Before proceeding with the demo, create a folder named custdata under the created S3 bucket.

Create a Kinesis data stream

We use Kinesis Data Streams to create a serverless streaming data service that is built to handle millions of events with low latency. The following steps guide you on how to create the data stream in the us-east-1 Region:

  1. Log in to the AWS Management Console.
  2. Navigate to Kinesis console (make sure the Region is us-east-1).
  3. Select Kinesis Data Streams and choose Create data stream.
  4. For Data stream name, enter demo-data-stream.
  5. For this post, we select On-demand as the Kinesis data stream capacity mode.

On-demand mode eliminates the need to provision and manage capacity for streaming data. However, you can implement this solution with Kinesis Data Streams in provisioned mode as well.

  1. Choose Create data stream.
  2. Wait for successful creation of demo-data-stream and for it to be in Active status.
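
If you prefer to create the stream programmatically, the following boto3 sketch performs the equivalent of the preceding console steps. The stream name and Region match the values used in this post.

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Create the on-demand data stream used throughout this post
kinesis.create_stream(
    StreamName="demo-data-stream",
    StreamModeDetails={"StreamMode": "ON_DEMAND"},
)

# Block until the stream is in Active status
kinesis.get_waiter("stream_exists").wait(StreamName="demo-data-stream")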

Set up the Kinesis Data Generator

To create a sample streaming dataset, we use the KDG URL generated on the CloudFormation stack Outputs tab and log in with the credentials used in the parameters for the CloudFormation template. For this post, we use the following template to generate sample data in the demo-data-stream Kinesis data stream.

  1. Log in to the KDG URL with the user name and password you supplied during stack creation.
  2. Change the Region to us-east-1.
  3. Select the Kinesis data stream demo-data-stream.
  4. For Records per second, choose Constant and enter 100 (it can be another number, depending on the rate of record creation).
  5. On the Template 1 tab, enter the KDG data generation template:
{
    "year": "{{random.number({"min":2000,"max":2022})}}",
    "month": "{{random.number({"min":1,"max":12})}}",
    "day": "{{random.number({"min":1,"max":30})}}",
    "hour": "{{random.number({"min":0,"max":24})}}",
    "minute": "{{random.number({"min":0,"max":60})}}",
    "customerid": {{random.number({"min":5023,"max":59874})}},
    "firstname" : "{{name.firstName}}",
    "lastname" : "{{name.lastName}}",
    "dateofbirth" : "{{date.past(70)}}",
    "city" : "{{address.city}}",
    "buildingnumber" : {{random.number({"min":63,"max":947})}},
    "streetaddress" : "{{address.streetAddress}}",
    "state" : "{{address.state}}",
    "zipcode" : "{{address.zipCode}}",
    "country" : "{{address.country}}",
    "countrycode" : "{{address.countryCode}}",
    "phonenumber" : "{{phone.phoneNumber}}",
    "productname" : "{{commerce.productName}}",
    "transactionamount": {{random.number(
        {
            "min":10,
            "max":150
        }
    )}}
}
  1. Choose Test template to test the sample records.
  2. When the testing is correct, choose Send data.

This will start sending 100 records per second in the Kinesis data stream. (To stop sending data, choose Stop Sending Data to Kinesis.)
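
If you can’t use the KDG, the following boto3 sketch produces records with the same shape as the preceding template directly into demo-data-stream. The literal values (names, addresses, and so on) are illustrative stand-ins for the Faker-style expressions that the KDG evaluates.

import json
import random
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

def random_record():
    # Builds a record with the same fields as the KDG template above
    return {
        "year": str(random.randint(2000, 2022)),
        "month": str(random.randint(1, 12)),
        "day": str(random.randint(1, 30)),
        "hour": str(random.randint(0, 24)),
        "minute": str(random.randint(0, 60)),
        "customerid": random.randint(5023, 59874),
        "firstname": "Jane",
        "lastname": "Doe",
        "dateofbirth": "1980-01-01",
        "city": "Seattle",
        "buildingnumber": random.randint(63, 947),
        "streetaddress": "100 Main St",
        "state": "WA",
        "zipcode": "98101",
        "country": "United States",
        "countrycode": "US",
        "phonenumber": "555-0100",
        "productname": "Widget",
        "transactionamount": random.randint(10, 150),
    }

for _ in range(100):
    record = random_record()
    kinesis.put_record(
        StreamName="demo-data-stream",
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=str(record["customerid"]),
    )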

Integrate Iceberg with AWS Glue

To add the Apache Iceberg Connector for AWS Glue, complete the following steps. The connector is free to use and supports AWS Glue 1.0, 2.0, and 3.0.

  1. On the AWS Glue console, choose AWS Glue Studio in the navigation pane.
  2. In the navigation pane, navigate to AWS Marketplace.
  3. Search for and choose Apache Iceberg Connector for AWS Glue.
  4. Choose Accept Terms and Continue to Subscribe.
  5. Choose Continue to Configuration.
  6. For Fulfillment option, choose your AWS Glue version.
  7. For Software version, choose the latest software version.
  8. Choose Continue to Launch.
  9. Under Usage Instructions, choose the link to activate the connector.
  10. Enter a name for the connection, then choose Create connection and activate the connector.
  11. Verify the new connector on the AWS Glue Studio Connectors page.

Create the AWS Glue Data Catalog database

The AWS Glue Data Catalog contains references to data that is used as sources and targets of your extract, transform, and load (ETL) jobs in AWS Glue. To create your data warehouse or data lake, you must catalog this data. The AWS Glue Data Catalog is an index to the location and schema of your data. You use the information in the Data Catalog to create and monitor your ETL jobs.

For this post, we create a Data Catalog database named icebergdemodb containing the metadata information of a table named customer, which will be queried through Athena.

  1. On the AWS Glue console, choose Databases in the navigation pane.
  2. Choose Add database.
  3. For Database name, enter icebergdemodb.

This creates an AWS Glue database for metadata storage.

Create a Data Catalog table in Iceberg format

In this step, we create a Data Catalog table in Iceberg table format.

  1. On the Athena console, create an Athena workgroup named demoworkgroup for SQL queries.
  2. Choose Athena engine version 3 for Query engine version.

For more information about Athena versions, refer to Changing Athena engine versions.

  1. Enter the S3 bucket location for Query result configuration under Additional configurations.
  2. Open the Athena query editor and choose demoworkgroup.
  3. Choose the database icebergdemodb.
  4. Enter and run the following DDL to create a table in the Data Catalog database icebergdemodb. Note that the TBLPROPERTIES section mentions ICEBERG as the table type and LOCATION points to the S3 folder (custdata) URI created in earlier steps. This DDL command is available on the GitHub repo.
CREATE TABLE icebergdemodb.customer(
year string,
month string,
day string,
hour string,
minute string,
customerid string,
firstname string,
lastname string,
dateofbirth string,
city string,
buildingnumber string,
streetaddress string,
state string,
zipcode string,
country string,
countrycode string,
phonenumber string,
productname string,
transactionamount int)
LOCATION '<S3 Location URI>'
TBLPROPERTIES (
'table_type'='ICEBERG',
'format'='parquet',
'write_target_data_file_size_bytes'='536870912',
'optimize_rewrite_delete_file_threshold'='10'
);

After you run the command successfully, you can see the table customer in the Data Catalog.

Create an AWS Glue streaming job

In this section, we create the AWS Glue streaming job, which fetches the record from the Kinesis data stream using the Spark script editor.

  1. On the AWS Glue console, choose Jobs (new) in the navigation pane.
  2. For Create job, select Spark script editor.
  3. For Options, select Create a new script with boilerplate code.
  4. Choose Create.
  5. Enter the code available in the GitHub repo in the editor.

The sample code keeps appending data in the target location by fetching records from the Kinesis data stream.
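
The full job script is in the GitHub repo; the following is a simplified PySpark sketch of its core loop, shown only to illustrate the shape of a Glue streaming job that appends Kinesis records to the Iceberg table. The catalog name (glue_catalog), the connection options, and the window size are assumptions and depend on how the Iceberg connector and Spark session are configured in the actual job.

import sys
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Job parameters supplied in the next step (--kinesis_arn, --output_path)
args = getResolvedOptions(sys.argv, ["JOB_NAME", "kinesis_arn", "output_path"])

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Read the Kinesis data stream as a streaming DataFrame
kinesis_df = glue_context.create_data_frame.from_options(
    connection_type="kinesis",
    connection_options={
        "streamARN": args["kinesis_arn"],
        "classification": "json",
        "startingPosition": "TRIM_HORIZON",
        "inferSchema": "true",
    },
)

def process_batch(batch_df, batch_id):
    # Append each micro-batch to the Iceberg table registered in the Data Catalog
    if batch_df.count() > 0:
        batch_df.writeTo("glue_catalog.icebergdemodb.customer").append()

glue_context.forEachBatch(
    frame=kinesis_df,
    batch_function=process_batch,
    options={
        "windowSize": "60 seconds",
        "checkpointLocation": args["output_path"] + "/checkpoint/",
    },
)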

  1. Choose the Job details tab in the query editor.
  2. For Name, enter Demo_Job.
  3. For IAM role, choose demojobrole.
  4. For Type, choose Spark Streaming.
  5. For Glue Version, choose Glue 3.0.
  6. For Language, choose Python 3.
  7. For Worker type, choose G 0.25X.
  8. Select Automatically scale the number of workers.
  9. For Maximum number of workers, enter 5.
  10. Under Advanced properties, select Use Glue Data Catalog as the Hive metastore.
  11. For Connections, choose the connector you created.
  12. For Job parameters, enter the following key pairs (provide your S3 bucket and account ID):
  • --iceberg_job_catalog_warehouse – s3://streamingicebergdemo-XX/custdata/
  • --output_path – s3://streamingicebergdemo-XX
  • --kinesis_arn – arn:aws:kinesis:us-east-1:<AWS Account ID>:stream/demo-data-stream
  • --user-jars-first – True

  1. Choose Run to start the AWS Glue streaming job.
  2. To monitor the job, choose Monitoring in the navigation pane.
  3. Select Demo_Job and choose View run details to check the job run details and Amazon CloudWatch logs.

Run GDPR use cases on Athena

In this section, we demonstrate a few use cases that are relevant to GDPR alignment with the user data that’s stored in Iceberg format in the Amazon S3-based data lake as implemented in the previous steps. For this, let’s consider that the following requests are being initiated in the workflow to comply with the regulations:

  • Delete the records for the input customerid (for example, 59289)
  • Update phonenumber for the customerid (for example, 51842)

The IDs used in this example are samples only; they were created through the KDG template used earlier, which generates sample data. You can search for IDs in your implementation by querying through the Athena query editor. The steps remain the same.

Delete data by customer ID

Complete the following steps to fulfill the first use case:

  1. On the Athena console, make sure icebergdemodb is chosen as the database.
  2. Open the query editor.
  3. Enter the following query using a customer ID and choose Run:
SELECT count(*)
FROM icebergdemodb.customer
WHERE customerid = '59289';

This query gives the count of records for the input customerid before delete.

  1. Enter the following query with the same customer ID and choose Run:
MERGE INTO icebergdemodb.customer trg
USING (SELECT customerid
FROM icebergdemodb.customer
WHERE customerid = '59289') src
ON (trg.customerid = src.customerid)
WHEN MATCHED
THEN DELETE;

This query deletes the data for the input customerid as per the workflow generated.

  1. Test if there is data with the customer ID using a count query.

The count should be 0.

Update data by customer ID

Complete the following steps to test the second use case:

  1. On the Athena console, make sure icebergdemodb is chosen as the database.
  2. Open the query editor.
  3. Enter the following query with a customer ID and choose Run.
SELECT customerid, phonenumber
FROM icebergdemodb.customer
WHERE customerid = '51936';

This query gives the value for phonenumber before update.

  1. Run the following query to update the required columns:
MERGE INTO icebergdemodb.customer trg
USING (SELECT customerid
FROM icebergdemodb.customer
WHERE customerid = '51936') src
ON (trg.customerid = src.customerid)
WHEN MATCHED
THEN UPDATE SET phonenumber = '000';

This query updates the data to a dummy value.

  1. Run the SELECT query to check the update.

You can see the data is updated correctly.

Vacuum table

A good practice is to run the VACUUM command periodically on the table because operations like INSERT, UPDATE, DELETE, and MERGE will take place on the Iceberg table. See the following code:

VACUUM icebergdemodb.customer;
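
You can also submit VACUUM (or the MERGE statements shown earlier) programmatically, for example from a scheduled job. The following boto3 sketch assumes the workgroup and database created earlier in this post.

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Submit the VACUUM statement; the same call works for the MERGE statements above
response = athena.start_query_execution(
    QueryString="VACUUM icebergdemodb.customer",
    QueryExecutionContext={"Database": "icebergdemodb"},
    WorkGroup="demoworkgroup",
)
print(response["QueryExecutionId"])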

Considerations

The following are a few considerations to keep in mind for this implementation:

Clean up

Complete the following steps to clean up the resources you created for this post:

    1. Delete the custdata folder in the S3 bucket.
    2. Delete the CloudFormation stack.
    3. Delete the Kinesis data stream.
    4. Delete the S3 bucket storing the data.
    5. Delete the AWS Glue job and Iceberg connector.
    6. Delete the AWS Glue Data Catalog database and table.
    7. Delete the Athena workgroup.
    8. Delete the IAM roles and policies.

Conclusion

This post explained how you can use the Iceberg table format on Athena to implement GDPR use cases like data deletion and data upserts as required, when streaming data is being generated and ingested through AWS Glue streaming jobs in Amazon S3.

The operations for the Iceberg table that we demonstrated in this post aren’t all of the data operations that Iceberg supports. Refer to the Apache Iceberg documentation for details on various operations.


About the Authors

Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.

Rajdip Chaudhuri is Solutions Architect with Amazon Web Services specializing in data and analytics. He enjoys working with AWS customers and partners on data and analytics requirements. In his spare time, he enjoys soccer.

How Vanguard made their technology platform resilient and efficient by building cross-Region replication for Amazon Kinesis Data Streams

Post Syndicated from Raghu Boppanna original https://aws.amazon.com/blogs/big-data/how-vanguard-made-their-technology-platform-resilient-and-efficient-by-building-cross-region-replication-for-amazon-kinesis-data-streams/

This is a guest post co-written with Raghu Boppanna from Vanguard. 

At Vanguard, the Enterprise Advice line of business improves investor outcomes through digital access to superior, personalized, and affordable financial advice. They made it possible, in part, by driving economies of scale across the globe for investors with a highly resilient and efficient technical platform. Vanguard opted for a multi-Region architecture for this workload to help protect against impairments of Regional services. For high availability purposes, there is a need to make the data used by the workload available not just in the primary Region, but also in the secondary Region with minimal replication lag. In the event of a service impairment in the primary Region, the solution should be able to fail over to the secondary Region with as little data loss as possible and the ability to resume data ingestion.

Vanguard Cloud Technology Office and AWS partnered to build an infrastructure solution on AWS that met their resilience requirements. The multi-Region solution enables a robust fail-over mechanism, with built-in observability and recovery. The solution also supports streaming data from multiple sources to different Kinesis data streams. The solution is currently being rolled out to the different lines of business teams to improve the resilience posture of their workloads.

The use case discussed here requires Change Data Capture (CDC) to stream data from a remote data source (mainframe DB2) to Amazon Kinesis Data Streams, because the business capability depends on this data. Kinesis Data Streams is a fully managed, massively scalable, durable, and low-cost streaming service that can continuously capture and stream large amounts of data from multiple sources, and makes the data available for consumption within milliseconds. The service is built to be highly resilient and uses multiple Availability Zones to process and store data.

The solution discussed in this post explains how AWS and Vanguard innovated to build a resilient architecture to meet their high availability goals.

Solution overview

The solution uses AWS Lambda to replicate data from Kinesis data streams in the primary Region to a secondary Region. In the event of any service impairment impacting the CDC pipeline, the failover process promotes the secondary Region to primary for the producers and consumers. We use Amazon DynamoDB global tables for replication checkpoints, which allow the solution to resume data streaming from the checkpoint, and for a primary Region configuration flag that prevents an infinite replication loop of the same data back and forth.

The solution also provides the flexibility for Kinesis Data Streams consumers to use the primary or any secondary Region within the same AWS account.

The following diagram illustrates the reference architecture.

Let’s look at each component in detail:

  1. CDC processor (producer) – In this reference architecture, the producer is deployed on Amazon Elastic Compute Cloud (Amazon EC2) in both the primary and secondary Regions; it is active in the primary Region and on standby in the secondary Region. It captures CDC data from the external data source (like a DB2 database, as shown in the preceding architecture) and streams it to Kinesis Data Streams in the primary Region. Vanguard uses a third-party tool, Qlik Replicate, as its CDC processor. In addition to the actual row data from the remote data source, it produces a well-formed payload including the DB2 commit timestamp to the Kinesis data stream (example-stream-1 in this example). The following code is a sample payload containing only the primary key of the record that changed and the commit timestamp (for simplicity, the rest of the table row data is not shown):
    {
        "eventSource": "aws:kinesis",
        "kinesis": 
        {
             "ApproximateArrivalTimestamp": "Mon July 18 20:00:00 UTC 2022",
             "SequenceNumber": "49544985256907370027570885864065577703022652638596431874",
             "PartitionKey": "12349999",
             "KinesisSchemaVersion": "1.0",
             "Data": "eyJLZXkiOiAxMjM0OTk5OSwiQ29tbWl0VGltZXN0YW1wIjogIjIwMjItMDctMThUMjA6MDA6MDAifQ=="
        },
        "eventId": "shardId-000000000000:49629136582982516722891309362785181370337771525377097730",
        "invokeIdentityArn": "arn:aws:iam::6243876582:role/kds-crr-LambdaRole-1GZWP67437SD",
        "eventName": "aws:kinesis:record",
        "eventVersion": "1.0",
        "eventSourceARN": "arn:aws:kinesis:us-east-1:6243876582:stream/kds-stream-1/consumer/kds-crr:6243876582",
        "awsRegion": "us-east-1"
    }

    The Base64 decoded value of Data is as follows. The actual Kinesis record would contain the entire row data of the table row that changed, in addition to the primary key and the commit timestamp.

    {"Key": 12349999,"CommitTimestamp": "2022-07-18T20:00:00"}

    The CommitTimestamp in the Data field is used in the replication checkpoint and is critical to accurately track how much of the stream data has been replicated to the secondary Region (see the decoding sketch after this list). The checkpoint can then be used to facilitate a CDC processor (producer) failover and accurately resume producing data from the replication checkpoint timestamp onwards.

    The alternative to using a remote data source CommitTimestamp (if unavailable) is to use the ApproximateArrivalTimestamp (which is the timestamp when the record is actually written to the data stream).

  2. Cross-Region replication Lambda function – The function is deployed to both primary and secondary Regions. It’s set up with an event source mapping to the data stream containing CDC data. The same function can be used to replicate data of multiple streams. It’s invoked with a batch of records from Kinesis Data Streams and replicates the batch to a target replication Region (which is provided via the Lambda configuration environment). For cost considerations, if the CDC data is actively produced into the primary Region only, the reserved concurrency of the function in the secondary Region can be set to zero, and modified during regional failover. The function has AWS Identity and Access Management (IAM) role permissions to do the following:
    • Read and write to the DynamoDB global tables used in this solution, within the same account.
    • Read and write to Kinesis Data Streams in both Regions within the same account.
    • Publish custom metrics to Amazon CloudWatch in both Regions within the same account.
  3. Replication checkpoint – The replication checkpoint uses the DynamoDB global table in both the primary and secondary Regions. It’s used by the cross-Region replication Lambda function to persist the commit timestamp of the last replication record as the replication checkpoint for every stream that is configured for replication. For this post, we create and use a global table called kdsReplicationCheckpoint.
  4. Active Region config – The active Region uses the DynamoDB global table in both primary and secondary Regions. It uses the native cross-Region replication capability of the global table to replicate the configuration. It’s pre-populated with data about which is the primary Region for a stream, to prevent replication back to the primary Region by the Lambda function in the standby Region. This configuration may not be required if the Lambda function in the standby Region has a reserved concurrency set to zero, but can serve as a safety check to avoid infinite replication loop of the data. For this post, we create a global table called kdsActiveRegionConfig and put an item with the following data:
    {
     "stream-name": "example-stream-1",
     "active-region" : "us-east-1"
    }
    
  5. Kinesis Data Streams – The stream to which the CDC processor produces the data. For this post, we use a stream called example-stream-1 in both the Regions, with the same shard configuration and access policies.
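
As referenced in the CDC processor description above, the following Python sketch shows how a consumer (or the failover tooling) can decode the record payload and pull out the commit timestamp. It assumes the event shape shown earlier in this list; field names may differ slightly in a real Lambda event.

import base64
import json

def commit_timestamp(kinesis_record):
    # Decode the Base64 payload produced by the CDC processor and return its commit timestamp
    payload = json.loads(base64.b64decode(kinesis_record["kinesis"]["Data"]))
    return payload["CommitTimestamp"]

# For the sample event shown earlier, this returns "2022-07-18T20:00:00"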

Sequence of steps in cross-Region replication

Let’s briefly look at how the architecture is exercised using the following sequence diagram.

The sequence consists of the following steps:

  1. The CDC processor (in us-east-1) reads the CDC data from the remote data source.
  2. The CDC processor (in us-east-1) streams the CDC data to Kinesis Data Streams (in us-east-1).
  3. The cross-Region replication Lambda function (in us-east-1) consumes the data from the data stream (in us-east-1). The enhanced fan-out pattern is recommended for dedicated and increased throughput for cross-Region replication.
  4. The replicator Lambda function (in us-east-1) validates its current Region with the active Region configuration for the stream being consumed, with the help of the kdsActiveRegionConfig DynamoDB global table. The following sample code (in Java) can help illustrate the condition being evaluated:
    // Fetch the current AWS Region from the Lambda function's environment
    String currentAWSRegion = System.getenv("AWS_REGION");
    // Read the stream name from the first Kinesis Record once for the entire batch being processed. This is done because we are reusing the same Lambda function for replicating multiple streams.
    String currentStreamNameConsumed = kinesisRecord.getEventSourceARN().split(":")[5].split("/")[1];
    // Build the DynamoDB query condition using the stream name
    Map<String, Condition> keyConditions = singletonMap("streamName", Condition.builder().comparisonOperator(EQ).attributeValueList(AttributeValue.builder().s(currentStreamNameConsumed).build()).build());
    // Query the DynamoDB Global Table
    QueryResponse queryResponse = ddbClient.query(QueryRequest.builder().tableName("kdsActiveRegionConfig").keyConditions(keyConditions).attributesToGet("ActiveRegion").build());
  5. The function evaluates the response from DynamoDB with the following code:
    // Evaluate the response
    if (queryResponse.hasItems()) {
           AttributeValue activeRegionForStream = queryResponse.items().get(0).get("ActiveRegion");
           return currentAWSRegion.equalsIgnoreCase(activeRegionForStream.s());
    }
  6. Depending on the response, the function takes the following actions:
    1. If the response is true, the replicator function produces the records to Kinesis Data Streams in us-east-2 in a sequential manner.
      • If there is a failure, the sequence number of the record is tracked and the iteration is broken. The function returns the list of failed sequence numbers. By returning the failed sequence number, the solution uses the feature of Lambda checkpointing to be able to resume processing of a batch of records with partial failures. This is useful when handling any service impairments, where the function tries to replicate the data across Regions to ensure stream parity and no data loss.
      • If there are no failures, an empty list is returned, which indicates the batch was successful.
    2. If the response is false, the replicator function returns without performing any replication. To reduce the cost of the Lambda invocations, you can set the reserved concurrency of the function in the DR Region (us-east-2) to zero. This will prevent the function from being invoked. When you failover, you can update this value to an appropriate number based on the CDC throughput and set the reserved concurrency of the function in us-east-1 to zero to prevent it from executing unnecessarily.
  7. After all the records are produced to Kinesis Data Streams in us-east-2, the replicator function checkpoints to the kdsReplicationCheckpoint DynamoDB global table (in us-east-1) with the following data (a condensed sketch of this replicate-then-checkpoint flow follows this list):
    { "streamName": "example-stream-1", "lastReplicatedTimestamp": "2022-07-18T20:00:00" }
    
  8. The function returns after successfully processing the batch of records.
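
The following boto3 sketch condenses the replicate-then-checkpoint flow of steps 6 and 7 (the active Region check from steps 4 and 5 is omitted for brevity). The record field names and environment variable are illustrative; in the actual Lambda function they come from the Kinesis event source mapping.

import os
import boto3

TARGET_REGION = os.environ.get("TARGET_REGION", "us-east-2")   # illustrative configuration
target_kinesis = boto3.client("kinesis", region_name=TARGET_REGION)
checkpoint_table = boto3.resource("dynamodb").Table("kdsReplicationCheckpoint")

def replicate_batch(stream_name, records):
    failed = []
    for record in records:
        try:
            # Produce the record to the same-named stream in the target Region
            target_kinesis.put_record(
                StreamName=stream_name,
                Data=record["data"],                 # raw payload bytes
                PartitionKey=record["partition_key"],
            )
        except Exception:
            # Return the failed sequence number so Lambda resumes from this record
            failed.append({"itemIdentifier": record["sequence_number"]})
            break
    if not failed:
        # Checkpoint only after the whole batch was produced to the target Region
        checkpoint_table.put_item(Item={
            "streamName": stream_name,
            "lastReplicatedTimestamp": records[-1]["commit_timestamp"],
        })
    return {"batchItemFailures": failed}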

Performance considerations

The performance expectations of the solution should be understood with respect to the following factors:

  • Region selection – The replication latency is directly proportional to the distance the data travels, so choose your primary and secondary Regions carefully
  • Velocity – The incoming velocity of the data or the volume of data being replicated
  • Payload size – The size of the payload being replicated

Monitor the Cross-Region replication

It’s recommended to track and observe the replication as it happens. You can tailor the Lambda function to publish custom metrics to CloudWatch with the following metrics at the end of every invocation. Publishing these metrics to both the primary and secondary Regions helps protect yourself from impairments affecting observability in the primary Region.

  • Throughput – The current Lambda invocation batch size
  • ReplicationLagSeconds – The difference between the current timestamp (after processing all the records) and the ApproximateArrivalTimestamp of the last record that was replicated

The following example CloudWatch metric graph shows the average replication lag was 2 seconds with a throughput of 100 records replicated from us-east-1 to us-east-2.
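
A minimal sketch of publishing these two metrics from the replication function might look like the following; the namespace and dimension names are illustrative.

import time
import boto3

# Publish to both Regions so observability survives a Regional impairment
cloudwatch_clients = [
    boto3.client("cloudwatch", region_name=region)
    for region in ("us-east-1", "us-east-2")
]

def publish_replication_metrics(stream_name, batch_size, last_arrival_timestamp):
    lag_seconds = time.time() - last_arrival_timestamp
    for cloudwatch in cloudwatch_clients:
        cloudwatch.put_metric_data(
            Namespace="KinesisCrossRegionReplication",   # illustrative namespace
            MetricData=[
                {"MetricName": "Throughput", "Value": batch_size, "Unit": "Count",
                 "Dimensions": [{"Name": "StreamName", "Value": stream_name}]},
                {"MetricName": "ReplicationLagSeconds", "Value": lag_seconds, "Unit": "Seconds",
                 "Dimensions": [{"Name": "StreamName", "Value": stream_name}]},
            ],
        )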

Common failover strategy

During any impairment impacting the CDC pipeline in the primary Region, business continuity or disaster recovery needs may dictate a pipeline failover to the secondary (standby) Region. This means several things need to be done as part of the failover process:

  • If possible, stop all the CDC tasks in the CDC processor tool in us-east-1.
  • The CDC processor must be failed over to the secondary Region, so that it can read the CDC data from the remote data source while operating out of the standby Region.
  • The kdsActiveRegionConfig DynamoDB global table needs to be updated. For instance, for the stream example-stream-1 used in our example, the active Region is changed to us-east-2:
{
 "stream-name": "example-stream-1",
 "active-region" : "us-east-2"
}
  • All the stream checkpoints need to be read from the kdsReplicationCheckpoint DynamoDB global table (in us-east-2), and the timestamps from each of the checkpoints are used to start the CDC tasks in the producer tool in us-east-2 Region. This minimizes the chances of data loss and accurately resumes streaming the CDC data from the remote data source from the checkpoint timestamp onwards.
  • If using reserved concurrency to control Lambda invocations, set the value to zero in the primary Region (us-east-1) and to a suitable non-zero value in the secondary Region (us-east-2).

Vanguard’s multi-step failover strategy

Some of the third-party tools that Vanguard uses have a two-step CDC process of streaming data from a remote data source to a destination. Vanguard’s tool of choice for their CDC processor follows this two-step approach:

  1. The first step involves setting up a log stream task that reads the data from the remote data source and persists in a staging location.
  2. The second step involves setting up individual consumer tasks that read data from the staging location—which could be on Amazon Elastic File System (Amazon EFS) or Amazon FSx, for example—and stream it to the destination. The flexibility here is that each of these consumer tasks can be triggered to stream from different commit timestamps. The log stream task usually starts reading data from the minimum of all the commit timestamps used by the consumer tasks.

Let’s look at an example to explain the scenario:

  • Consumer task A is streaming data from a commit timestamp 2022-07-19T20:00:00 onwards to example-stream-1.
  • Consumer task B is streaming data from a commit timestamp 2022-07-19T21:00:00 onwards to example-stream-2.
  • In this situation, the log stream should read data from the remote data source from the minimum of the timestamps used by the consumer tasks, which is 2022-07-19T20:00:00.

The following sequence diagram demonstrates the exact steps to run during a failover to us-east-2 (the standby Region).

The steps are as follows:

  1. The failover process is triggered in the standby Region (us-east-2 in this example) when required. Note that the trigger can be automated using comprehensive health checks of the pipeline in the primary Region.
  2. The failover process updates the kdsActiveRegionConfig DynamoDB global table with the new value for the Region as us-east-2 for all the stream names.
  3. The next step is to fetch all the stream checkpoints from the kdsReplicationCheckpoint DynamoDB global table (in us-east-2).
  4. After the checkpoint information is read, the failover process finds the minimum of all the lastReplicatedTimestamp.
  5. The log stream task in the CDC processor tool is started in us-east-2 with the timestamp found in Step 4. It begins reading CDC data from the remote data source from this timestamp onwards and persists them in the staging location on AWS.
  6. The next step is to start all the consumer tasks to read data from the staging location and stream to the destination data stream. This is where each consumer task is supplied with the appropriate timestamp from the kdsReplicationCheckpoint table according to the streamName to which the task streams the data.

After all the consumer tasks are started, data is produced to the Kinesis data streams in us-east-2. From there on, the process of cross-Region replication is the same as described earlier – the replication Lambda function in us-east-2 starts replicating data to the data stream in us-east-1.

The consumer applications reading data from the streams are expected to be idempotent so they can handle duplicates. Duplicates can be introduced into the stream for several reasons, including the following:

  • The producer or the CDC processor introduces duplicates into the stream while replaying the CDC data during a failover
  • DynamoDB global tables use asynchronous replication of data across Regions; if the kdsReplicationCheckpoint table data has a replication lag, the failover process may use an older checkpoint timestamp and replay some CDC data

Also, consumer applications should checkpoint the CommitTimestamp of the last record that was consumed. This is to facilitate better monitoring and recovery.

Path to maturity: Automated recovery

The ideal state is to fully automate the failover process, reducing time to recover and meeting the resilience Service Level Objective (SLO). However, in most organizations, the decision to fail over, fail back, and trigger the failover requires manual intervention in assessing the situation and deciding the outcome. Creating scripted automation to perform the failover that can be run by a human is a good place to start.

Vanguard has automated all of the steps of the failover, but still has humans make the decision on when to invoke it. You can customize the solution to meet your needs, depending on the CDC processor tool you use in your environment.

Conclusion

In this post, we described how Vanguard innovated and built a solution for replicating data across Regions in Kinesis Data Streams to make the data highly available. We also demonstrated a robust checkpoint strategy to facilitate a Regional failover of the replication process when needed. The solution also illustrated how to use DynamoDB global tables for tracking the replication checkpoints and configuration. With this architecture, Vanguard was able to deploy workloads depending on the CDC data to multiple Regions to meet business needs of high availability in the face of service impairments impacting CDC pipelines in the primary Region.

If you have any feedback please leave a comment in the Comments section below.


About the authors

Raghu Boppanna works as an Enterprise Architect at Vanguard’s Chief Technology Office. Raghu specializes in Data Analytics, Data Migration/Replication including CDC Pipelines, Disaster Recovery and Databases. He has earned several AWS Certifications including AWS Certified Security – Specialty & AWS Certified Data Analytics – Specialty.

Parameswaran V Vaidyanathan is a Senior Cloud Resilience Architect with Amazon Web Services. He helps large enterprises achieve the business goals by architecting and building scalable and resilient solutions on the AWS Cloud.

Richa Kaul is a Senior Leader in Customer Solutions serving Financial Services customers. She is based out of New York. She has extensive experience in large scale cloud transformation, employee excellence, and next generation digital solutions. She and her team focus on optimizing value of cloud by building performant, resilient and agile solutions. Richa enjoys multi sports like triathlons, music, and learning about new technologies.

Mithil Prasad is a Principal Customer Solutions Manager with Amazon Web Services. In his role, Mithil works with Customers to drive cloud value realization, provide thought leadership to help businesses achieve speed, agility, and innovation.

Implementing architectural patterns with Amazon EventBridge Pipes

Post Syndicated from David Boyne original https://aws.amazon.com/blogs/compute/implementing-architectural-patterns-with-amazon-eventbridge-pipes/

This post is written by Dominik Richter (Solutions Architect)

Architectural patterns help you solve recurring challenges in software design. They are blueprints that have been used and tested many times. When you design distributed applications, enterprise integration patterns (EIP) help you integrate distributed components. For example, they describe how to integrate third-party services into your existing applications. But patterns are technology agnostic. They do not provide any guidance on how to implement them.

This post shows you how to use Amazon EventBridge Pipes to implement four common enterprise integration patterns (EIP) on AWS. This helps you to simplify your architectures. Pipes is a feature of Amazon EventBridge to connect your AWS resources. Using Pipes can reduce the complexity of your integrations. It can also reduce the amount of code you have to write and maintain.

Content filter pattern

The content filter pattern removes unwanted content from a message before forwarding it to a downstream system. Use cases for this pattern include reducing storage costs by removing unnecessary data or removing personally identifiable information (PII) for compliance purposes.

In the following example, the goal is to retain only non-PII data from “ORDER”-events. To achieve this, you must remove all events that aren’t “ORDER” events. In addition, you must remove any field in the “ORDER” events that contain PII.

While you can use this pattern with various sources and targets, the following architecture shows this pattern with Amazon Kinesis. EventBridge Pipes filtering discards unwanted events. EventBridge Pipes input transformers remove PII data from events that are forwarded to the second stream with longer retention.

Instead of using Pipes, you could connect the streams using an AWS Lambda function. This requires you to write and maintain code to read from and write to Kinesis. However, Pipes may be more cost effective than using a Lambda function.

Some situations require an enrichment function, for example, if your goal is to mask an attribute without removing it entirely. You could replace the attribute “birthday” with an “age_group” attribute.

In this case, if you use Pipes for integration, the Lambda function contains only your business logic. On the other hand, if you use Lambda for both integration and business logic, you do not pay for Pipes. At the same time, you add complexity to your Lambda function, which now contains integration code. This can increase its execution time and cost. Therefore, your priorities determine the best option and you should compare both approaches to make a decision.

To implement Pipes using the AWS Cloud Development Kit (AWS CDK), use the following source code. The full source code for all of the patterns that are described in this blog post can be found in the AWS samples GitHub repo.

const filterPipe = new pipes.CfnPipe(this, 'FilterPipe', {
  roleArn: pipeRole.roleArn,
  source: sourceStream.streamArn,
  target: targetStream.streamArn,
  sourceParameters: { filterCriteria: { filters: [{ pattern: '{"data" : {"event_type" : ["ORDER"] }}' }] }, kinesisStreamParameters: { startingPosition: 'LATEST' } },
  targetParameters: { inputTemplate: '{"event_type": <$.data.event_type>, "currency": <$.data.currency>, "sum": <$.data.sum>}', kinesisStreamParameters: { partitionKey: 'event_type' } },
});

To allow access to source and target, you must assign the correct permissions:

const pipeRole = new iam.Role(this, 'FilterPipeRole', { assumedBy: new iam.ServicePrincipal('pipes.amazonaws.com') });

sourceStream.grantRead(pipeRole);
targetStream.grantWrite(pipeRole);

Message translator pattern

In an event-driven architecture, event producers and consumers are independent of each other. Therefore, they may exchange events of different formats. To enable communication, the events must be translated. This is known as the message translator pattern. For example, an event may contain an address, but the consumer expects coordinates.

If a computation is required to translate messages, use the enrichment step. The following architecture diagram shows how to accomplish this enrichment via API destinations. In the example, you can call an existing geocoding service to resolve addresses to coordinates.

There may be cases where the translation is purely syntactical. For example, a field may have a different name or structure.

You can achieve these translations without enrichment by using input transformers.

Here is the source code for the pipe, including the role with the correct permissions:

const pipeRole = new iam.Role(this, 'MessageTranslatorRole', { assumedBy: new iam.ServicePrincipal('pipes.amazonaws.com'), inlinePolicies: { invokeApiDestinationPolicy } });

sourceQueue.grantConsumeMessages(pipeRole);
targetStepFunctionsWorkflow.grantStartExecution(pipeRole);

const messageTranslatorPipe = new pipes.CfnPipe(this, 'MessageTranslatorPipe', {
  roleArn: pipeRole.roleArn,
  source: sourceQueue.queueArn,
  target: targetStepFunctionsWorkflow.stateMachineArn,
  enrichment: enrichmentDestination.apiDestinationArn,
  sourceParameters: { sqsQueueParameters: { batchSize: 1 } },
});

Normalizer pattern

The normalizer pattern is similar to the message translator but there are different source components with different formats for events. The normalizer pattern routes each event type through its specific message translator so that downstream systems process messages with a consistent structure.

The example shows a system where different source systems store the name property differently. To process the messages differently based on their source, use an AWS Step Functions workflow. You can separate by event type and then have individual paths perform the unifying process. This diagram visualizes that you can call a Lambda function if needed. However, in basic cases like the preceding “name” example, you can modify the events using Amazon States Language (ASL).

In the example, you unify the events using Step Functions before putting them on your event bus. As is often the case with architectural choices, there are alternatives. Another approach is to introduce separate queues for each source system, connected by its own pipe containing only its unification actions.

This is the source code for the normalizer pattern using a Step Functions workflow as enrichment:

const pipeRole = new iam.Role(this, 'NormalizerRole', { assumedBy: new iam.ServicePrincipal('pipes.amazonaws.com') });

sourceQueue.grantConsumeMessages(pipeRole);
enrichmentWorkflow.grantStartSyncExecution(pipeRole);
normalizerTargetBus.grantPutEventsTo(pipeRole);

const normalizerPipe = new pipes.CfnPipe(this, 'NormalizerPipe', {
  roleArn: pipeRole.roleArn,
  source: sourceQueue.queueArn,
  target: normalizerTargetBus.eventBusArn,
  enrichment: enrichmentWorkflow.stateMachineArn,
  sourceParameters: { sqsQueueParameters: { batchSize: 1 } },
});

Claim check pattern

To reduce the size of the events in your event-driven application, you can temporarily remove attributes. This approach is known as the claim check pattern. You split a message into a reference (“claim check”) and the associated payload. Then, you store the payload in external storage and add only the claim check to events. When you process events, you retrieve relevant parts of the payload using the claim check. For example, you can retrieve a user’s name and birthday based on their userID.

The claim check pattern has two parts. First, when an event is received, you split it and store the payload elsewhere. Second, when the event is processed, you retrieve the relevant information. You can implement both aspects with a pipe.

In the first pipe, you use the enrichment to split the event; in the second, to retrieve the payload. Below are several enrichment options, such as using an external API via API destinations, or using Amazon DynamoDB via Lambda. Other enrichment options are Amazon API Gateway and Step Functions.

Using a pipe to split and retrieve messages has three advantages. First, you keep events concise as they move through the system. Second, you ensure that the event contains all relevant information when it is processed. Third, you encapsulate the complexity of splitting and retrieving within the pipe.
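
For illustration, the enrichment Lambda function that performs the split in the first pipe could look like the following sketch; the claim-check-payloads DynamoDB table name is a hypothetical placeholder, and the function simply stores each payload under a generated claim check ID and forwards only the reference:

import json
import uuid
import boto3

dynamodb = boto3.resource("dynamodb")
# Hypothetical payload store; any durable store (for example, Amazon S3) works as well.
payload_table = dynamodb.Table("claim-check-payloads")


def lambda_handler(event, context):
    """Enrichment step: store the full payload and return only the claim check."""
    slim_events = []
    for record in event:  # EventBridge Pipes passes a batch of source events.
        claim_check_id = str(uuid.uuid4())
        # Store the full payload externally.
        payload_table.put_item(Item={"id": claim_check_id, "payload": json.dumps(record)})
        # Forward only the reference; downstream targets stay lightweight.
        slim_events.append({"claimCheckId": claim_check_id})
    return slim_events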

The following code implements a pipe for the claim check pattern using the CDK:

const pipeRole = new iam.Role(this, 'ClaimCheckRole', { assumedBy: new iam.ServicePrincipal('pipes.amazonaws.com') });

claimCheckLambda.grantInvoke(pipeRole);
sourceQueue.grantConsumeMessages(pipeRole);
targetWorkflow.grantStartExecution(pipeRole);

const claimCheckPipe = new pipes.CfnPipe(this, 'ClaimCheckPipe', {
  roleArn: pipeRole.roleArn,
  source: sourceQueue.queueArn,
  target: targetWorkflow.stateMachineArn,
  enrichment: claimCheckLambda.functionArn,
  sourceParameters: { sqsQueueParameters: { batchSize: 1 } },
  targetParameters: { stepFunctionStateMachineParameters: { invocationType: 'FIRE_AND_FORGET' } },
});

Conclusion

This blog post shows how you can implement four enterprise integration patterns with Amazon EventBridge Pipes. In many cases, this reduces the amount of code you have to write and maintain. It can also simplify your architectures and, in some scenarios, reduce costs.

You can find the source code for all the patterns on the AWS samples GitHub repo.

For more serverless learning resources, visit Serverless Land. To find more patterns, go directly to the Serverless Patterns Collection.

Serverless logging with Amazon OpenSearch Service and Amazon Kinesis Data Firehose

Post Syndicated from Jon Handler original https://aws.amazon.com/blogs/big-data/serverless-logging-with-amazon-opensearch-service-and-amazon-kinesis-data-firehose/

In this post, you will learn how you can use Amazon Kinesis Data Firehose to build a log ingestion pipeline to send VPC flow logs to Amazon OpenSearch Serverless. First, you create the OpenSearch Serverless collection you use to store VPC flow logs, then you create a Kinesis Data Firehose delivery pipeline that forwards the flow logs to OpenSearch Serverless. Finally, you enable delivery of VPC flow logs to your Firehose delivery stream. The following diagram illustrates the solution workflow.

OpenSearch Serverless is a new serverless option offered by Amazon OpenSearch Service. OpenSearch Serverless makes it simple to run petabyte-scale search and analytics workloads without having to configure, manage, or scale OpenSearch clusters. OpenSearch Serverless automatically provisions and scales the underlying resources to deliver fast data ingestion and query responses for even the most demanding and unpredictable workloads.

Kinesis Data Firehose is a popular service that delivers streaming data from over 20 AWS services to over 15 analytical and observability tools such as OpenSearch Serverless. Kinesis Data Firehose is great if you're looking for a fast and easy way to send your VPC flow logs data to your OpenSearch Serverless collection in minutes without a single line of code and without building or managing your own data ingestion and delivery infrastructure.

VPC flow logs capture the traffic information going to and from your network interfaces in your VPC. With the launch of Kinesis Data Firehose support for OpenSearch Serverless, you can analyze your VPC flow logs with just a few clicks. Kinesis Data Firehose provides a true end-to-end serverless mechanism to deliver your flow logs to OpenSearch Serverless, where you can use OpenSearch Dashboards to search through those logs, create dashboards, detect anomalies, and send alerts. VPC flow logs help you answer questions like:

  • What percentage of your traffic is getting dropped?
  • How much traffic is getting generated for specific sources and destinations?

Create your OpenSearch Serverless collection

To get started, you first create a collection. An OpenSearch Serverless collection is a logical grouping of one or more indexes that represent an analytics workload. Complete the following steps:

  1. On the OpenSearch Service console, choose Collections under Serverless in the navigation pane.
  2. Choose Create a collection.
  3. For Collection name, enter a name (for example, vpc-flow-logs).
  4. For Collection type, choose Time series.
  5. For Encryption, choose your preferred encryption setting:
    1. Choose Use AWS owned key to use an AWS managed key.
    2. Choose a different AWS KMS key to use your own AWS Key Management Service (AWS KMS) key.
  6. For Network access settings, choose your preferred setting:
    1. Choose VPC to use a VPC endpoint.
    2. Choose Public to use a public endpoint.

AWS recommends that you use a VPC endpoint for all production workloads. For this walkthrough, select Public.

  7. Choose Create.

It should take a couple of minutes to create the collection.

The following graphic gives a quick demonstration of creating the OpenSearch Serverless collection via the preceding steps.

At this point, you have successfully created a collection for OpenSearch Serverless. Next, you create a delivery pipeline for Kinesis Data Firehose.
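
If you prefer to script this step, the following boto3 sketch creates a comparable time series collection; it assumes an encryption security policy covering the collection name already exists in your account:

import boto3

aoss = boto3.client("opensearchserverless")

# Assumes an encryption policy matching collections named vpc-flow-logs is already in place.
response = aoss.create_collection(
    name="vpc-flow-logs",
    type="TIMESERIES",
    description="Collection for VPC flow logs",
)
print(response["createCollectionDetail"]["status"])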

Create a Kinesis Data Firehose delivery stream

To set up a delivery stream for Kinesis Data Firehose, complete the following steps:

  1. On the Kinesis Data Firehose console, choose Create delivery stream.
  2. For Source, specify Direct PUT.

Check out Source, Destination, and Name to learn more about different sources supported by Kinesis Data Firehose.

  3. For Destination, choose Amazon OpenSearch Serverless.
  4. For Delivery stream name, enter a name (for example, vpc-flow-logs).
  5. Under Destination settings, in the OpenSearch Serverless collection settings, choose Browse.
  6. Select vpc-flow-logs.
  7. Choose Choose.

If your collection is still creating, wait a few minutes and try again.

  8. For Index, specify vpc-flow-logs.
  9. In the Backup settings section, select Failed data only for the Source record backup in Amazon S3.

Kinesis Data Firehose uses Amazon Simple Storage Service (Amazon S3) to back up failed data that it attempts to deliver to your chosen destination. If you want to keep all data, select All data.

  10. For S3 Backup Bucket, choose Browse to select an existing S3 bucket, or choose Create to create a new bucket.
  11. Choose Create delivery stream.

The following graphic gives a quick demonstration of creating the Kinesis Data Firehose delivery stream via the preceding steps.

At this point, you have successfully created a delivery stream for Kinesis Data Firehose, which you will use to stream data from your VPC flow logs and send it to your OpenSearch Serverless collection.
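
As a rough programmatic equivalent of the preceding console steps (a sketch, not the exact console configuration), you can create the delivery stream with boto3; the role ARN, collection endpoint, and bucket ARN are placeholders that you must replace with your own values:

import boto3

firehose = boto3.client("firehose")

# Placeholder ARNs and endpoint; replace with your own delivery role, collection, and bucket.
firehose.create_delivery_stream(
    DeliveryStreamName="vpc-flow-logs",
    DeliveryStreamType="DirectPut",
    AmazonOpenSearchServerlessDestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        "CollectionEndpoint": "https://abc123.us-east-1.aoss.amazonaws.com",
        "IndexName": "vpc-flow-logs",
        "S3BackupMode": "FailedDocumentsOnly",
        "S3Configuration": {
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
            "BucketARN": "arn:aws:s3:::my-flow-logs-backup-bucket",
        },
    },
)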

Set up the data access policy for your OpenSearch Serverless collection

Before you send any logs to OpenSearch Serverless, you need to create a data access policy within OpenSearch Serverless that allows Kinesis Data Firehose to write to the vpc-flow-logs index in your collection. Complete the following steps:

  1. On the Kinesis Data Firehose console, choose the Configuration tab on the details page for the vpc-flow-logs delivery stream you just created.
  2. In the Permissions section, note down the AWS Identity and Access Management (IAM) role.
  3. Navigate to the vpc-flow-logs collection details page on the OpenSearch Serverless dashboard.
  4. Under Data access, choose Manage data access.
  5. Choose Create access policy.
  6. In the Name and description section, specify an access policy name, add a description, and select JSON as the policy definition method.
  7. Add the following policy in the JSON editor. Provide the collection name and index you specified during the delivery stream creation in the policy. Provide the IAM role name that you got from the permissions page of the Firehose delivery stream, and the account ID for your AWS account.
    [
      {
        "Rules": [
          {
            "ResourceType": "index",
            "Resource": [
              "index/<collection-name>/<index-name>"
            ],
            "Permission": [
              "aoss:WriteDocument",
              "aoss:CreateIndex",
              "aoss:UpdateIndex"
            ]
          }
        ],
        "Principal": [
          "arn:aws:sts::<aws-account-id>:assumed-role/<IAM-role-name>/*"
        ]
      }
    ]

  8. Choose Create.

The following graphic gives a quick demonstration of creating the data access policy via the preceding steps.
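
You can also create the same data access policy programmatically; in the following sketch, the collection name, index name, account ID, and Firehose delivery role name are placeholder values:

import json
import boto3

aoss = boto3.client("opensearchserverless")

# Placeholder values; replace with your collection, index, account ID, and Firehose role name.
policy = [
    {
        "Rules": [
            {
                "ResourceType": "index",
                "Resource": ["index/vpc-flow-logs/vpc-flow-logs"],
                "Permission": ["aoss:WriteDocument", "aoss:CreateIndex", "aoss:UpdateIndex"],
            }
        ],
        "Principal": ["arn:aws:sts::123456789012:assumed-role/<IAM-role-name>/*"],
    }
]

aoss.create_access_policy(
    name="vpc-flow-logs-firehose-access",
    type="data",
    policy=json.dumps(policy),
)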

Set up VPC flow logs

In the final step of this post, you enable flow logs for your VPC with the destination as Kinesis Data Firehose, which sends the data to OpenSearch Serverless.

  1. Navigate to the AWS Management Console.
  2. Search for “VPC” and then choose Your VPCs in the search result (hover over the VPC rectangle to reveal the link).
  3. Choose the VPC ID link for one of your VPCs.
  4. On the Flow Logs tab, choose Create flow log.
  5. For Name, enter a name.
  6. Leave the Filter set to All. You can limit the traffic by selecting Accept or Reject.
  7. Under Destination, select Send to Kinesis Firehose in the same account.
  8. For Kinesis Firehose delivery stream name, choose vpc-flow-logs.
  9. Choose Create flow log.

The following graphic gives a quick demonstration of creating a flow log for your VPC following the preceding steps.
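
If you want to enable the flow log from code instead, the following sketch assumes a placeholder VPC ID and the ARN of the vpc-flow-logs delivery stream:

import boto3

ec2 = boto3.client("ec2")

# Placeholder identifiers; replace with your VPC ID and delivery stream ARN.
ec2.create_flow_logs(
    ResourceIds=["vpc-0123456789abcdef0"],
    ResourceType="VPC",
    TrafficType="ALL",
    LogDestinationType="kinesis-data-firehose",
    LogDestination="arn:aws:firehose:us-east-1:123456789012:deliverystream/vpc-flow-logs",
)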

Examine the VPC flow logs data in your collection using OpenSearch Dashboards

You won’t be able to access your collection data until you configure data access. Data access policies allow users to access the actual data within a collection.

To create a data access policy for OpenSearch Dashboards, complete the following steps:

  1. Navigate to the vpc-flow-logs collection details page on the OpenSearch Serverless dashboard.
  2. Under Data access, choose Manage data access.
  3. Choose Create access policy.
  4. In the Name and description section, specify an access policy name, add a description, and select JSON as the policy definition method.
  5. Add the following policy in the JSON editor. Provide the collection name and index you specified during the delivery stream creation in the policy. Additionally, provide the IAM user and the account ID for your AWS account. You need to make sure that you have the AWS access and secret keys for the principal that you specified as an IAM user.
    [
      {
        "Rules": [
          {
            "Resource": [
              "index/<collection-name>/<index-name>"
            ],
            "Permission": [
              "aoss:ReadDocument"
            ],
            "ResourceType": "index"
          }
        ],
        "Principal": [
          "arn:aws:iam::<aws-account-id>:user/<IAM-user-name>"
        ]
      }
    ]

  6. Choose Create.
  7. Navigate to OpenSearch Serverless and choose the collection you created (vpc-flow-logs).
  8. Choose the OpenSearch Dashboards URL and log in with your IAM access key and secret key for the user you specified under Principal.
  9. Navigate to dev tools within OpenSearch Dashboards and run the following query to retrieve the VPC flow logs for your VPC:
    GET <index-name>/_search
    {
      "query": {
        "match_all": {}
      }
    }

The query returns the data as shown in the following screenshot, which contains information such as account ID, interface ID, source IP address, destination IP address, and more.

Create dashboards

After the data is flowing into OpenSearch Serverless, you can easily create dashboards to monitor the activity in your VPC. The following example dashboard shows overall traffic, accepted and rejected traffic, bytes transmitted, and some charts with the top sources and destinations.

Clean up

If you don’t want to continue using the solution, be sure to delete the resources you created:

  1. Return to the AWS console and in the VPCs section, disable the flow logs for your VPC.
  2. In the OpenSearch Serverless dashboard, delete your vpc-flow-logs collection.
  3. On the Kinesis Data Firehose console, delete your vpc-flow-logs delivery stream.

Conclusion

In this post, you created an end-to-end serverless pipeline to deliver your VPC flow logs to OpenSearch Serverless using Kinesis Data Firehose. In this example, you built a delivery pipeline for your VPC flow logs, but you can also use Kinesis Data Firehose to send logs from Amazon Kinesis Data Streams and Amazon CloudWatch, which in turn can be sent to OpenSearch Serverless collections for running analytics on those logs. With serverless solutions on AWS, you can focus on your application development rather than worrying about the ingestion pipeline and tools to visualize your logs.

Get hands-on with OpenSearch Serverless by taking the Getting Started with Amazon OpenSearch Serverless workshop and check out other pipelines for analyzing your logs.

If you have feedback about this post, share it in the comments section. If you have questions about this post, start a new thread on the Amazon OpenSearch Service forum or contact AWS Support.


About the authors

Jon Handler (@_searchgeek) is a Principal Solutions Architect at Amazon Web Services based in Palo Alto, CA. Jon works closely with the CloudSearch and Elasticsearch teams, providing help and guidance to a broad range of customers who have search workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included four years of coding a large-scale, eCommerce search engine.

Prashant Agrawal is a Sr. Search Specialist Solutions Architect with Amazon OpenSearch Service. He works closely with customers to help them migrate their workloads to the cloud and helps existing customers fine-tune their clusters to achieve better performance and save on cost. Before joining AWS, he helped various customers use OpenSearch and Elasticsearch for their search and log analytics use cases. When not working, you can find him traveling and exploring new places. In short, he likes doing Eat → Travel → Repeat.

Automate deployment and version updates for Amazon Kinesis Data Analytics applications with AWS CodePipeline

Post Syndicated from Anand Shah original https://aws.amazon.com/blogs/big-data/automate-deployment-and-version-updates-for-amazon-kinesis-data-analytics-applications-with-aws-codepipeline/

Amazon Kinesis Data Analytics is the easiest way to transform and analyze streaming data in real time using Apache Flink. Customers are already using Kinesis Data Analytics to perform real-time analytics on fast-moving data generated from data sources like IoT sensors, change data capture (CDC) events, gaming, social media, and many others. Apache Flink is a popular open-source framework and distributed processing engine for stateful computations over unbounded and bounded data streams.

Although building Apache Flink applications is typically the responsibility of a data engineering team, automating the deployment and provisioning of infrastructure as code (IaC) is usually owned by the platform (or DevOps) team.

The following are typical responsibilities of the data engineering role:

  • Write code for real-time analytics Apache Flink applications
  • Roll out new application versions or roll them back (for example, in the case of a critical bug)

The following are typical responsibilities of the platform role:

  • Write code for IaC
  • Provision the required resources in the cloud and manage their access

In this post, we show how you can automate deployment and version updates for Kinesis Data Analytics applications and allow both platform and data engineering teams to collaborate effectively and co-own the final solution using AWS CodePipeline with the AWS Cloud Development Kit (AWS CDK).

Solution overview

To demonstrate the automated deployment and version update of a Kinesis Data Analytics application, we use the following example real-time data analytics architecture for this post.

Real-time data analytics architecture

The workflow includes the following steps:

  1. An AWS Lambda function (acting as data source) is the event producer pushing events on demand to Amazon Kinesis Data Streams when invoked.
  2. The Kinesis data stream receives and stores real-time events.
  3. The Kinesis Data Analytics application reads events from the data stream and performs real-time analytics on it.

Generic architecture

You can refer to the following generic architecture to adapt this example to your preferred CI/CD tool (for example, Jenkins). The overall deployment process is divided into three high-level parts:

  1. Infrastructure CI/CD – This portion is highlighted in orange. The infrastructure CI/CD pipeline is responsible for deploying all the real-time streaming architecture components, including the Kinesis Data Analytics application and any connected resources typically deployed using AWS CloudFormation.
  2. ApplicationStack – This portion is highlighted in gray. The application stack is deployed by the infrastructure CI/CD component using AWS CloudFormation.
  3. Application CI/CD – This portion is highlighted in green. The application CI/CD pipeline updates the Kinesis Data Analytics application in three steps:
    1. The pipeline builds the Java or Python source code of the Kinesis Data Analytics application and produces the application as a binary file.
    2. The pipeline pushes the latest binary file to the Amazon Simple Storage Service (Amazon S3) artifact bucket after a successful build, because Kinesis Data Analytics application binary files are referenced from Amazon S3.
    3. The S3 bucket file put event triggers a Lambda function, which updates the version of the Kinesis Data Analytics application by deploying the latest binary.

The following diagram illustrates this workflow.

Workflow illustrated

CI/CD architecture with CodePipeline

In this post, we implement the generic architecture using CodePipeline. The following diagram illustrates our updated architecture.

Updated architecture illustrated

The final solution includes the following steps:

  1. The platform (DevOps) team and data engineering team push their source code to their respective code repositories.
  2. CodePipeline deploys the whole infrastructure as three stacks:
    1. InfraPipelineStack – Contains a pipeline to deploy the overall infrastructure.
    2. ApplicationPipelineStack – Contains a pipeline to build and deploy Kinesis Data Analytics application binaries. In this post, we build a Java source using the JavaBuildPipeline AWS CDK construct. You can use the PythonBuildPipeline AWS CDK construct to build a Python source.
    3. ApplicationStack – Contains real-time data analytics pipeline resources including Lambda (data source), Kinesis Data Streams (storage), and Kinesis Data Analytics (Apache Flink application).

Deploy resources using AWS CDK

The following GitHub repository contains the AWS CDK code to create all the necessary resources for the data pipeline. This removes opportunities for manual error, increases efficiency, and ensures consistent configurations over time. To deploy the resources, complete the following steps:

  1. Clone the GitHub repository to your local computer using the following command:
git clone https://github.com/aws-samples/automate-deployment-and-version-update-of-kda-application
  2. Download and install the latest Node.js.
  3. Run the following command to install the latest version of AWS CDK:
npm install -g aws-cdk
  4. Run cdk bootstrap to initialize the AWS CDK environment in your AWS account. Replace your AWS account ID and Region before running the following command.
cdk bootstrap aws://123456789012/us-east-1

To learn more about the bootstrapping process, refer to Bootstrapping.

Part 1: Data engineering and platform teams push source code to their code repositories

The data engineering and platform teams begin work in their respective code repositories, as illustrated in the following figure.


In this post, we use two folders instead of two GitHub repositories, which you can find under the root folder of the cloned repository:

  • kinesis-analytics-application – This folder contains example source code of the Kinesis Data Analytics application. This represents your Kinesis Data Analytics application source code developed by your data engineering team.
  • infrastructure-cdk – This folder contains example AWS CDK source code of the final solution used for provisioning all the required resources and CodePipeline. You can reuse this code for your Kinesis Data Analytics application deployment.

Application development teams usually store the application source code in Git repositories. For demonstration purposes, we use the source code as a .zip file downloaded from GitHub instead of connecting CodePipeline to the GitHub repository. You may want to connect your source repository directly with CodePipeline. To learn more about how to connect, refer to Create a connection to GitHub.

Part 2: The platform team deploys the application pipeline

The following figure illustrates the next step in the workflow.

Next step in the workflow illustrated

In this step, you deploy the first pipeline to build the Java source code from kinesis-analytics-application. Complete the following steps to deploy ApplicationPipelineStack:

  1. Open your terminal, bash, or command window depending on your OS.
  2. Switch the current path to the folder infrastructure-cdk.
  3. Run npm install to download all dependencies.
  4. Run cdk deploy ApplicationPipelineStack to deploy the application pipeline.

This process should take about 5 minutes to complete and deploys the following resources to your AWS account, highlighted in green in the preceding diagram:

  • CodePipeline, containing stages for AWS CodeBuild and AWS CodeDeploy
  • An S3 bucket to store binaries
  • A Lambda function to update the Kinesis Data Analytics application JAR after manual approval

Trigger an automatic build for the application pipeline

After the cdk deploy command is successful, complete the following steps to automatically run the pipeline:

  1. Download the source code .zip file.
  2. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  3. Choose the stack ApplicationPipelineStack.
  4. On the Outputs tab, choose the link for the key ArtifactBucketLink.

You’re redirected to the S3 artifact bucket.

  5. Choose Upload.
  6. Upload the source code .zip file you downloaded.

The first pipeline run (shown as Auto Build in the following diagram) starts automatically and takes about 5 minutes to reach the manual approval stage. The pipeline automatically downloads the source code from the artifact bucket, builds the Java project kinesis-analytics-application using Maven, and publishes the output binary JAR file back to the artifact bucket under the directory jars.


View the application pipeline run

Complete the following steps to view the application pipeline run:

  1. On the AWS CloudFormation console, navigate to the stack ApplicationPipelineStack.
  2. On the Outputs tab, choose the link for the key ApplicationCodePipelineLink.

You’re redirected to the pipeline details page. You can see a detailed view of the pipeline, including the state of each action in each stage and the state of the transitions.

Do not approve the build for the manual approval stage yet; this is done later.

Part 3: The platform team deploys the infrastructure pipeline

The application pipeline run publishes a JAR file named kinesis-analytics-application-final.jar to the artifact bucket. Next, we deploy the Kinesis Data Analytics architecture. Complete the following steps to deploy the example flow:

  1. Open a terminal, bash, or command window depending on your OS.
  2. Switch the current path to the folder infrastructure-cdk.
  3. Run cdk deploy InfraPipelineStack to deploy the infrastructure pipeline.

This process should take about 5 minutes to complete and deploys a pipeline containing stages for CodeBuild and CodeDeploy to your AWS account, as highlighted in green in the following diagram.


When the cdk deploy is complete, the infrastructure pipeline run starts automatically (shown as Auto Build 1 in the following diagram) and takes about 10 minutes to download the source code from the artifact bucket, build the AWS CDK project infrastructure-stack, and deploy ApplicationStack automatically to your AWS account. When the infrastructure pipeline run is complete, the following resources are deployed to your account (shown in green in the following diagram):

  • A CloudFormation template named app-ApplicationStack
  • A Lambda function acting as a data source
  • A Kinesis data stream acting as the stream storage
  • A Kinesis Data Analytics application with the first version of kinesis-analytics-application-final.jar

View the infrastructure pipeline run

Complete the following steps to view the infrastructure pipeline run:

  1. On the AWS CloudFormation console, navigate to the stack InfraPipelineStack.
  2. On the Outputs tab, choose the link for the key InfraCodePipelineLink.

You’re redirected to the pipeline details page. You can see a detailed view of the pipeline, including the state of each action in each stage and the state of the transitions.

Part 4: The data engineering team deploys the application

Now your account has everything in place for the data engineering team to work independently and roll out new versions of the Kinesis Data Analytics application. You can approve the respective application build from the application pipeline to deploy new versions of the application. The following diagram illustrates the full workflow.

Diagram illustrates the full workflow.

The build process starts automatically when it detects changes in the source code. You can test a version update by re-uploading the source code .zip file to the S3 artifact bucket. In a real-world use case, you update the main branch either via a pull request or by merging your changes, and this action triggers a new pipeline run automatically.

View the current application version

To view the current version of the Kinesis Data Analytics application, complete the following steps:

  1. On the AWS CloudFormation console, navigate to the stack InfraPipelineStack.
  2. On the Outputs tab, choose the link for the key KDAApplicationLink.

You’re redirected to the Kinesis Data Analytics application details page. You can find the current application version by looking at Version ID.

Find the current application version by looking at Version ID

Approve the application deployment

Complete the following steps to approve the deployment (or version update) of the Kinesis Data Analytics application:

  1. On the AWS CloudFormation console, navigate to the stack ApplicationPipelineStack.
  2. On the Outputs tab, choose the link for the key ApplicationCodePipelineLink.
  3. Choose Review from the pipeline approval stage.
  4. When prompted, choose Approve to provide approval (optionally adding any comments) for the Kinesis Data Analytics application deployment or version update.
  5. Repeat the steps mentioned earlier to view the current application version.

You should see the application version as defined in Version ID increased by one, as shown in the following screenshot.

Application version as defined in Version ID increased by one

Deploying a new version of the Kinesis Data Analytics application will cause a downtime of around 5 minutes because the Lambda function responsible for the version update makes the API call UpdateApplication, which restarts the application after updating the version. However, the application resumes stream processing where it left off after the restart.
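
For reference, the following is a minimal sketch of the kind of call such a version-update function makes (not the exact function from the sample repository); the application name, bucket ARN, and file key are placeholders:

import boto3

kda = boto3.client("kinesisanalyticsv2")

APPLICATION_NAME = "kda-application"  # Placeholder application name.


def update_application_jar(bucket_arn, file_key):
    # The current version ID is required so the update applies to the expected version.
    described = kda.describe_application(ApplicationName=APPLICATION_NAME)
    current_version = described["ApplicationDetail"]["ApplicationVersionId"]

    # Point the application at the newly built JAR in the artifact bucket.
    kda.update_application(
        ApplicationName=APPLICATION_NAME,
        CurrentApplicationVersionId=current_version,
        ApplicationConfigurationUpdate={
            "ApplicationCodeConfigurationUpdate": {
                "CodeContentTypeUpdate": "ZIPFILE",
                "CodeContentUpdate": {
                    "S3ContentLocationUpdate": {
                        "BucketARNUpdate": bucket_arn,
                        "FileKeyUpdate": file_key,
                    }
                },
            }
        },
    )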

Clean up

Complete the following steps to delete your resources and stop incurring costs:

  1. On the AWS CloudFormation console, select the stack InfraPipelineStack and choose Delete.
  2. Select the stack app-ApplicationStack and choose Delete.
  3. Select stack ApplicationPipelineStack and choose Delete.
  4. On the Amazon S3 console, select the bucket with the name starting with javaappCodePipeline and choose Empty.
  5. Enter permanently delete to confirm the choice.
  6. Select the bucket again and choose Delete.
  7. Confirm the action by entering the bucket name when prompted.
  8. Repeat these steps to delete the bucket with the name starting with infrapipelinestack-pipelineartifactsbucket.

Summary

This post demonstrated how to automate deployment and version updates for your Kinesis Data Analytics applications using CodePipeline and AWS CDK.

For more information, see Continuous integration and delivery (CI/CD) using CDK Pipelines and CodePipeline tutorials.


About the Author

Anand Shah is a Big Data Prototyping Solutions Architect at AWS. He works with AWS customers and their engineering teams to build prototypes using AWS analytics services and purpose-built databases. Anand helps customers solve the most challenging problems using the art of the possible technology. He enjoys beaches in his leisure time.

Build highly available streams with Amazon Kinesis Data Streams

Post Syndicated from Jeremy Ber original https://aws.amazon.com/blogs/big-data/build-highly-available-streams-with-amazon-kinesis-data-streams/

Many use cases are moving towards a real-time data strategy due to demand for real-time insights, low-latency response times, and the ability to adapt to the changing needs of end-users. For this type of workload, you can use Amazon Kinesis Data Streams to seamlessly provision, store, write, and read data in a streaming fashion. With Kinesis Data Streams, there are no servers to manage, and you can scale your stream to handle any additional throughput as it comes in.

Kinesis Data Streams offers 99.9% availability in a single AWS Region. For even higher availability, there are several strategies to explore within the streaming layer. This post compares and contrasts different strategies for creating a highly available Kinesis data stream in case of service interruptions, delays, or outages in the primary Region of operation.

Considerations for high availability

Before we dive into our example use case, there are several considerations to keep in mind when designing a highly available Kinesis Data Streams workload that relate to the business need for a particular pipeline:

  • Recovery Time Objective (RTO) is defined by the organization. RTO is the maximum acceptable delay between the interruption of service and restoration of service. This determines what is considered an acceptable time window when service is unavailable.
  • Recovery Point Objective (RPO) is defined by the organization. RPO is the maximum acceptable amount of time since the last data recovery point. This determines what is considered an acceptable loss of data between the last recovery point and the interruption of service.

In a general sense, the lower your values for RPO and RTO, the more expensive the overall solution becomes. This is because the solution needs to account for minimizing both data loss and service unavailability by having multiple instances of the service up and running in multiple Regions. This is why a big piece of high availability is the replication of data flowing through a workload. In our case, the data is replicated across Regions of Kinesis Data Streams. Conversely, the higher the RPO and RTO values are, the more complexity you introduce into your failover mechanism. This is due to the fact that the cost savings you realize by not standing up multiple instances across multiple Regions are offset by the orchestration needed to spin up these instances in the event of an outage.

In this post, we are only covering failover of a Kinesis data stream. In use cases where higher availability is required across the entire data pipeline, having failover architectures for every component (Amazon API Gateway, AWS Lambda, Amazon DynamoDB) is strongly encouraged.

The simplest approach to high availability is to start a new instance of producers, consumers, and data streams in a new Region upon service unavailability detection. The benefit here is primarily cost, but your RPO and RTO values will be higher as a result.

We cover the following strategies for highly available Kinesis Data Streams:

  • Warm standby – An architecture in which there is active replication of data from the Kinesis data stream in Region A to Region B. Consumers of the data stream are running in both Regions at all times. Recommended for use cases that can't withstand extended downtime past their replication lag.
  • Cold standby – Active replication of data from the data stream in Region A to Region B, but consumers of the data stream in Region B are spun up when an outage in Region A is detected. Recommended for use cases that can afford some downtime as infrastructure is spun up in the secondary Region. In this scenario, RPO will be similar to the warm standby strategy; however, RTO will increase.

For high availability purposes, these use cases need to replicate the data across Regions in a way that allows consumers and producers of the data stream to fail over quickly upon detection of a service unavailability and utilize the secondary Region’s stream. Let’s take an example architecture to further explain these DR strategies. We use API Gateway and Lambda to publish stock ticker information to a Kinesis data stream. The data is then retrieved by another Lambda consumer to save durably into DynamoDB for querying, alerting, and reporting. The following diagram illustrates this architecture.

the primary architecture for the post--showcasing data coming from a mobile phone to API Gateway, then AWS Lambda, then Kinesis Data Streams, Lambda again and finally publishing to a DynamoDB Table

We use this architecture with an example use case requiring the streaming workload to be highly available in the event of a Region outage. The customer can withstand an RTO of 15 minutes during an outage, because they refresh end-users’ dashboards on a 15-minute interval. The customer is sensitive to downtime and data loss, because their data will be used for historical auditing purposes, operational metrics, and dashboards for end-users. Downtime for this customer means that data isn’t able to be persisted in their database from their streaming layer, and therefore unavailable to any consuming application. For this use case, data can be retried up to 5 minutes from our Lambda function before failing over to the new Region. Consumers are considered unavailable when the stream is unavailable, and can scale up in the secondary Region to account for any backlog of events.

How might we approach making a Kinesis data stream highly available for this use case?

Warm standby pattern

The following architecture diagram illustrates the warm standby high availability pattern for Kinesis Data Streams.

warm standby pattern showcasing data being replicated between a kinesis data stream in one region to another

image showcasing the warm standby failover--where data from first lambda begins replicating to secondary region KDA

The warm standby architectural pattern involves running a Kinesis data stream both in the primary and secondary Region, along with consumers and downstream destinations of the primary Region’s streaming layer being replicated as well. Sources are configured to automatically fail over to the secondary Region in the case of service unavailability in the first Region. We dive into details of how to achieve this in the client failover section of this post. Data is replicated across Regions from the data stream in the primary Region to the secondary Region. This is done instead of having the sources publish to both Regions to avoid any consistency issues between the streams in the two Regions.

Although this architectural pattern gives very high availability, it's also the most expensive option because we're duplicating virtually the entire streaming layer across two Regions. For business use cases that can't withstand extended data loss or downtime, this may be the best option. From an RTO perspective, this architectural pattern ensures there will be no downtime. There is some nuance in the RPO metric in that it depends heavily on the replication lag. In the event of the primary stream becoming unavailable, whatever data hasn't yet been replicated may be unavailable in the secondary Region. This data won't be considered lost, but may be unavailable for consumption until the primary stream becomes available again. This method can also result in events being out of order.

For business needs that can’t tolerate this level of record unavailability, consider retaining data on the producer for the purposes of publishing to an available stream when available, or rewinding against the source for the producer if possible so that data stuck in the primary Region can be resent to the secondary stream upon failover. We cover this consideration in the client failover section of this post.

Cold standby pattern

The following architecture diagram illustrates the cold standby high availability pattern for Kinesis Data Streams.

active passive pattern for kinesis data streams

The cold standby architectural pattern involves running a data stream both in the primary and secondary Region, and spinning up the downstream resources like a stream consumer and destination for streams when a service interruption is detected—passive mode. Just like the warm standby pattern, sources are configured to automatically fail over to the secondary Region in the case of service unavailability in the first Region. Likewise, data is replicated across Regions from the data stream in the primary Region to the secondary Region.

The primary benefit this architectural pattern provides is cost efficiency. By not running consumers at all times, this effectively reduces your costs significantly compared to the warm standby pattern. However, this pattern may introduce some data unavailability for downstream systems while the secondary Region infrastructure is provisioned. Additionally, depending on replication lag, some records may be unavailable, as discussed in the warm standby pattern. It should be noted that depending on how long it takes to spin up resources, it may take some time for consumers to reprocess the data in the secondary Region, and latency can be introduced when failing over. Our implementation assumes a minimal replication lag and that downstream systems have the ability to reprocess a configurable amount of data to catch up to the tip of the stream. We discuss approaches to spinning these resources up in the client failover section, but one possible approach to this would be using an AWS CloudFormation template that spins these resources up on service unavailability detection.

For business needs that can tolerate some level of data unavailability and can accept interruptions while the new infrastructure in the secondary Region is spun up, this is an option to consider both from a cost perspective and an RPO/RTO perspective. The complexity of spinning up resources upon detecting service unavailability is offset by the lower cost of the overall solution.

Which pattern makes sense for our use case?

Let’s revisit the use case described earlier to identify which of the strategies best meets our needs. We can extract the pieces of information from the customer’s problem statement to identify that they need a high availability architecture that:

  • Can’t withstand extended amounts of data loss
  • Must resume operations within 15 minutes of service interruption identification

This criterion tells us that their RPO is close to zero, and their RTO is 15 minutes. From here, we can determine that the cold standby architecture with data replication provides us limited data loss, and the maximum downtime will be determined by the time it takes to provision consumers and downstream destinations in the secondary Region.

Let’s dive deeper into the implementation details of each of the core phases of high availability, including an implementation guide for our use case.

Launch AWS CloudFormation resources

If you want to follow along with our code samples, you can launch the following CloudFormation stack and follow the instructions in order to simulate the cold standby architecture referenced in this post.

Launch Stack

For purposes of the Kinesis Data Streams high availability setup demo, we use us-west-2 as the primary Region and us-east-2 as the failover Region. While deploying this solution in your own account, you can choose your own primary and failover Regions.

  1. Deploy the supplied CloudFormation template in failover Region us-east-2.

Make sure you specify us-east-2 as the value for the FailoverRegion parameter in the CloudFormation template.

  2. Deploy the supplied CloudFormation template in primary Region us-west-2.

Make sure you specify us-east-2 as the value for the FailoverRegion parameter in the CloudFormation template.

In steps 1 and 2, we deployed the following resources in the primary and failover Regions:

  1. KDS-HA-Stream – AWS::Kinesis::Stream (primary and failover Region)
  2. KDS-HA-ProducerLambda – AWS::Lambda::Function (primary Region)
  3. KDS-HA-ConsumerLambda – AWS::Lambda::Function (primary and failover Region)
  4. KDS-HA-ReplicationAgentLambda – AWS::Lambda::Function (primary Region)
  5. KDS-HA-FailoverLambda – AWS::Lambda::Function (primary Region)
  6. ticker-prices – AWS::DynamoDB::GlobalTable (primary and failover Region)

The KDS-HA-Stream Kinesis data stream is deployed in both Regions. An enhanced fan-out consumer of the KDS-HA-Stream stream KDS-HA-ReplicationAgentLambda in the primary Region is responsible for replicating messages to the data stream in the failover Region.

KDS-HA-ConsumerLambda is a Lambda function consuming messages out of the KDS-HA-Stream stream and persisting data into a DynamoDB table after preprocessing.

You can inspect the content of the ticker-prices DynamoDB table in the primary and failover Region. Note that the last_updated_region attribute shows us-west-2 as its value because it's the primary Region.

Replication

When deciding how to replicate data from a data stream in Region A to a data stream in Region B, there are several strategies that involve a consumer reading data off of the primary stream and sending that data cross-Region to the secondary data stream. This would act as a replicator service, responsible for copying the data between the two streams, maintaining a relatively low latency to replicate and ensuring data isn’t lost during this replication.

Because replication off of a shared throughput data stream could impact the flow of data in a production workload, we recommend using the enhanced fan-out feature of Kinesis Data Streams consumers to ensure replication doesn’t have an impact on consumption latency.

The replication strategy implemented in this post features asynchronous replication, meaning that the replication process doesn’t block any standard data flow in the primary stream. Synchronous replication would be a safer approach to guarantee replication and avoid data loss; however, this isn’t possible without a service-side implementation.

The following image shows a timeline of data flow for the cold standby architecture, with data being replicated as soon as it’s published.

Lambda replication

Lambda can treat a Kinesis data stream as an event source, which will funnel events from your data stream into a Lambda function. This Lambda function then receives and forwards these events across Regions to your data stream in a secondary Region. Lambda functions allow you to utilize best streaming practices such as retries of records that encounter errors, bisect on error functionality, and using the Lambda parallelization factor; using more instances of your Lambda function than you have available shards can help process records faster.

This Lambda function is at the crux of the architecture for high availability; it’s responsible solely for sending data across Regions, and it also has the best capability to monitor the replication progress. Important metrics to monitor for Lambda replication include IteratorAge, which indicates how old the last record in the batch was when it finished processing. A high IteratorAge value indicates that the Lambda function is falling behind and therefore is not keeping up with data ingestion for replication purposes. A high IteratorAge can lead to a higher RPO and the higher likelihood of data unavailability when a passive failover happens.
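
As one illustrative option, you can put a CloudWatch alarm on the replication function's IteratorAge metric so that a growing replication lag is surfaced early; the function name, thresholds, and periods below are assumptions you should tune for your workload:

import boto3

cloudwatch = boto3.client("cloudwatch")

# Illustrative alarm: alert if the replication Lambda falls more than 1 minute behind.
cloudwatch.put_metric_alarm(
    AlarmName="KDS-HA-ReplicationLagAlarm",
    Namespace="AWS/Lambda",
    MetricName="IteratorAge",
    Dimensions=[{"Name": "FunctionName", "Value": "KDS-HA-ReplicationAgentLambda"}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=5,
    Threshold=60000,  # Milliseconds of lag.
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)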

We use the following sample Lambda function in our CloudFormation template to replicate data across Regions:

import json
import boto3
import os
import base64


def lambda_handler(event, context):
    # Forward records to the data stream of the same name in the failover Region.
    client = boto3.client("kinesis", region_name=os.environ["FAILOVER_REGION"])
    records = []

    for record in event["Records"]:
        # Kinesis event source data is base64-encoded; decode it before re-publishing.
        records.append(
            {
                "PartitionKey": record["kinesis"]["partitionKey"],
                "Data": base64.b64decode(record["kinesis"]["data"]).decode("utf-8"),
            }
        )
    response = client.put_records(Records=records, StreamName="KDS-HA-Stream")
    # Raise on partial failures so Lambda retries the batch and no records are dropped.
    if response["FailedRecordCount"] > 0:
        print("Failed replicating data: " + json.dumps(response))
        raise Exception("Failed replicating data!")

The Lambda replicator in the CloudFormation template is configured to read from the data stream in the primary Region.

The following code contains the necessary AWS Identity and Access Management (IAM) permissions for Lambda, allowing the Lambda service to assume this role and granting the Kinesis actions required to consume from the primary stream and replicate to the failover stream. Following the principle of least privilege, it's recommended to restrict these permissions to only the necessary streams in a production environment.

      AssumeRolePolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - lambda.amazonaws.com
            Action:
              - 'sts:AssumeRole'
      Path: /
      Policies:
        - PolicyName: root
          PolicyDocument:
            Version: 2012-10-17
            Statement:
              - Effect: Allow
                Action:
                  - 'kinesis:DescribeStream'
                  - 'kinesis:DescribeStreamSummary'
                  - 'kinesis:GetRecords'
                  - 'kinesis:GetShardIterator'
                  - 'kinesis:ListShards'
                  - 'kinesis:ListStreams'
                  - 'kinesis:SubscribeToShard'
                  - 'kinesis:PutRecords'
                Resource:
                  - 'arn:aws:kinesis:*:*:stream/KDS-HA-Stream'
                  - 'arn:aws:kinesis:*:*:stream/KDS-HA-Stream/consumer/KDS-HA-Stream-EFO-Consumer:*'
      ManagedPolicyArns:
        - 'arn:aws:iam::aws:policy/CloudWatchLogsFullAccess'
 

Health check

A generalized strategy for determining when to consider our data stream unavailable involves the use of Amazon CloudWatch metrics. We use the metrics coming off of our Lambda producer and consumer in order to assess the availability of our data stream. While producing to a data stream, an error might appear as one of the following responses back from the data stream: PutRecord or PutRecords returns an AmazonKinesisException 500 or AmazonKinesisException 503 error. When consuming from a data stream, an error might appear as one of the following responses back from the data stream: SubscribeToShard or GetRecords returns an AmazonKinesisException 500 or AmazonKinesisException 503.

We can calculate our effective error rate based on PutRecord.Success and GetRecords.Success. An average error rate of 1% or higher over a time window of 5 minutes, for example, could indicate that there is an issue with the data stream, and we may want to fail over. In our CloudFormation template, this error rate threshold as well as the time window are configurable, but by default we check for an error rate of 1% in the last 5 minutes to trigger a failover of our clients.
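
The following is a hedged sketch of such a health check: it averages the stream's PutRecord.Success metric over the last 5 minutes and reports the stream as unhealthy when the derived error rate reaches the threshold (the metric choice, names, and thresholds are assumptions to adapt to your producers):

from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")


def stream_is_healthy(stream_name="KDS-HA-Stream", threshold=0.01, window_minutes=5):
    now = datetime.utcnow()
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/Kinesis",
        MetricName="PutRecord.Success",
        Dimensions=[{"Name": "StreamName", "Value": stream_name}],
        StartTime=now - timedelta(minutes=window_minutes),
        EndTime=now,
        Period=60,
        Statistics=["Average"],
    )
    datapoints = stats["Datapoints"]
    if not datapoints:
        return True  # No producer traffic in the window; treat the stream as healthy.
    # PutRecord.Success averages 1.0 when every call succeeds, so 1 - average approximates the error rate.
    error_rate = 1 - (sum(dp["Average"] for dp in datapoints) / len(datapoints))
    return error_rate < threshold

You could run a check like this on a schedule and invoke your failover logic whenever it returns False.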

Client failover

When a data stream is deemed to be unreachable, we must now take action to keep our system available and reachable for clients on both ends of the interaction. This means for producers following the cold standby high availability architecture, we change the destination stream where the producer was writing. If high availability and failover of data producers and consumers isn’t a requirement of a given use case, a different architecture would be a better fit.

Prior to failover, the producer may have been delivering data to a stream in Region A, but we now automatically update the destination to be the stream in Region B. For different clients, the methodology of updating the producer will be different, but for ours, we store the active destination for producers in the Lambda environment variables from AWS CloudFormation and update our Lambda functions dynamically on health check failure scenarios.
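
A minimal sketch of that producer switch, mirroring the failback CLI command shown later in this post and assuming the same environment variable names, could look like the following:

import boto3

lambda_client = boto3.client("lambda", region_name="us-west-2")

# Point the producer at the failover Region's stream; variable names match the demo stack.
lambda_client.update_function_configuration(
    FunctionName="KDS-HA-ProducerLambda",
    Environment={
        "Variables": {
            "INPUT_STREAM": "KDS-HA-Stream",
            "PRODUCING_TO_REGION": "us-east-2",
        }
    },
)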

For our use case, we use the maximum consumer lag time (iteratorAge) plus some buffer to influence the starting position of the failover consumer. This allows us to ensure that the consumer in the secondary Region doesn't skip records that haven't been processed in the originating Region, but some data overlap may occur. Note that some duplicates in the downstream system may be introduced, and having an idempotent sink or some method of handling duplicates must be implemented in order to avoid duplicate-related issues.

In the case where data is successfully written to a data stream but is unable to be consumed from the stream, the data will not be replicated and therefore be unavailable in the second Region. The data will be durably stored in the primary data stream until it comes back online and can be read from. Note that if the stream is unavailable for a longer period of time than your total data retention period on the data stream, this data will be lost. Data retention for Kinesis Data Streams can be retrospectively increased up to 1 year.

For consumers in a cold standby architecture, upon failure detection, the consumer will be disabled or shut down, and the same consumer instance will be spun up in the secondary Region to consume from the secondary data stream. On the consumer side, we assume that the consumer application is stateless in our provided solution. If your application requires state, you can migrate or preload the application state via Amazon Simple Storage Service (Amazon S3) or a database. For a stateless application, the most important aspect of failover is the starting position.

In the following timeline, we can see that at some point, the stream in Region A was deemed unreachable.

The consumer application in Region A was reading data at time t10, and when it fails over to the secondary Region (B), it reads starting at t5 (5 minutes before the current iteratorAgeMilliseconds). This ensures that data isn’t skipped by the consumer application. Keep in mind that there may be some overlap in records in the downstream destinations.
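
The following sketch shows one way to compute that starting position and attach the consumer in the failover Region using an AT_TIMESTAMP event source mapping; the observed iterator age, the 5-minute buffer, and the account ID are assumptions:

from datetime import datetime, timedelta

import boto3

lambda_client = boto3.client("lambda", region_name="us-east-2")

# Assumed values: replication lag observed in the primary Region plus a safety buffer.
iterator_age_ms = 120000
buffer = timedelta(minutes=5)
starting_position = datetime.utcnow() - timedelta(milliseconds=iterator_age_ms) - buffer

# Start the failover consumer far enough back that no unprocessed records are skipped.
lambda_client.create_event_source_mapping(
    FunctionName="KDS-HA-ConsumerLambda",
    EventSourceArn="arn:aws:kinesis:us-east-2:123456789012:stream/KDS-HA-Stream",
    BatchSize=100,
    StartingPosition="AT_TIMESTAMP",
    StartingPositionTimestamp=starting_position,
)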

In the provided cold standby AWS CloudFormation example, we can manually trigger a failover with the AWS Command Line Interface (AWS CLI). In the following code, we manually fail over to us-east-2:

aws lambda invoke --function-name KDS-HA-FailoverLambda --cli-binary-format raw-in-base64-out --payload '{}' response.json --region us-west-2

After a few minutes, you can inspect the content of the ticker-prices DynamoDB table in the primary and failover Region. Note that the last_updated_region attribute shows us-east-2 as its value because it’s failed over to the us-east-2 Region.

Failback

After an outage or service unavailability is deemed to be resolved, the next logical step is to reorient your clients back to their original operating Regions. Although it may be tempting to automate this procedure, a manual failback approach during off-business hours when minimal production disruption will take place makes more sense.

In the following images, we can visualize the timeline with which consumer applications are failed back to the original Region.

The producer switches back to the original Region, and we wait for the consumer in Region B to reach 0 lag. At this point, the consumer application in Region B is disabled, and replication to Region B is resumed. We have now returned to our normal state of processing messages as shown in the replication section of this post.

In our AWS CloudFormation setup, we perform a failback with the following steps:

  1. Re-enable the event source mapping and start consuming messages from the primary Region at the latest position:
aws lambda create-event-source-mapping --function-name KDS-HA-ConsumerLambda --batch-size 100 --event-source-arn arn:aws:kinesis:us-west-2:{{accountId}}:stream/KDS-HA-Stream --starting-position LATEST --region us-west-2
  2. Switch the producer back to the primary Region:
aws lambda update-function-configuration --function-name KDS-HA-ProducerLambda --environment "Variables={INPUT_STREAM=KDS-HA-Stream,PRODUCING_TO_REGION=us-west-2}" --region us-west-2
  1. In the failover Region (us-east-2), wait for your data stream’s GetRecords max iterator age (in milliseconds) CloudWatch metric to report 0 as a value. We’re waiting for the consumer Lambda function to catch up with all produced messages.
  2. Stop consuming messages from the failover Region.
  3. Run the following AWS CLI command and grab the UUID from the response, which we use to delete the existing event source mapping. Make sure you’re picking event source mapping for the Lambda function KDS-HA-ConsumerLambda.
aws lambda list-event-source-mappings --region us-east-2
aws lambda delete-event-source-mapping --uuid {{UUID}} --region us-east-2
  6. Restart the replication agent in the primary Region.
  7. Run the following AWS CLI command, and capture ConsumerARN from the response:
aws kinesis list-stream-consumers --stream-arn arn:aws:kinesis:us-west-2:{{accountId}}:stream/KDS-HA-Stream --region us-west-2
aws lambda create-event-source-mapping --function-name KDS-HA-ReplicationAgentLambda --batch-size 100 --event-source-arn {{ConsumerARN}} --starting-position LATEST --region us-west-2

When this is complete, you can observe the same data stream metrics—the number of records in and out per second, consumer lag metrics, and number of errors as described in the health check section of this post—to ensure that each of the components has resumed processing data in the original Region. We can also take note of the data landing in DynamoDB, which displays which Region data is being updated from in order to determine the success of our failback procedure.

We recommend for any streaming workload that can’t withstand extended data loss or downtime to implement some form of cross-Region high availability in the unlikely event of service unavailability. These recommendations can help you determine which pattern is right for your use case.

Clean up

To avoid incurring future charges, complete the following steps:

  1. Delete the CloudFormation stack from primary Region us-west-2.
  2. Delete the CloudFormation stack from failover Region us-east-2.
  3. List all event source mappings in primary Region us-west-2 using the aws lambda list-event-source-mappings --region us-west-2 command and note the UUIDs of the event source mappings tied to the KDS-HA-ConsumerLambda and KDS-HA-ReplicationAgentLambda Lambda functions.
  4. Delete event source mappings in primary Region us-west-2 tied to the KDS-HA-ConsumerLambda and KDS-HA-ReplicationAgentLambda Lambda functions using the aws lambda delete-event-source-mapping --uuid {{UUID}} --region us-west-2 command and UUIDs noted in the previous step.

Conclusion

Building highly available Kinesis data streams across multiple Regions is multi-faceted, and all aspects of your RPO, RTO, and operational costs need to be carefully considered. The code and architecture discussed in this post is one of many different architectural patterns you can choose for your workloads, so make sure to choose the appropriate architecture based on the criteria for your specific requirements.

To learn more about Kinesis Data Streams, we have a getting started guide as well as a workshop to walk through all the integrations with Kinesis Data Streams. You can also contact your AWS Solutions Architects, who can be of assistance alongside your high availability journey.


About the Authors

Jeremy Ber has been working in the telemetry data space for the past 7 years as a Software Engineer, Machine Learning Engineer, and most recently a Data Engineer. In the past, Jeremy has supported and built systems that stream in terabytes of data per day, and process complex machine learning algorithms in real time. At AWS, he is a Senior Streaming Specialist Solutions Architect supporting both Amazon MSK and Amazon Kinesis.

Pratik Patel is a Sr Technical Account Manager and streaming analytics specialist. He works with AWS customers and provides ongoing support and technical guidance to help plan and build solutions using best practices, and proactively helps keep customer’s AWS environments operationally healthy.