Tag Archives: Amazon Managed Streaming for Apache Kafka (Amazon MSK)

Deploy real-time analytics with StarTree for managed Apache Pinot on AWS

2025-03-13 Raj Ramasubbu

Post Syndicated from Raj Ramasubbu original https://aws.amazon.com/blogs/big-data/deploy-real-time-analytics-with-startree-for-managed-apache-pinot-on-aws/

This post is cowritten with Mayank Shrivastava and Barkha Herman from StarTree.

Building a low-latency, high-concurrency, real-time online analytical processing (OLAP) solution has been previously explored on the AWS Big Data Blog, where we walked through how to build a real-time analytics solution with Apache Pinot on AWS, in which streaming sources, such as Amazon Managed Streaming for Apache Kafka (Amazon MSK) and Amazon Kinesis Data Streams, produce events that are ingested and processed in real time within Apache Pinot.

However, this approach requires self-management of the infrastructure required to run Pinot, as well as a number of manual processes to run in production. StarTree is a managed alternative that offers similar benefits for real-time analytics use cases.

In this post, we introduce StarTree as a managed solution on AWS for teams seeking the advantages of Pinot. We highlight the key distinctions between open-source Pinot and StarTree, and provide valuable insights for organizations considering a more streamlined approach to their real-time analytics infrastructure.

By examining these aspects, you can make an informed decision between open source Pinot and StarTree for your specific real-time analytics needs.

StarTree overview

One of the founders of Apache Pinot, Kishore Gopalakrishna, launched StarTree to equip organizations globally with the power of real-time data and build a fully managed platform for real-time analytics. Handling over 1 billion queries per week and ingesting over 1 million events per second, StarTree Cloud removes the burden of infrastructure management so companies can focus on delivering real-time insights to end-users.

Open source Pinot requires in-house expertise that can challenge well-established technical teams to provision hardware, configure environments, tune performance, maintain security, adhere to data governance requirements, manage software updates, and constantly monitor for system issues. Organizations interested in decreasing their time to value with a managed Pinot solution can take advantage of the expertise of StarTree’s team to accelerate setup, deploy an architecture ready for scale, and offload infrastructure maintenance.

Improving security with SOC 2, SSO, and RBAC

Critical enterprise security features can be challenging to implement in open source Pinot environments. With StarTree’s managed Pinot, role-based access control (RBAC) simplifies administration for Pinot and allows organizations to assign and monitor user access based on roles to enforce secure and efficient access to sensitive data. StarTree Cloud provides enterprise-grade security with SOC 2 compliance, enhanced encryption, and single sign-on (SSO) capabilities.

Using automated data ingestion at scale

The minion task framework is a native component of Pinot to offload computationally intensive tasks away from the other Pinot components to conserve resources for low-latency queries and support real-time stream ingestion. StarTree can handle larger volumes of data efficiently with highly scalable implementations of minion tasks and a minion auto scaling feature that eliminates unnecessary infrastructure costs during idle times, as seen in the below figure.

StarTree’s automatic data ingestion framework is ideal for enterprise workloads because it improves scalability and reduces the data maintenance complexity often found in open source Pinot deployments. StarTree supports a large number of managed connectors, which are used to maintain metadata about the source and ingest data seamlessly into the platform. The data is then modelled to help you organize and structure the data fetched from the selected data source into Pinot tables. Indexes are then configured to optimize query performance, as per the flow in the diagram below.

Tiered storage for real-time query processing

With open source Pinot, tiered storage can be used for deep storage like Amazon Simple Storage Service (Amazon S3) for backup but not query processing, because storage is tightly coupled with compute and requires manual configuration of tenants with different storage speeds and server specifications. In the following diagram, an Amazon S3 tier is defined for the data to be moved from tightly coupled SSD to cloud storage when the data is 30 days old.

On the other hand, StarTree transitions less-frequently accessed data to cost-effective storage like Amazon S3, while maintaining quick access to frequently accessed data. StarTree’s tiered storage enables automation for real-time query processing with index pinning, prefetching, and intelligent data movement between hot and cold storage, optimizing both performance and cost. StarTree’s sophisticated approach to tiered storage is highly flexible and reduces replication overhead by keeping a single copy in cloud storage, which prevents the limitations of compressed deep store copies, as you can see in the below diagram

Improving scalability with off-heap upserts

Companies like Amberdata benefit from StarTree’s upsert support to routinely upsert 350,000 events per second, with peak workloads reaching 1 million upserts per second. StarTree Cloud enhanced upsert functionality boosts efficiency, usability, and scalability through the implementation of off-heap upserts. Behind the scenes, Pinot servers manage specific upsert metadata to determine if a newly inserted record’s primary key was previously encountered and identifies the current segment holding it. As shown below, StarTree Cloud moves this off-heap, enabling a scalable cache of metadata as the on-heap memory restrictions are removed

Customer success stories using Pinot with StarTree for real-time analytics

The following customers highlight their success using Pinot for StarTree:

Sovrn provides down-to-the-second, real-time data for their customers with StarTree’s managed Pinot as an adtech solution provider for web publishers, down from what was previously a 24- to 48-hour turnaround time for producing reports.
Amberdata, a blockchain and crypto market intelligence company, uses StarTree for real-time analytics to improve query performance, reduce SLA times, and lower infrastructure costs. Joanes Espanol, CTO and Co-Founder of Amberdata, shared about their experience with StarTree’s managed Pinot, “We are now in the subseconds to milliseconds range, and the higher query concurrency means we can serve more customers faster. We’ve been able to reduce our infrastructure costs and reduce our dependencies on older technologies.”
Nubank identifies anomalies across massive datasets instantly with StarTree to power observability and anomaly detection in their customer-facing applications, enabling real-time monitoring and customer insights at scale.

Flexible deployment options for StarTree Cloud

StarTree offers multiple deployment options, including a StarTree hosted software as a service (SaaS) or customer hosted SaaS. StarTree hosted SaaS is ideal for organizations interested in fully offloading the operational burden of infrastructure management, scaling, performance tuning, and security from their team so they can focus on analytics. StarTree’s customer hosted SaaS provides flexibility for customers interested in deploying the solution within their AWS environment or other platform of choice. This is suitable for organizations who require higher infrastructure management controls in their perimeter but still want the operational ease of a managed service.

Self-managed Pinot or StarTree

Pinot can deliver value for real-time analytics scenarios with different deployment methods. The choice of deployment method will come down to organizational priorities and trade-offs. Teams with the capability and willingness to manage open source software on a commodity infrastructure at scale might opt to deploy self-managed Pinot on AWS. Teams interested in reducing time troubleshooting performance bottlenecks, optimizing resource usage, and minimizing downtime can use StarTree’s managed service.

Conclusion

In this post, we presented StarTree as a managed solution on AWS for teams seeking the advantages of Apache Pinot. Like Pinot, StarTree addresses the need for a low-latency, high-concurrency, real-time online analytical processing (OLAP) solution. In addition, StarTree offers a managed experience for real-time and batch Pinot workloads, offering enhanced security, automated data ingestion, tiered storage, and off-heap upserts. These features improve security, scalability, and manageablity for organizations looking to run Pinot in production.

Developers interested in learning more about managed Pinot can deploy real-time analytics with StarTree to test it out or join a session with StarTree’s head of product. StarTree is an AWS ISVA partner and is available on AWS Marketplace.

About the Authors

Raj Ramasubbu is a Senior Analytics Specialist Solutions Architect focused on big data and analytics and AI/ML with Amazon Web Services. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS. Raj provided technical expertise and leadership in building data engineering, big data analytics, business intelligence, and data science solutions for over 18 years prior to joining AWS. He helped customers in various industry verticals like healthcare, medical devices, life science, retail, asset management, car insurance, residential REIT, agriculture, title insurance, supply chain, document management, and real estate.

Francisco Morillo is a Streaming Solutions Architect at AWS. Francisco works with AWS customers, helping them design real-time analytics architectures using AWS services, supporting Amazon Managed Streaming for Apache Kafka (Amazon MSK) and Amazon Managed Service for Apache Flink.

Ismail Makhlouf is a Senior Specialist Solutions Architect for Data Analytics at AWS. Ismail focuses on architecting solutions for organizations across their end-to-end data analytics estate, including batch and real-time streaming, big data, data warehousing, and data lake workloads. He primarily partners with airlines, manufacturers, and retail organizations to support them to achieve their business objectives with well-architected data platforms.

Renee Berry is a Senior Partner Development Manager with the AWS Global Startup Program, working with venture backed startups partnering with AWS to scale their growth.

Mayank Shrivastava is a founding engineer of Apache Pinot and a PMC member for the project. He is currently a Fellow at StarTree Inc., where he also heads their Center of Excellence.

Barkha Herman is a technologist and developer advocate who founded WiTVoices and South Florida Women in Tech. She fosters inclusive tech communities.

Express brokers for Amazon MSK: Turbo-charged Kafka scaling with up to 20 times faster performance

2025-03-07 Masudur Rahaman Sayem

Post Syndicated from Masudur Rahaman Sayem original https://aws.amazon.com/blogs/big-data/express-brokers-for-amazon-msk-turbo-charged-kafka-scaling-with-up-to-20-times-faster-performance/

Managing and scaling data streams efficiently is a cornerstone of success for many organizations. Apache Kafka has emerged as a leading platform for real-time data streaming, offering unmatched scalability and reliability. However, setting up and scaling Kafka clusters can be challenging, requiring significant time, expertise, and resources. This is where Amazon Managed Streaming for Apache Kafka (Amazon MSK) Express brokers come into play.

Express brokers are a new broker type in Amazon MSK that are designed to simplify Kafka deployment and scaling.

In this post, we walk you through the implementation of MSK Express brokers, highlighting their core features, benefits, and best practices for rapid Kafka scaling.

Key features of MSK Express brokers

MSK Express brokers revolutionize Kafka cluster management by delivering exceptional performance and operational simplicity. With up to three times more throughput per broker, Express brokers can sustainably handle an impressive 500 MBps ingress and 1000 MBps egress on m7g.16xl instances, setting new standards for data streaming performance.

Their standout feature is their fast scaling capability—up to 20 times faster than standard Kafka brokers—allowing rapid cluster expansion within minutes. This is complemented by 90% faster recovery from failures and built-in three-way replication, providing robust reliability for mission-critical applications.

Express brokers eliminate traditional storage management responsibility by offering unlimited storage without pre-provisioning, while simplifying operations through preconfigured best practices and automated cluster management. With full compatibility with existing Kafka APIs and comprehensive monitoring through Amazon CloudWatch and Prometheus, MSK Express brokers provide an ideal solution for organizations seeking a highly-performant and low-maintenance data streaming infrastructure.

Comparison with traditional Kafka deployment

Although Kafka provides robust fault-tolerance mechanisms, its traditional architecture, where brokers store data locally on attached storage volumes, can lead to several issues impacting the availability and resiliency of the cluster. The following diagram compares the deployment architecture.

Comparison with traditional Kafka deployment

The traditional architecture comes with the following limitations:

Extended recovery times – When a broker fails, recovery requires copying data from surviving replicas to the newly assigned broker. This replication process can be time-consuming, particularly for high-throughput workloads or in cases where recovery requires a new volume, resulting in extended recovery periods and reduced system availability.
Suboptimal load distribution – Kafka achieves load balancing by redistributing partitions across brokers. However, this rebalancing operation can strain system resources and take considerable time due to the volume of data that must be transferred between nodes.
Complex scaling operations – Expanding a Kafka cluster requires adding brokers and redistributing existing partitions across the new nodes. For large clusters with substantial data volumes, this scaling operation can impact performance and require significant time to complete.

MSK Express brokers offers fully managed and highly available Regional Kafka storage. This significantly decouples compute and storage resources, addressing the aforementioned challenges and improving the availability and resiliency of Kafka clusters. The benefits include:

Faster and more reliable broker recovery – When Express brokers recover, they do so in up to 90% less time than standard brokers and place negligible strain on the clusters’ resources, which makes recovery faster and more reliable.
Efficient load balancing – Load balancing in MSK Express brokers is faster and less resource-intensive, enabling more frequent and seamless load balancing operations.
Faster scaling – MSK Express brokers enable efficient cluster scaling through rapid broker addition, minimizing data transfer overhead and partition rebalancing time. New brokers become operational quickly due to accelerated catch-up processes, resulting in faster throughput improvements and minimal disruption during scaling operations.

Scaling use case example

Consider a use case requiring 300 MBps data ingestion on a Kafka topic. We implemented this using an MSK cluster with three m7g.4xlarge Express brokers. The configuration included a topic with 3,000 partitions and 24-hour data retention, with each broker initially managing 1,000 partitions.

To prepare for anticipated midday peak traffic, we needed to double the cluster capacity. This scenario highlights one of Express brokers’ key advantages: rapid, safe scaling without disrupting application traffic or requiring extensive advance planning. During this scenario, the cluster was actively handling approximately 300 MBps of ingestion. The following graph shows the total ingress on this cluster and the number of partitions it is holding across three brokers.

Scaling use case example

The scaling process involved two main steps:

Adding three additional brokers to the cluster, which completed in approximately 18 minutes
Using Cruise Control to redistribute the 3,000 partitions evenly across all six brokers, which took about 10 minutes

Scaling use case example

As shown in the following graph, the scaling operation completed smoothly, with partition rebalancing occurring rapidly across all six brokers while maintaining uninterrupted producer traffic.

Scaling use case example

Notably, throughout the entire process, we observed no disruption to producer traffic. The entire operation to double the cluster’s capacity was completed in just 28 minutes, demonstrating MSK Express brokers’ ability to scale efficiently with minimal impact on ongoing operations.

Best practices

Consider the following guidelines to adopt MSK Express brokers:

When implementing new streaming workloads on Kafka, select MSK Express brokers as your default option. If uncertain about your workload requirements, begin with express.m7g.large instances.
Use the Amazon MSK sizing tool to calculate optimal broker count and type for your workload. Although this provides a good baseline, always validate through load testing that simulates your real-world usage patterns.
Review and implement MSK Express broker best practices.
Choose larger instance types for high-throughput workloads. A smaller number of large instances is preferable to many smaller instances, because fewer total brokers can simplify cluster management operations and reduce operational overhead.

Conclusion

MSK Express brokers represent a significant advancement in Kafka deployment and management, offering a compelling solution for organizations seeking to modernize their data streaming infrastructure. Through its innovative architecture that decouples compute and storage, MSK Express brokers deliver simplified operations, superior performance, and rapid scaling capabilities.

The key advantages demonstrated throughout this post—including 3 times higher throughput, 20 times faster scaling, and 90% faster recovery times—make MSK Express brokers an attractive option for both new Kafka implementations and migrations from traditional deployments.

As organizations continue to face growing demands for real-time data processing, MSK Express brokers provide a future-proof solution that combines the reliability of Kafka with the operational simplicity of a fully managed service.

To get started, refer to Amazon MSK Express brokers.

About the Author

Masudur Rahaman Sayem is a Streaming Data Architect at AWS with over 25 years of experience in the IT industry. He collaborates with AWS customers worldwide to architect and implement sophisticated data streaming solutions that address complex business challenges. As an expert in distributed computing, Sayem specializes in designing large-scale distributed systems architecture for maximum performance and scalability. He has a keen interest and passion for distributed architecture, which he applies to designing enterprise-grade solutions at internet scale.

Governing streaming data in Amazon DataZone with the Data Solutions Framework on AWS

2025-02-27 Vincent Gromakowski

Post Syndicated from Vincent Gromakowski original https://aws.amazon.com/blogs/big-data/governing-streaming-data-in-amazon-datazone-with-the-data-solutions-framework-on-aws/

Effective data governance has long been a critical priority for organizations seeking to maximize the value of their data assets. It encompasses the processes, policies, and practices an organization uses to manage its data resources. The key goals of data governance are to make data discoverable and usable by those who need it, accurate and consistent, secure and protected from unauthorized access or misuse, and compliant with relevant regulations and standards. Data governance involves establishing clear ownership and accountability for data, including defining roles, responsibilities, and decision-making authority related to data management.

Traditionally, data governance frameworks have been designed to manage data at rest—the structured and unstructured information stored in databases, data warehouses, and data lakes. Amazon DataZone is a data governance and catalog service from Amazon Web Services (AWS) that allows organizations to centrally discover, control, and evolve schemas for data at rest including AWS Glue tables on Amazon Simple Storage Service (Amazon S3), Amazon Redshift tables, and Amazon SageMaker models.

However, the rise of real-time data streams and streaming data applications impacts data governance, necessitating changes to existing frameworks and practices to effectively manage the new data dynamics. Governing these rapid, decentralized data streams presents a new set of challenges that extend beyond the capabilities of many conventional data governance approaches. Factors such as the ephemeral nature of streaming data, the need for real-time responsiveness, and the technical complexity of distributed data sources require a reimagining of how we think about data oversight and control.

In this post, we explore how AWS customers can extend Amazon DataZone to support streaming data such as Amazon Managed Streaming for Apache Kafka (Amazon MSK) topics. Developers and DevOps managers can use Amazon MSK, a popular streaming data service, to run Kafka applications and Kafka Connect connectors on AWS without becoming experts in operating it. We explain how they can use Amazon DataZone custom asset types and custom authorizers to: 1) catalog Amazon MSK topics, 2) provide useful metadata such as schema and lineage, and 3) securely share Amazon MSK topics across the organization. To accelerate the implementation of Amazon MSK governance in Amazon DataZone, we use the Data Solutions Framework on AWS (DSF), an opinionated open source framework that we announced earlier this year. DSF relies on AWS Cloud Development Kit (AWS CDK) and provides several AWS CDK L3 constructs that accelerate building data solutions on AWS, including streaming governance.

High-level approach for governing streaming data in Amazon DataZone

To anchor the discussion on supporting streaming data in Amazon DataZone, we use Amazon MSK as an integration example, but the approach and the architectural patterns remain the same for other streaming services (such as Amazon Kinesis Data Streams). At a high level, to integrate streaming data, you need the following capabilities:

A mechanism for the Kafka topic to be represented in the Amazon DataZone catalog for discoverability (including the schema of the data flowing inside the topic), tracking of lineage and other metadata, and for consumers to request access against.
A mechanism to handle the custom authorization flow when a consumer triggers the subscription grant to an environment. This flow consists of the following high-level steps:
- Collect metadata of target Amazon MSK cluster or topic that’s being subscribed to by the consumer
- Update the producer Amazon MSK cluster’s resource policy to allow access from the consumer role
- Provide Kafka topic level AWS Identity and Access Management (IAM) permission to the consumer roles (more on this later) so that it has access to the target Amazon MSK cluster
- Finally, update the internal metadata of Amazon DataZone so that it’s aware of the existing subscription between producer and consumer

Amazon DataZone catalog

Before you can represent the Kafka topic as an entry in the Amazon DataZone catalog, you need to define:

A custom asset type that describes the metadata that’s needed to describe a Kafka topic. To describe the schema as part of the metadata, use the built-in form type amazon.datazone.RelationalTableFormType and create two more custom form types:

1. MskSourceReferenceFormType that contains the cluster_ARN and the cluster_type. The type is used to determine whether the Amazon MSK cluster is provisioned or serverless, given that there’s a different process to grant consume permissions.

1. KafkaSchemaFormType, which contains various metadata on the schema, including the kafka_topic, the schema_version, schema_arn, registry_arn, compatibility_mode (for example, backward-compatible or forward-compatible) and data_format (for example, Avro or JSON), which is helpful if you plan to integrate with the AWS Glue Schema registry.

After the custom asset type has been defined, you can now create an asset based on the custom asset type. The asset describes the schema, the Amazon MSK cluster, and the topic that you want to be made discoverable and accessible to consumers.

Data source for Amazon MSK clusters with AWS Glue Schema registry

In Amazon DataZone, you can create data sources for AWS Glue Data Catalog to import technical metadata of database tables from AWS Glue and have the assets registered in the Amazon DataZone project. For importing metadata related to Amazon MSK, you need to use a custom data source, which can be an AWS Lambda function, using the Amazon DataZone APIs.

We provide as part of the solution a custom Amazon MSK data source with the AWS Glue Schema registry, for automating the creation, update, and deletion of custom Amazon MSK assets. It uses AWS Lambda to extract schema definitions from a Schema registry and metadata from the Amazon MSK clusters and then creates or updates the corresponding assets in Amazon DataZone.

Before explaining how the data source works, you need to know that every custom asset in Amazon DataZone has a unique identifier. When the data source creates an asset, it stores the asset’s unique identifier in Parameter Store, a capability of AWS Systems Manager.

The steps for how the data source works are as follows:

The Amazon MSK AWS Glue Schema registry data source can be scheduled to be triggered on a given interval or by listening to AWS Glue Schema events such as Create, Update or Delete Schema. It can also be invoked manually through the AWS Lambda console.
When triggered, it retrieves all the existing unique identifiers from Parameter Store. These parameters serve as reference to identify if an Amazon MSK asset already exists in Amazon DataZone.
The function lists the Amazon MSK clusters and retrieves the Amazon Resource Name (ARN) for the given Amazon MSK name and additional metadata related to the Amazon MSK cluster type (serverless or provisioned). This metadata will be used later by the custom authorization flow.
Then the function lists all the schemas in the Schema registry for a given registry name. For each schema, it retrieves the latest version and schema definition. The schema definition is what will allow you to add schema information when creating the asset in Amazon DataZone.
For each schema retrieved in the Schema registry, the Lambda function checks if the assets already exist by looking into the Systems Manager parameters retrieved in the second step.
1. If the asset exists, the Lambda function updates the asset in Amazon DataZone, creating a new revision with the updated schema or forms.
2. If the asset doesn’t exist, the Lambda function creates the asset in Amazon DataZone and stores its unique identifier in Systems Manager for future reference.
If there are schemas registered in Parameter Store that are no longer in the Schema registry, the data source deletes the corresponding Amazon DataZone assets and removes the associated parameters from Systems Manager.

The Amazon MSK AWS Glue Schema registry data source for Amazon DataZone enables seamless registration of Kafka topics as custom assets in Amazon DataZone. It does require that the topics in the Amazon MSK cluster are using the Schema registry for schema management.

Custom authorization flow

For managed assets such as AWS Glue Data Catalog and Amazon Redshift assets, the process to grant access to the consumer is managed by Amazon DataZone. Custom asset types are considered unmanaged assets, and the process to grant access needs to be implemented outside of Amazon DataZone.

The high-level steps for the end-to-end flow are as follows:

(Conditional) If the consumer environment doesn’t have a subscription target, create it through the CreateSubscriptionTarget API call. The subscription target tells Amazon DataZone which environments are compatible with an asset type.
The consumer triggers a subscription request by subscribing to the relevant streaming data asset through the Amazon DataZone portal.
The producer receives the subscription request and approves (or denies) the request.
After the subscription request has been approved by the producer, the consumer can observe the streaming data asset in their project under the Subscribed data section.
The consumer can opt to trigger a subscription grant to a target environment directly from the Amazon DataZone portal, and this action triggers the custom authorization flow.

For steps 2–4, you rely on the default behavior of Amazon DataZone and no change is required. The focus of this section is then step 1 (subscription target) and step 5 (subscription grant process).

Subscription target

Amazon DataZone has a concept called environments within a project, which indicates where the resources are located and the related access configuration (for example, the IAM role) that is used to access those resources. To allow an environment to have access to the custom asset type, consumers have to use the Amazon DataZone CreateSubscriptionTarget API prior to the subscription grants. The creation of the subscription target is a one-time operation per custom asset type per environment. In addition, the authorizedPrincipals parameter inside the CreateSubscriptionTarget API lists the various IAM principals given access to the Amazon MSK topic as part of the grant authorization flow. Lastly, when calling CreateSubscriptionTarget, the underlying principle used to call the API must belong to the target environment’s AWS account ID.

After the subscription target has been created for a custom asset type and environment, the environment is eligible as a target for subscription grants.

Subscription grant process

Amazon DataZone emits events based on user actions, and you use this mechanism to trigger the custom authorization process when a subscription grant has been triggered for Amazon MSK topics. Specifically, you use the Subscription grant requested event. These are the steps of the authorization flow:

A Lambda function collects metadata on the following:
1. Producer Amazon MSK cluster or Kinesis data stream that the consumer is requesting access to. Metadata is collected using the GetListing API.
2. Metadata about the target environment using a call to GetEnvironment API.
3. Metadata about the subscription target using a call to GetSubscriptionTarget API to collect the consumer roles to grant.
4. In parallel, Amazon DataZone internal metadata about the status of the subscription grant needs to be updated, and this happens in this step. Depending on the type of action that’s being done (such as GRANT or REVOKE), the status of the subscription grant is updated respectively (for example, GRANT_IN_PROGRESS, REVOKE_IN_PROGRESS).

After the metadata has been collected, it’s passed downstream as part of the AWS Step Functions state.

Update the resource policy of the target resource (for example, Amazon MSK cluster or Kinesis data stream) in the producer account. The update allows authorized principals from the consumer to access or read the target resource. Example of the policy is as follows:

{
    "Effect": "Allow",
    "Principal": {
        "AWS": [
            "<CONSUMER_ROLES_ARN>"
        ]
    },
    "Action": [
        'kafka-cluster:Connect',
        'kafka-cluster:DescribeTopic',
        'kafka-cluster:DescribeGroup',
        'kafka-cluster:AlterGroup',
        'kafka-cluster:ReadData',
        "kafka:CreateVpcConnection",
        "kafka:GetBootstrapBrokers",
        "kafka:DescribeCluster",
        "kafka:DescribeClusterV2"
    ],
    "Resource": [
        "<CLUSTER_ARN>",
        "<TOPIC_ARN>",
        "<GROUP_ARN>"
    ]
}

Update the configured authorized principals by attaching additional IAM permissions depending on specific scenarios. The following examples illustrate what’s being added.

The base access or read permissions are as follows:

{
    "Effect": "Allow",
    "Action": [
        'kafka-cluster:Connect',
        'kafka-cluster:DescribeTopic',
        'kafka-cluster:DescribeGroup',
        'kafka-cluster:AlterGroup',
        'kafka-cluster:ReadData'
    ],
    "Resource": [
        "<CLUSTER_ARN>",
        "<TOPIC_ARN>",
        "<GROUP_ARN>"
    ]
}

If there’s an AWS Glue Schema registry ARN provided as part of the AWS CDK construct parameter, then additional permissions are added to allow access to both the registry and the specific schema:

{
    "Effect": "Allow",
    "Action": [
        "glue:GetRegistry",
        "glue:ListRegistries",
        "glue:GetSchema",
        "glue:ListSchemas",
        "glue:GetSchemaByDefinition",
        "glue:GetSchemaVersion",
        "glue:ListSchemaVersions",
        "glue:GetSchemaVersionsDiff",
        "glue:CheckSchemaVersionValidity",
        "glue:QuerySchemaVersionMetadata",
        "glue:GetTags"
    ],
    "Resource": [
        "<REGISTRY_ARN>",
        "<SCHEMA_ARN>"
    ]
}

If this grant is for a consumer in a different account, the following permissions are also added to allow managed VPC connections to be created by the consumer:

{
    "Effect": "Allow",
    "Action": [
        "kafka:CreateVpcConnection",
        "ec2:CreateTags",
        "ec2:CreateVPCEndpoint"
    ],
    "Resource": "*"
}

Update the Amazon DataZone internal metadata on the progress of the subscription grant (for example, GRANTED or REVOKED). If there’s an exception in a step, it’s handled inside Step Functions and the subscription grant metadata is updated with a failed state (for example, GRANT_FAILED or REVOKE_FAILED).

Because Amazon DataZone supports multi-account architecture, the subscription grant process is a distributed workflow that needs to perform actions across different accounts, and it’s orchestrated from the Amazon DataZone domain account where all the events are received.

Implement streaming governance in Amazon DataZone with DSF

In this section, we deploy an example to illustrate the solution using DSF on AWS, which provides all the required components to accelerate the implementation of the solution. We use the following CDK L3 constructs from DSF:

DataZoneMskAssetType creates the custom asset type representing an Amazon MSK topic in Amazon DataZone
DataZoneGsrMskDataSource automatically creates Amazon MSK topic assets in Amazon DataZone based on schema definitions in the Schema registry
DataZoneMskCentralAuthorizer and DataZoneMskEnvironmentAuthorizer implement the subscription grant process for Amazon MSK topics and IAM authentication

The following diagram is the architecture for the solution.

In this example, we use Python for the example code. DSF also supports TypeScript.

Deployment steps

Follow the steps in the data-solutions-framework-on-aws README to deploy the solution. You need to deploy the CDK stack first, then create the custom environment and redeploy the stack with additional information.

Verify the example is working

To verify the example is working, produce sample data using the Lambda function StreamingGovernanceStack-ProducerLambda. Follow these steps:

Use the AWS Lambda console to test the Lambda function by running a sample test event. The event JSON should be empty. Save your test event and click Test.

Producing test events will generate a new schema producer-data-product in the Schema registry. Check the schema is created from the AWS Glue console using the Data Catalog menu from the left and selecting Stream schema registries.

New data assets should be in the Amazon DataZone portal, under the PRODUCER project
On the DATA tab, in the left navigation pane, select Inventory data, as shown in the following screenshot
Select producer-data-product

Select the BUSINESS METADATA tab to view the business metadata, as shown in the following screenshot.

To view the schema, select the SCHEMA tab, as shown in the following screenshot

To view the lineage, select the LINEAGE tab
To publish the asset, select PUBLISH ASSET, as shown in the following screenshot

To subscribe, follow these steps:

Switch to the consumer project by selecting CONSUMER in the top left of the screen
Select Browse Catalog
Choose producer-data-product and choose SUBSCRIBE, as shown in the following screenshot

subscription

Return to the PRODUCER project and choose producer-data-product, as shown in the following screenshot

Choose APPROVE, as shown in the following screenshot

subscription grant approval

Go to the AWS Identity and Access Management (IAM) console and search for the consumer role. In the role definition, you should see an IAM inline policy with permissions on the Amazon MSK cluster, the Kafka topic, the Kafka consumer group, the AWS Glue schema registry and the schema from the producer.

Now let’s switch to the consumer’s environment in the Amazon Managed Service for Apache Flink console and run the Flink application called flink-consumer using the Run button.

Go back to the Amazon DataZone portal, and confirm that the lineage under the CONSUMER project was updated and the new Flink job run node was added to the lineage graph, as shown in the following screenshot

Clean up

To clean up the resources you created as part of this walkthrough, follow these steps:

Stop the Amazon Managed Streaming for Apache Flink job.
Revoke the subscription grant from the Amazon DataZone console.
Run cdk destroy in your local terminal to delete the stack. Because you marked the constructs with a RemovalPolicy.DESTROY and configured DSF to remove data on destroy, running cdk destroy or deleting the stack from the AWS CloudFormation console will clean up the provisioned resources.

Conclusion

In this post, we shared how you can integrate streaming data from Amazon MSK within Amazon DataZone to create a unified data governance framework that spans the entire data lifecycle, from the ingestion of streaming data to its storage and eventual consumption by diverse producers and consumers.

We also demonstrated how to use the AWS CDK and the DSF on AWS to quickly implement this solution using built-in best practices. In addition to the Amazon DataZone streaming governance, DSF supports other patterns, such as Spark data processing and Amazon Redshift data warehousing. Our roadmap is publicly available, and we look forward to your feature requests, contributions, and feedback. You can get started using DSF by following our Quick start guide.

About the Authors

Vincent Gromakowski is a Principal Analytics Solutions Architect at AWS where he enjoys solving customers’ data challenges. He uses his strong expertise on analytics, distributed systems and resource orchestration platform to be a trusted technical advisor for AWS customers.

Francisco Morillo is a Sr. Streaming Solutions Architect at AWS, specializing in real-time analytics architectures. With over five years in the streaming data space, Francisco has worked as a data analyst for startups and as a big data engineer for consultancies, building streaming data pipelines. He has deep expertise in Amazon Managed Streaming for Apache Kafka (Amazon MSK) and Amazon Managed Service for Apache Flink. Francisco collaborates closely with AWS customers to build scalable streaming data solutions and advanced streaming data lakes, ensuring seamless data processing and real-time insights.

Jan Michael Go Tan is a Principal Solutions Architect for Amazon Web Services. He helps customers design scalable and innovative solutions with the AWS Cloud.

Sofia Zilberman is a Sr. Analytics Specialist Solutions Architect at Amazon Web Services. She has a track record of 15 years of creating large-scale, distributed processing systems. She remains passionate about big data technologies and architecture trends, and is constantly on the lookout for functional and technological innovations.

Amazon Web Services named a Leader in the 2024 Gartner Magic Quadrant for Data Integration Tools

2025-02-26 William Vambenepe

Post Syndicated from William Vambenepe original https://aws.amazon.com/blogs/big-data/amazon-web-services-named-a-leader-in-the-2024-gartner-magic-quadrant-for-data-integration-tools/

Amazon Web Services (AWS) has been recognized as a Leader in the 2024 Gartner Magic Quadrant for Data Integration Tools. We were positioned in the Challengers Quadrant in 2023.

This recognition, we feel, reflects our ongoing commitment to innovation and excellence in data integration, demonstrating our continued progress in providing comprehensive data management solutions.

The Gartner Magic Quadrant evaluates 20 data integration tool vendors based on two axes—Ability to Execute and Completeness of Vision. This evaluation, we feel, critically examines vendors’ capabilities to address key service needs, including data engineering, operational data integration, modern data architecture delivery, and enabling less-technical data integration across various deployment models.

Discover, prepare, and integrate all your data at any scale

AWS Glue is a fully managed, serverless data integration service that simplifies data preparation and transformation across diverse data sources. With its comprehensive suite of tools, AWS Glue allows users to build and manage data pipelines efficiently, without requiring extensive infrastructure management expertise.

Given the diverse data integration needs of customers, AWS offers a robust data integration system through multiple services including Amazon EMR, Amazon Athena, Amazon Managed Workflows for Apache Airflow (Amazon MWAA), Amazon Managed Streaming for Apache Kafka (MSK), Amazon Kinesis, and others. Many thousands of customers across various industries are using these services to transform, operationalize, and manage their data across data lakes and data warehouses.

We have embarked on a journey to unify the broad range of AWS data processing, analytics, and AI capabilities, starting with the announcement of Amazon SageMaker Unified Studio at re:Invent 2024. This includes the data integration capabilities mentioned above, with support for both structured and unstructured data. With an integrated experience for data workers, SageMaker Unified Studio provides an environment where users can collaborate and build faster. It supports model development, generative AI, data processing, and SQL analytics, all accelerated by Amazon Q Developer—our most capable generative AI assistant for software development. The Unified Studio provides unified access to all data sources, whether stored in data lakes, data warehouses, or third-party and federated sources, with robust governance and enterprise-grade security built-in.

Review the Gartner Magic Quadrant

Access a complimentary copy of the full report to see why Gartner positioned AWS as a Leader, and dive deep into the strengths and cautions of AWS. We believe our recognition as a Leader in the Gartner Magic Quadrant is a testament to delivering innovations for our customers.

Gartner does not endorse any vendor, product or service depicted in its research publications and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner’s research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.

GARTNER is a registered trademark and service mark of Gartner and Magic Quadrant is a registered trademark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and are used herein with permission. All rights reserved. This graphic was published by Gartner, Inc. as part of a larger research document and should be evaluated in the context of the entire document. The Gartner document is available upon request from here.

About the authors

William Vambenepe is Director of Product Management at AWS, where he leads the Product Management, Solutions Engineering, and UX Design team for data processing services (Amazon EMR, AWS Glue, Athena, Amazon MWAA), SageMaker Unified Studio, and SageMaker Catalog. Prior to AWS, William worked at Google (6 years building and growing the Data and Analytics product portfolio for Google Cloud, and 5 years as Product Management Director for Google Search). He had previously held software engineering leadership roles at Oracle and HP. William holds an Engineering degree from Ecole Centrale Paris, a graduate Diploma in Computer Science from Cambridge University, and a Masters in Engineering Management from Stanford University.

Santosh Chandrachood has been with AWS for over 8 years and helped build, launch, and scale a variety of AWS services. Currently, Santosh is Director and service leader for Data Processing, managing Amazon EMR, Athena, AWS Glue, and Managed Workflows for Apache Airflow (Amazon MWAA). Santosh also led AWS Data Integration as the General Manager. Before joining AWS, Santosh lead engineering teams in networking, storage, and data infrastructure areas.

Migrate from Standard brokers to Express brokers in Amazon MSK using Amazon MSK Replicator

2025-02-14 Subham Rakshit

Post Syndicated from Subham Rakshit original https://aws.amazon.com/blogs/big-data/migrate-from-standard-brokers-to-express-brokers-in-amazon-msk-using-amazon-msk-replicator/

Amazon Managed Streaming for Apache Kafka (Amazon MSK) now offers a new broker type called Express brokers. It’s designed to deliver up to 3 times more throughput per broker, scale up to 20 times faster, and reduce recovery time by 90% compared to Standard brokers running Apache Kafka. Express brokers come preconfigured with Kafka best practices by default, support Kafka APIs, and provide the same low latency performance that Amazon MSK customers expect, so you can continue using existing client applications without any changes. Express brokers provide straightforward operations with hands-free storage management by offering unlimited storage without pre-provisioning, eliminating disk-related bottlenecks. To learn more about Express brokers, refer to Introducing Express brokers for Amazon MSK to deliver high throughput and faster scaling for your Kafka clusters.

Creating a new cluster with Express brokers is straightforward, as described in Amazon MSK Express brokers. However, if you have an existing MSK cluster, you need to migrate to a new Express based cluster. In this post, we discuss how you should plan and perform the migration to Express brokers for your existing MSK workloads on Standard brokers. Express brokers offer a different user experience and a different shared responsibility boundary, so using them on an existing cluster is not possible. However, you can use Amazon MSK Replicator to copy all data and metadata from your existing MSK cluster to a new cluster comprising of Express brokers.

MSK Replicator offers a built-in replication capability to seamlessly replicate data from one cluster to another. It automatically scales the underlying resources, so you can replicate data on demand without having to monitor or scale capacity. MSK Replicator also replicates Kafka metadata, including topic configurations, access control lists (ACLs), and consumer group offsets.

In the following sections, we discuss how to use MSK Replicator to replicate the data from a Standard broker MSK cluster to an Express broker MSK cluster and the steps involved in migrating the client applications from the old cluster to the new cluster.

Planning your migration

Migrating from Standard brokers to Express brokers requires thorough planning and careful consideration of various factors. In this section, we discuss key aspects to address during the planning phase.

Assessing the source cluster’s infrastructure and needs

It’s crucial to evaluate the capacity and health of the current (source) cluster to make sure it can handle additional consumption during migration, because MSK Replicator will retrieve data from the source cluster. Key checks include:

- CPU utilization – The combined CPU User and CPU System utilization per broker should remain below 60%.
- Network throughput – The cluster-to-cluster replication process adds extra egress traffic, because it might need to replicate the existing data based on business requirements along with the incoming data. For instance, if the ingress volume is X GB/day and data is retained in the cluster for 2 days, replicating the data from the earliest offset would cause the total egress volume for replication to be 2X GB. The cluster must accommodate this increased egress volume.

Let’s take an example where in your existing source cluster you have an average data ingress of 100 MBps and peak data ingress of 400 MBps with retention of 48 hours. Let’s assume you have one consumer of the data you produce to your Kafka cluster, which means that your egress traffic will be same compared to your ingress traffic. Based on this requirement, you can use the Amazon MSK sizing guide to calculate the broker capacity you need to safely handle this workload. In the spreadsheet, you will need to provide your average and maximum ingress/egress traffic in the cells, as shown in the following screenshot.

Because you need to replicate all the data produced in your Kafka cluster, the consumption will be higher than the regular workload. Taking this into account, your overall egress traffic will be at least twice the size of your ingress traffic.
However, when you run a replication tool, the resulting egress traffic will be higher than twice the ingress because you also need to replicate the existing data along with the new incoming data in the cluster. In the preceding example, you have an average ingress of 100 MBps and you retain data for 48 hours, which means that you have a total of approximately 18 TB of existing data in your source cluster that needs to be copied over on top of the new data that’s coming through. Let’s further assume that your goal for the replicator is to catch up in 30 hours. In this case, your replicator needs to copy data at 260 MBps (100 MBps for ingress traffic + 160 MBps (18 TB/30 hours) for existing data) to catch up in 30 hours. The following figure illustrates this process.

Therefore, in the sizing guide’s egress cells, you need to add an additional 260 MBps to your average data out and peak data out to estimate the size of the cluster you should provision to complete the replication safely and on time.

Replication tools act as a consumer to the source cluster, so there is a chance that this replication consumer can consume higher bandwidth, which can negatively impact the existing application client’s produce and consume requests. To control the replication consumer throughput, you can use a consumer-side Kafka quota in the source cluster to limit the replicator throughput. This makes sure that the replicator consumer will throttle when it goes beyond the limit, thereby safeguarding the other consumers. However, if the quota is set too low, the replication throughput will suffer and the replication might never end. Based on the preceding example, you can set a quota for the replicator to be at least 260 MBps, otherwise the replication will not finish in 30 hours.

Volume throughput – Data replication might involve reading from the earliest offset (based on business requirement), impacting your primary storage volume, which in this case is Amazon Elastic Block Store (Amazon EBS). The VolumeReadBytes and VolumeWriteBytes metrics should be checked to make sure the source cluster volume throughput has additional bandwidth to handle any additional read from the disk. Depending on the cluster size and replication data volume, you should provision storage throughput in the cluster. With provisioned storage throughput, you can increase the Amazon EBS throughput up to 1000 MBps depending on the broker size. The maximum volume throughput can be specified depending on broker size and type, as mentioned in Manage storage throughput for Standard brokers in a Amazon MSK cluster. Based on the preceding example, the replicator will start reading from the disk and the volume throughput of 260 MBps will be shared across all the brokers. However, existing consumers can lag, which will cause reading from the disk, thereby increasing the storage read throughput. Also, there is storage write throughput due to incoming data from the producer. In this scenario, enabling provisioned storage throughput will increase the overall EBS volume throughput (read + write) so that existing producer and consumer performance doesn’t get impacted due to the replicator reading data from EBS volumes.
Balanced partitions – Make sure partitions are well-distributed across brokers, with no skewed leader partitions.

Depending on the assessment, you might need to vertically scale up or horizontally scale out the source cluster before migration.

Assessing the target cluster’s infrastructure and needs

Use the same sizing tool to estimate the size of your Express broker cluster. Typically, fewer Express brokers might be needed compared to Standard brokers for the same workload because depending on the instance size, Express brokers allow up to three times more ingress throughput.

Configuring Express Brokers

Express brokers employ opinionated and optimized Kafka configurations, so it’s important to differentiate between configurations that are read-only and those that are read/write during planning. Read/write broker-level configurations should be configured separately as a pre-migration step in the target cluster. Although MSK Replicator will replicate most topic-level configurations, certain topic-level configurations are always set to default values in an Express cluster: replication-factor, min.insync.replicas, and unclean.leader.election.enable. If the default values differ from the source cluster, these configurations will be overridden.

As part of the metadata, MSK Replicator also copies certain ACL types, as mentioned in Metadata replication. It doesn’t explicitly copy the write ACLs except the deny ones. Therefore, if you’re using SASL/SCRAM or mTLS authentication with ACLs rather than AWS Identity and Access Management (IAM) authentication, write ACLs need to be explicitly created in the target cluster.

Client connectivity to the target cluster

Deployment of the target cluster can occur within the same virtual private cloud (VPC) or a different one. Consider any changes to client connectivity, including updates to security groups and IAM policies, during the planning phase.

Migration strategy: All at once vs. wave

Two migration strategies can be adopted:

All at once – All topics are replicated to the target cluster simultaneously, and all clients are migrated at once. Although this approach simplifies the process, it generates significant egress traffic and involves risks to multiple clients if issues arise. However, if there is any failure, you can roll back by redirecting the clients to use the source cluster. It’s recommended to perform the cutover during non-business hours and communicate with stakeholders beforehand.
Wave – Migration is broken into phases, moving a subset of clients (based on business requirements) in each wave. After each phase, the target cluster’s performance can be evaluated before proceeding. This reduces risks and builds confidence in the migration but requires meticulous planning, especially for large clusters with many microservices.

Each strategy has its pros and cons. Choose the one that aligns best with your business needs. For insights, refer to Goldman Sachs’ migration strategy to move from on-premises Kafka to Amazon MSK.

Cutover plan

Although MSK Replicator facilitates seamless data replication with minimal downtime, it’s essential to devise a clear cutover plan. This includes coordinating with stakeholders, stopping producers and consumers in the source cluster, and restarting them in the target cluster. If a failure occurs, you can roll back by redirecting the clients to use the source cluster.

Schema registry

When migrating from a Standard broker to an Express broker cluster, schema registry considerations remain unaffected. Clients can continue using existing schemas for both producing and consuming data with Amazon MSK.

Solution overview

In this setup, two Amazon MSK provisioned clusters are deployed: one with Standard brokers (source) and the other with Express brokers (target). Both clusters are located in the same AWS Region and VPC, with IAM authentication enabled. MSK Replicator is used to replicate topics, data, and configurations from the source cluster to the target cluster. The replicator is configured to maintain identical topic names across both clusters, providing seamless replication without requiring client-side changes.

During the first phase, the source MSK cluster handles client requests. Producers write to the clickstream topic in the source cluster, and a consumer group with the group ID clickstream-consumer reads from the same topic. The following diagram illustrates this architecture.

When data replication to the target MSK cluster is complete, we need to evaluate the health of the target cluster. After confirming the cluster is healthy, we need to migrate the clients in a controlled manner. First, we need to stop the producers, reconfigure them to write to the target cluster, and then restart them. Then, we need to stop the consumers after they have processed all remaining records in the source cluster, reconfigure them to read from the target cluster, and restart them. The following diagram illustrates the new architecture.

After verifying that all clients are functioning correctly with the target cluster using Express brokers, we can safely decommission the source MSK cluster with Standard brokers and the MSK Replicator.

Deployment Steps

In this section, we discuss the step-by-step process to replicate data from an MSK Standard broker cluster to an Express broker cluster using MSK Replicator and also the client migration strategy. For the purpose of the blog, “all at once” migration strategy is used.

Provision the MSK cluster

Download the AWS CloudFormation template to provision the MSK cluster. Deploy the following in us-east-1 with stack name as migration.

This will create the VPC, subnets, and two Amazon MSK provisioned clusters: one with Standard brokers (source) and another with Express brokers (target) within the VPC configured with IAM authentication. It will also create a Kafka client Amazon Elastic Compute Cloud (Amazon EC2) instance where from we can use the Kafka command line to create and view Kafka topics and produce and consume messages to and from the topic.

Configure the MSK client

On the Amazon EC2 console, connect to the EC2 instance named migration-KafkaClientInstance1 using Session Manager, a capability of AWS Systems Manager.

After you log in, you need to configure the source MSK cluster bootstrap address to create a topic and publish data to the cluster. You can get the bootstrap address for IAM authentication from the details page for the MSK cluster (migration-standard-broker-src-cluster) on the Amazon MSK console, under View Client Information. You also need to update the producer.properties and consumer.properties files to reflect the bootstrap address of the standard broker cluster.

sudo su - ec2-user

export BS_SRC=<<SOURCE_MSK_BOOTSTRAP_ADDRESS>>
sed -i "s/BOOTSTRAP_SERVERS_CONFIG=/BOOTSTRAP_SERVERS_CONFIG=${BS_SRC}/g" producer.properties 
sed -i "s/bootstrap.servers=/bootstrap.servers=${BS_SRC}/g" consumer.properties

Create a topic

Create a clickstream topic using the following commands:

/home/ec2-user/kafka/bin/kafka-topics.sh --bootstrap-server=$BS_SRC \
--create --replication-factor 3 --partitions 3 \
--topic clickstream \
--command-config=/home/ec2-user/kafka/config/client_iam.properties

Produce and consume messages to and from the topic

Run the clickstream producer to generate events in the clickstream topic:

cd /home/ec2-user/clickstream-producer-for-apache-kafka/

java -jar target/KafkaClickstreamClient-1.0-SNAPSHOT.jar -t clickstream \
-pfp /home/ec2-user/producer.properties -nt 8 -rf 3600 -iam \
-gsr -gsrr <<REGION>> -grn default-registry -gar

Open another Session Manager instance and from that shell, run the clickstream consumer to consume from the topic:

cd /home/ec2-user/clickstream-consumer-for-apache-kafka/

java -jar target/KafkaClickstreamConsumer-1.0-SNAPSHOT.jar -t clickstream \
-pfp /home/ec2-user/consumer.properties -nt 3 -rf 3600 -iam \
-gsr -gsrr <<REGION>> -grn default-registry

Keep the producer and consumer running. If not interrupted, the producer and consumer will run for 60 minutes before it exits. The -rf parameter controls how long the producer and consumer will run.

Create an MSK replicator

To create an MSK replicator, complete the following steps:

On the Amazon MSK console, choose Replicators in the navigation pane.
Choose Create replicator.
In the Replicator details section, enter a name and optional description.

In the Source cluster section, provide the following information:
1. For Cluster region, choose us-east-1.
2. For MSK cluster, enter the MSK cluster Amazon Resource Name (ARN) for the Standard broker.

After the source cluster is selected, it automatically selects the subnets associated with the primary cluster and the security group associated with the source cluster. You can also select additional security groups.

Make sure that the security groups have outbound rules to allow traffic to your cluster’s security groups. Also make sure that your cluster’s security groups have inbound rules that accept traffic from the replicator security groups provided here.

In the Target cluster section, for MSK cluster¸ enter the MSK cluster ARN for the Express broker.

After the target cluster is selected, it automatically selects the subnets associated with the primary cluster and the security group associated with the source cluster. You can also select additional security groups.

Now let’s provide the replicator settings.

In the Replicator settings section, provide the following information:
1. For the purpose of the example, we have kept the topics to replicate as a default value that would replicate all topics from primary to secondary cluster.
2. For Replicator starting position, we configure it to replicate from the earliest offset, so that we can get all the events from the start of the source topics.
3. To configure the topic name in the secondary cluster as identical to the primary cluster, we select Keep the same topic names for Copy settings. This makes sure that the MSK clients don’t need to add a prefix to the topic names.

1. For this example, we keep the Consumer Group Replication setting as default (make sure it’s enabled to allow redirected clients resume processing data from the last processed offset).
2. We set Target Compression type as None.

The Amazon MSK console will automatically create the required IAM policies. If you’re deploying using the AWS Command Line Interface (AWS CLI), SDK, or AWS CloudFormation, you have to create the IAM policy and use it as per your deployment process.

Choose Create to create the replicator.

The process will take around 15–20 minutes to deploy the replicator. When the MSK replicator is running, this will be reflected in the status.

Monitor replication

When the MSK replicator is up and running, monitor the MessageLag metric. This metric indicates how many messages are yet to be replicated from the source MSK cluster to the target MSK cluster. The MessageLag metric should come down to 0.

Migrate clients from source to target cluster

When the MessageLag metric reaches 0, it indicates that all messages have been replicated from the source MSK cluster to the target MSK cluster. At this stage, you can cut over client applications from the source to the target cluster. Before initiating this step, confirm the health of the target cluster by reviewing the Amazon MSK metrics in Amazon CloudWatch and making sure that the client applications are functioning properly. Then complete the following steps:

Stop the producers writing data to the source (old) cluster with Standard brokers and reconfigure them to write to the target (new) cluster with Express brokers.
Before migrating the consumers, make sure that the MaxOffsetLag metric for the consumers has dropped to 0, confirming that they have processed all existing data in the source cluster.
When this condition is met, stop the consumers and reconfigure them to read from the target cluster.

The offset lag happens if the consumer is consuming slower than the rate the producer is producing data. The flat line in the following metric visualization shows that the producer has stopped producing to the source cluster while the consumer attached to it continues to consume the existing data and eventually consumes all the data, therefore the metric goes to 0.

Now you can update the bootstrap address in properties and consumer.properties to point to the target Express based MSK cluster. You can get the bootstrap address for IAM authentication from the MSK cluster (migration-express-broker-dest-cluster) on the Amazon MSK console under View Client Information.

export BS_TGT=<<TARGET_MSK_BOOTSTRAP_ADDRESS>>
sed -i "s/BOOTSTRAP_SERVERS_CONFIG=.*/BOOTSTRAP_SERVERS_CONFIG=${BS_TGT}/g" producer.properties
sed -i "s/bootstrap.servers=.*/bootstrap.servers=${BS_TGT}/g" consumer.properties

Run the clickstream producer to generate events in the clickstream topic:

cd /home/ec2-user/clickstream-producer-for-apache-kafka/

java -jar target/KafkaClickstreamClient-1.0-SNAPSHOT.jar -t clickstream \
-pfp /home/ec2-user/producer.properties -nt 8 -rf 60 -iam \
-gsr -gsrr <<REGION>> -grn default-registry -gar

In another Session Manager instance and from that shell, run the clickstream consumer to consume from the topic:

cd /home/ec2-user/clickstream-consumer-for-apache-kafka/

java -jar target/KafkaClickstreamConsumer-1.0-SNAPSHOT.jar -t clickstream \
-pfp /home/ec2-user/consumer.properties -nt 3 -rf 60 -iam \
-gsr -gsrr <<REGION>> -grn default-registry

We can see that the producers and consumers are now producing and consuming to the target Express based MSK cluster. The producers and consumers will run for 60 seconds before they exit.

The following screenshot shows producer-produced messages to the new Express based MSK cluster for 60 seconds.

Migrate stateful applications

Stateful applications such as Kafka Streams, KSQL, Apache Spark, and Apache Flink use their own checkpointing mechanisms to store consumer offsets instead of relying on Kafka’s consumer group offset mechanism. When migrating topics from a source cluster to a target cluster, the Kafka offsets in the source will differ from those in the target. As a result, migrating a stateful application along with its state requires careful consideration, because the existing offsets are incompatible with the target cluster’s offsets. Before migrating stateful applications, it is crucial to stop producers and make sure that consumer applications have processed all data from the source MSK cluster.

Migrate Kafka Streams and KSQL applications

Kafka Streams and KSQL store consumer offsets in internal changelog topics. It is advisable not to replicate these internal changelog topics to the target MSK cluster. Instead, the Kafka Streams application should be configured to start from the earliest offset of the source topics in the target cluster. This allows the state to be rebuilt. However, this method results in duplicate processing, because all the data in the topic is reprocessed. Therefore, the target destination (such as a database) must be idempotent to handle these duplicates effectively.

Express brokers don’t allow configuring segment.bytes to optimize performance. Therefore, the internal topics need to be manually created before the Kafka Streams application is migrated to the new Express based cluster. For more information, refer to Using Kafka Streams with MSK Express brokers and MSK Serverless.

Migrate Spark applications

Spark stores offsets in its checkpoint location, which should be a file system compatible with HDFS, such as Amazon Simple Storage Service (Amazon S3). After migrating the Spark application to the target MSK cluster, you should remove the checkpoint location, causing the Spark application to lose its state. To rebuild the state, configure the Spark application to start processing from the earliest offset of the source topics in the target cluster. This will lead to re-processing all the data from the start of the topic and therefore will generate duplicate data. Consequently, the target destination (such as a database) must be idempotent to effectively handle these duplicates.

Migrate Flink applications

Flink stores consumer offsets within the state of its Kafka source operator. When checkpoints are completed, the Kafka source commits the current consuming offset to provide consistency between Flink’s checkpoint state and the offsets committed on Kafka brokers. Unlike other systems, Flink applications don’t rely on the __consumer_offsets topic to track offsets; instead, they use the offsets stored in Flink’s state.

During Flink application migration, one approach is to start the application without a Savepoint. This approach discards the entire state and reverts to reading from the last committed offset of the consumer group. However, this prevents the application from accurately rebuilding the state of downstream Flink operators, leading to discrepancies in computation results. To address this, you can either avoid replicating the consumer group of the Flink application or assign a new consumer group to the application when restarting it in the target cluster. Additionally, configure the application to start reading from the earliest offset of the source topics. This enables re-processing all data from the source topics and rebuilding the state. However, this method will result in duplicate data, so the target system (such as a database) must be idempotent to handle these duplicates effectively.

Alternatively, you can reset the state of the Kafka source operator. Flink uses operator IDs (UIDs) to map the state to specific operators. When restarting the application from a Savepoint, Flink matches the state to operators based on their assigned IDs. It is recommended to assign a unique ID to each operator to enable seamless state restoration from Savepoints. To reset the state of the Kafka source operator, change its operator ID. Passing the operator ID as a parameter in a configuration file can simplify this process. Restart the Flink application with parameter --allowNonRestoredState (if you are running self-managed Flink). This will reset only the state of the Kafka source operator, leaving other operator states unaffected. As a result, the Kafka source operator resumes from the last committed offset of the consumer group, avoiding full reprocessing and state rebuilding. Although this might still produce some duplicates in the output, it results in no data loss. This approach is applicable only when using the DataStream API to build Flink applications.

Conclusion

Migrating from a Standard broker MSK cluster to an Express broker MSK cluster using MSK Replicator provides a seamless, efficient transition with minimal downtime. By following the steps and strategies discussed in this post, you can take advantage of the high-performance, cost-effective benefits of Express brokers while maintaining data consistency and application uptime.

Ready to optimize your Kafka infrastructure? Start planning your migration to Amazon MSK Express brokers today and experience improved scalability, speed, and reliability. For more details, refer to the Amazon MSK Developer Guide.

About the Author

Subham Rakshit is a Senior Streaming Solutions Architect for Analytics at AWS based in the UK. He works with customers to design and build streaming architectures so they can get value from analyzing their streaming data. His two little daughters keep him occupied most of the time outside work, and he loves solving jigsaw puzzles with them. Connect with him on LinkedIn.

Serverless ICYMI Q4 2024

2025-01-16 Eric Johnson

Post Syndicated from Eric Johnson original https://aws.amazon.com/blogs/compute/serverless-icymi-q4-2024/

Welcome to the 27^th edition of the AWS Serverless ICYMI (in case you missed it) quarterly recap. At the end of a quarter, we share the most recent product launches, feature enhancements, blog posts, webinars, live streams, and other interesting things that you might have missed!

In case you missed our last ICYMI, check out what happened in Q2 here.

Calendar showing October through December 2024

2024 Q4 calender

Serverless at re:Invent 2024

AWS re:Invent 2024 had 60,000 in-person attendees and 400,000 online viewers for the keynotes. The conference delivered 1,900 sessions from 3,500 speakers and included 546 AWS service and feature announcements.

The serverless content consisted of two tracks: Serverless (SVS) and App Integration (API). These tracks included 70 unique sessions and attracted nearly 11,000 attendees. Serverlesspresso, the coffee shop powered by serverless technology, operated in two locations during the event: the Expo Hall and the certification lounge.

Crowd of people standing around the AWS reI:nvent expo hall waiting to order coffee at the Serverlesspresso booth.

Serverlesspresso booth in the expo hall

Videos are available on Serverless Land YouTube.

AWS Lambda and Amazon Elastic Container Service (Amazon ECS) 10-year anniversary.

AWS marked significant milestones in serverless computing, celebrating 10 years of AWS Lambda and Amazon ECS. Lambda now serves over 1.5 million monthly customers and processes tens of trillions of requests each month. Amazon ECS launches more than 2.4 billion container tasks weekly and is used by over 65% of new AWS container customers.

AWS is commemorating this anniversary with insights from AWS Serverless Heroes, product leads, principal engineers, and AWS leadership sharing their perspectives on serverless evolution and future directions. These stories and insights are available at https://aws.amazon.com/serverless/10th-anniversary/.

AWS Lambda

The AWS Lambda team has spent a significant amount of time improving the Lambda development experience. Several enhancements have been made in the console as well as the local development experience.

Code-OSS as the new AWS Lambda inline editor

Lambda has launched a significant upgrade to its console by integrating Code-OSS, the open-source version of Visual Studio Code, delivering a familiar development experience directly in the cloud. The new Lambda Code Editor supports viewing larger function packages up to 50 MB, features a split-screen interface for simultaneous code editing and testing, and includes built-in Amazon Q Developer AI assistance for real-time coding suggestions. This enhancement comes at no additional cost and prioritizes accessibility with features like screen reader support and keyboard navigation. The update bridges the gap between cloud and local development by simplifying the process of downloading function code and AWS SAM templates, ultimately providing developers with a more streamlined and familiar serverless development experience. Watch the video explaining the changes in detail.

Additionally, the Lambda console enhances developer experience with two new features: a built-in CloudWatch Metrics Insights dashboard that surfaces key function metrics, and CloudWatch Logs Live Tail support for real-time log streaming and analysis, enabling faster troubleshooting without leaving the Lambda environment.

Top 10 Functions

Lambda now supports native JSON structured logging for .NET managed runtime applications, improving log searchability and analysis capabilities without requiring manual configuration of logging libraries.

Lambda has expanded its runtime support by adding Python 3.13 and Node.js 22 as both managed runtimes and container base images, providing access to the latest language features and ensuring long-term support through October 2029 and April 2027, respectively.

Lambda SnapStart capability is now available for Python and .NET runtimes, delivering sub-second startup performance for latency-sensitive applications by caching initialized execution environments.

Diagram of how SnapStart works compared to not having SnapStart

SnapStart support comparison

New CloudWatch metrics for Lambda Event Source Mappings provide enhanced visibility into event processing states for Amazon Simple Queue Service (SQS), Amazon Kinesis, and Amazon DynamoDB event sources, helping customers monitor and troubleshoot event processing issues.

Lambda introduces Provisioned Mode for Kafka event source mappings, allowing customers to optimize throughput by configuring dedicated event polling resources for applications with stringent performance requirements.

Finally, Lambda introduces an enhanced local development experience through the AWS Toolkit for Visual Studio Code, streamlining the serverless application development workflow. The update features a new Application Builder interface that guides developers through environment setup, offers sample applications, and provides quick-action buttons for common tasks like build, deploy, and invoke operations. Developers can now efficiently iterate on their code with features such as configurable build settings, step-through debugging, and the ability to sync local changes quickly to the cloud or perform full deployments. The toolkit integrates with AWS Infrastructure Composer for visual application building and includes comprehensive local testing capabilities with shareable test events. This enhancement simplifies the Lambda development process by enabling developers to author, test, debug, and deploy serverless applications without leaving their preferred IDE environment.

Screen capture of the getting started experience for serverless in a local IDE

Local IDE getting started

Amazon ECS and AWS Fargate

AWS enhances observability for containerized applications with CloudWatch Application Signals for Amazon ECS, adding infrastructure metrics correlation to existing traces and logs monitoring, enabling operators to identify and resolve performance issues across their application stack.

Amazon ECS adds service revision and deployment history tracking, allowing customers to monitor changes, track ongoing deployments, and debug deployment failures for long-running applications deployed after October 25, 2024.

A graph explaining the flow for service order and history

Service revisions and deployment history

Amazon ECS expands testing capabilities by supporting network fault injection experiments on AWS Fargate through AWS Fault Injection Service, enabling developers to verify application resilience using six different types of fault injection actions, including network disruptions and resource stress testing.

Amazon EventBridge

Amazon EventBridge announces significant performance improvements, reducing end-to-end latency by up to 94% from 2,235ms to 129.33ms at P99, enabling faster event processing for time-sensitive applications like fraud detection and gaming.

Amazon EventBridge and AWS Step Functions now integrate with private APIs through AWS PrivateLink and Amazon VPC Lattice, enabling secure connectivity between cloud and on-premises applications without custom networking code.

Screen capture of the Amazon EventBridge create connection screen showing the new Private option

Connections to Private APIs

EventBridge API destinations introduces proactive OAuth token refresh for public and private authorization endpoints, helping prevent delays and errors by automatically refreshing tokens before expiration.

AWS Step Functions

AWS Step Functions introduces the ability to export workflows as CloudFormation or SAM templates directly from the AWS console, enabling repeatable provisioning across accounts. Developers can export and customize templates from existing workflows, and use AWS Infrastructure Composer to visually connect workflows with other AWS resources.

Step Functions also adds Variables and JSONata support to enhance workflow development. Variables allow data assignment and reference between states, simplifying payload management, while JSONata provides advanced data transformation capabilities, including date formatting and mathematical operations. These features reduce the need for custom code and intermediate states, making it easier to build distributed serverless applications. Watch the in depth video to learn more.

JSONata and variables

Amazon Kinesis

Amazon Kinesis introduces significant updates to its client libraries. The new Kinesis Client Library (KCL) 3.0 reduces compute costs by up to 33% through enhanced load balancing, while the Kinesis Producer Library (KPL) 1.0 improves performance and security. Both libraries now support AWS SDK for Java 2.x and eliminate dependencies on SDK for Java 1.x, enabling seamless upgrades without requiring application code changes.

KCL 3.0 metrics

Amazon MQ

Amazon MQ adds support for AWS PrivateLink, enabling customers to access Amazon MQ API endpoints directly from their VPC through interface VPC endpoints, eliminating the need for internet access and providing enhanced security through AWS’s internal network infrastructure.

Amazon Finch

AWS announces general availability of Linux support for Finch, an open source container development tool that simplifies building, running, and publishing Linux containers across all major operating systems. The release includes support for the Finch Daemon with Docker API compatibility and is available through RPM packages for Amazon Linux 2 and Amazon Linux 2023.

Amazon Simple Queue Service (SQS)

Amazon SQS increases the in-flight message limit for FIFO queues from 20,000 to 120,000 messages, enabling higher concurrent message processing. This enhancement allows customers to scale their receivers and process up to six times more messages simultaneously, provided they have sufficient publish throughput.

Amazon Managed Streaming for Apache Kafka(Amazon MSK)

Amazon MSK now introduces Managed Streaming for Apache Flink blueprints to simplify real-time AI application development. The service enables vector-embedding generation through Amazon Bedrock, streamlining the integration of streaming data with generative AI models. Using a straightforward configuration process, users can generate and index vector embeddings in Amazon OpenSearch, while leveraging LangChain’s data chunking capabilities for enhanced data retrieval efficiency. The service handles all integration aspects between MSK, embedding models, and Amazon OpenSearch vector stores.

AWS Amplify

AWS Amplify launches the Amplify AI kit for Amazon Bedrock, providing fullstack developers with tools to integrate AI capabilities into web applications. The kit includes a customizable React UI component, secure Bedrock access, and context-sharing features, enabling developers to implement chat, search, and summarization functionalities without machine learning expertise.

AWS AppSync

AWS AppSync launches AppSync Events, enabling developers to broadcast real-time data to multiple subscribers through serverless WebSocket APIs. The service eliminates the need to build and manage WebSocket infrastructure while providing secure, scalable event broadcasting capabilities. Developers can create APIs that automatically scale and integrate with services like Amazon EventBridge. The system supports features such as channel namespaces, event handlers, and multiple authorization modes, and is available in all regions where AWS AppSync operates. Users only pay for API operations and real-time connection minutes used.

Creating an AppSunc Event API

Amazon API Gateway

Amazon API Gateway released a significant enhancement to Amazon API Gateway, enabling customers to manage private REST APIs using custom private DNS names. This highly requested feature allows API providers to use user-friendly domain names like private.example.com, while maintaining TLS encryption for security. The implementation process involves creating a private custom domain, configuring certificates through AWS Certificate Manager (ACM), mapping private APIs, and setting resource policies. The feature supports cross-account sharing through AWS Resource Access Manager (AWS RAM) and is now available in all AWS Regions, including AWS GovCloud (US).

Serverless blog posts

October

November

Serverless Office Hours

Serverless office hours videos

October

Oct 1 – Fullstack apps with Amplify Gen 2
Oct 8 – Step Functions + containers
Oct 22 – GraphQL fun with AppSync
Oct 29 – Serverless testing with Pawel Zubkiewicz

November

Still looking for more?

The Serverless landing page has more information. The Lambda resources page contains case studies, webinars, whitepapers, customer stories, reference architectures, and even more Getting Started tutorials.

You can also follow the Serverless Developer Advocacy team on X (formerly Twitter) to see the latest news, follow conversations, and interact with the team.

Eric Johnson: @edjgeek
Julian Wood: @julian_wood
Marcia Villalba: @mavi888uy
Romain Jourdan: @rjourdan_net

And finally, visit the Serverless Land for all your serverless needs.

Automate topic provisioning and configuration using Terraform with Amazon MSK

2025-01-16 Vijay Kardile

Post Syndicated from Vijay Kardile original https://aws.amazon.com/blogs/big-data/automate-topic-provisioning-and-configuration-using-terraform-with-amazon-msk/

As organizations deploy Amazon Managed Streaming for Apache Kafka (Amazon MSK) clusters across multiple use cases, the manual management of topic configurations can be challenging. This can lead to several issues:

Inefficiency – Manual configuration is time-consuming and error-prone, especially for large deployments. Maintaining consistency across multiple configurations can be difficult. To avoid this, Kafka administrators often set the create.topics.enable property on brokers, which leads to cluster operation inefficiency.
Human error – Manual configuration increases the risk of mistakes that can disrupt data flow and impact applications relying on Amazon MSK.
Scalability challenges – Scaling an Amazon MSK environment with manual configuration is cumbersome. Adding new topics or modifying existing ones requires manual intervention, hindering agility.

These challenges highlight the need for a more automated and robust approach to MSK topic configuration management.

In this post, we address this problem by using Terraform to optimize the configuration of MSK topics. This solution supports both provisioned and serverless MSK clusters.

Solution overview

Customers want a better way to manage the overhead of topics and their configurations. Manually handling topic configurations can be cumbersome and error-prone, making it difficult to keep track of changes and updates.

To address these challenges, you can use Terraform, an infrastructure as code (IaC) tool by HashiCorp. Terraform allows you to manage and provision infrastructure declaratively. It uses human-readable configuration files written in HashiCorp Configuration Language (HCL) to define the desired state of infrastructure resources. These resources can span virtual machines, networks, databases, and a vast array of cloud provider-specific offerings.

Terraform offers a compelling solution to the challenges of manual Kafka topic configuration. Terraform allows you to define and manage your Kafka topics through code. This approach provides several key benefits:

Automation – Terraform automates the creation, modification, and deletion of MSK topics.
Consistency and repeatability – Terraform configurations provide consistent topic structures and settings across your entire Amazon MSK environment. This simplifies management and reduces the likelihood of configuration drift.
Scalability – Terraform enables you to provision and manage large numbers of MSK topics, facilitating the growth of your Amazon MSK environment.
Version control – Terraform configurations are stored in version control systems, allowing you to track changes, roll back if needed, and collaborate effectively on your Amazon MSK infrastructure.

By using Terraform for MSK topic configuration management, you can streamline your operations, minimize errors, and have a robust and scalable Amazon MSK environment.

In this post, we provide a comprehensive guide for using Terraform to manage Amazon MSK configurations. We explore the process of installing Terraform on Amazon Elastic Compute Cloud (Amazon EC2), defining and decentralizing topic configurations, and deploying and updating configurations in an automated manner.

Prerequisites

Before proceeding with the solution, make sure you have the following resources and access:

To simplify the setup, use the provided AWS CloudFormation template. This template will create the necessary Amazon MSK provisioned cluster and required resources for this post. You can create an MSK Serverless cluster using the Amazon MSK console and use it in this solution. This is sample template, not production ready, and AWS Identity and Access Management (IAM) policies should be implemented using best practices and the principle of least privilege. For more details, see Get started with AWS managed policies and move toward least-privilege permissions. An EC2 instance will be created as a part of this template. The MSK cluster and EC2 instance will be created on a single virtual private cloud (VPC); however, you can install Terraform in a different account or on different VPC. For more details, see Connect Kafka client applications securely to your Amazon MSK cluster from different VPCs and AWS accounts.
For this post, we use the latest Terraform version (1.10.x) and Terraform plugins – Mongey/Kafka provider. In Terraform, plugins are binary executables responsible for implementing resource types and providers. The plugins are installed automatically when we initialize a Terraform configuration using the terraform init
You need access to an AWS account with sufficient permissions to create and manage resources, including IAM roles and MSK clusters. For more information, see IAM access control.

By making sure you have these prerequisites in place, you will be ready to streamline your topic configurations with Terraform.

Install Terraform on your client machine

When your cluster and client machine are ready, SSH to your client machine (Amazon EC2) and install Terraform.

Run the following commands to install Terraform:

sudo yum update -y
sudo yum install -y yum-utils shadow-utils
sudo yum-config-manager --add-repo https://rpm.releases.hashicorp.com/AmazonLinux/hashicorp.repo
sudo yum -y install terraform

Run the following command to check the installation:
```
terraform -v
```

This indicates that Terraform installation is successful and you are ready to automate your MSK topic configuration.

Provision an MSK topic using Terraform

To provision the MSK topic, complete the following steps:

Create a new file called main.tf and copy the following code into this file, replacing the BOOTSTRAP_SERVERS and AWS_REGION information with the details for your cluster. For instructions on retrieving the bootstrap_servers information for IAM authentication from your MSK cluster, see Getting the bootstrap brokers for an Amazon MSK cluster. This script is common for Amazon MSK provisioned and MSK Serverless.
```
terraform {
required_providers {
kafka = {
source = "Mongey/kafka" }}}
provider "kafka" {
bootstrap_servers = [{BOOTSTRAP_SERVERS}]
tls_enabled       = true
sasl_mechanism    = "aws-iam"
sasl_aws_region   ={AWS_REGION}
sasl_aws_profile  = "dev" }
resource "kafka_topic" "sampleTopic" {
name               = "sampleTopic"
replication_factor = 1
partitions         = 50 }
```

Add IAM bootstrap servers endpoints in a comma separated list format:

BOOTSTRAP_SERVERS = ["b-2.mskcluster…. ","b-3.mskcluster…. ","b-1.mskcluster…. "]

Run the command terraform init to initialize Terraform and download the required providers.

The terraform init command initializes a working directory containing Terraform configuration files(main.tf). This is the first command that should be run after writing a new Terraform configuration.

Run the command terraform plan to review the run plan.

This command shows the changes that Terraform will make to the infrastructure based on the provided configuration. This step is optional but is often used as a preview of the changes Terraform will make.

If the plan looks correct, run the command terraform apply to apply the configuration.
When prompted for confirmation before proceeding, enter yes.

The terraform apply command runs the actions proposed in a Terraform plan. Terraform will create the sampleTopic topic in your MSK cluster.

After the terraform apply command is complete, verify the infrastructure has been created with the help of the kafka-topics.sh utility:
```
kafka/bin/kafka-topics.sh 
--bootstrap-server "b-1…..amazonaws.com:9098" 
--command-config ./kafka/bin/client.properties  
--list
```

You can use the kafka-toipcs.sh tool with the --list option to retrieve a list of topics associated with your MSK cluster. For more information, refer to the createtopic documentation.

Update the MSK topic configuration using Terraform

To update the MSK topic configuration, let’s assume we want to change the number of partitions from 50 to 10 on our topic. We need to perform the following steps:

Verify the number of partitions on the topic using the --describe command:

kafka/bin/kafka-topics.sh 
--bootstrap-server "b-1…...amazonaws.com:9098" 
--command-config ./kafka/bin/client.properties  
--describe 
--topic sampleTopic

This command will show 50 partitions on the sampleTopic topic.

Modify the Terraform file main.tf and change the value of the partitions parameter to 10:

resource "kafka_topic" "sampleTopic" {
name               = " sampleTopic "
replication_factor = 1
partitions         = 10 }

Run the command terraform plan to review the run plan.

If the plan shows the changes, run the command terraform apply to apply the configuration.
When prompted for confirmation before proceeding, enter yes.

Terraform will drop and recreate the sampleTopic topic with the changed configuration.

Verify the changed number of partitions on the topic, ad rerun the --describe command:

kafka/bin/kafka-topics.sh 
--bootstrap-server "b-1…...amazonaws.com:9098" 
--command-config ./kafka/bin/client.properties  
--describe --topic sampleTopic

Now, this command will show 10 partitions on the sampleTopic topic.

Delete the MSK topic using Terraform

When you no longer need the infrastructure, you can remove all resources created by your Terraform file.

Run the command terraform destroy to remove the topic.
When prompted for confirmation before proceeding, enter yes.

Terraform will delete the sampleTopic topic from your MSK cluster.

To verify, rerun the --list command:

kafka/bin/kafka-topics.sh 
--bootstrap-server "b-1…..amazonaws.com:9098" 
--command-config ./kafka/bin/client.properties  
--list

Now, this command will not show the sampleTopic topic.

Conclusion

In this post, we addressed the common challenges associated with manual MSK topic configuration management and presented a robust Terraform-based solution. Using Terraform for automated topic provisioning and configuration streamlines your processes, fosters scalability, and enhances flexibility. Additionally, it facilitates automated deployments and centralized management.

We encourage you to explore Terraform as a means to optimize Amazon MSK configurations and unlock further efficiencies within your streaming data pipelines.

About the author

Vijay Kardile is a Sr. Technical Account Manager with Enterprise Support, India. With over two decades of experience in IT Consulting and Engineering, he specializes in Analytics services, particularly Amazon EMR and Amazon MSK. He has empowered numerous enterprise clients by facilitating their adoption of various AWS services and offering expert guidance on attaining operational excellence.

Fitch Group achieves multi-Region resiliency for mission-critical Kafka infrastructure with Amazon MSK Replicator

2024-12-23 Kalyan Janaki

Post Syndicated from Kalyan Janaki original https://aws.amazon.com/blogs/big-data/fitch-group-achieves-multi-region-resiliency-for-mission-critical-kafka-infrastructure-with-amazon-msk-replicator/

Real-time data streaming and event processing are critical components of modern distributed systems architectures. Apache Kafka has emerged as a leading platform for building real-time data pipelines and enabling asynchronous communication between microservices and applications. However, running and managing Kafka clusters at scale can be challenging, requiring specialized expertise and significant operational overhead.

Amazon Managed Streaming for Apache Kafka (Amazon MSK) is a fully managed service that allows you to build and run production Kafka applications. With Amazon MSK, you can rely on AWS to handle the heavy lifting of provisioning and managing Kafka clusters, while you focus on building innovative applications and real-time data processing pipelines.

In this post, we explore how Fitch Group, one of the top credit rating companies, used Amazon MSK and Amazon MSK Replicator to achieve multi-Region resiliency for their mission-critical Kafka infrastructure.

About Fitch Group and their need for multi-region resiliency

As a leading global financial information services provider, Fitch Group delivers vital credit and risk insights, robust data, and dynamic tools to champion more efficient, transparent financial markets. With employees in over 30 countries, Fitch Group’s culture of credibility, independence, and transparency is embedded throughout its structure, which includes Fitch Ratings, one of the world’s top three credit ratings agencies, and Fitch Solutions, a leading provider of insights, data, and analytics.

To stay competitive and efficient in the fast-paced financial industry, Fitch Group strategically adopted an event-driven microservices architecture. At the heart of this ecosystem lies Kafka, specifically Amazon MSK, which serves as the backbone for their data integration systems.

Fitch Group uses Kafka to enable applications to send ratings-related business events, facilitating automation within their ratings workflow systems and providing real-time or near real-time processing. This architectural choice has significantly reduced the time to market for end-user-facing systems like Fitch Ratings Pro and Fitch Group Ratings websites. Moreover, Kafka’s robust capabilities allow for seamless aggregation and distribution of data from many disparate systems through their data platform, enhancing data consistency, reliability, and accessibility across the organization.

Given the critical role that Kafka plays in Fitch Group architecture, providing robust disaster recovery (DR) mechanisms became paramount. Any disruption to their Kafka infrastructure could have significant repercussions on their ratings workflow automation, real-time processing, and end-user-facing systems, potentially exposing Fitch Group to regulatory, financial, and reputational risks.

To achieve the desired levels of resiliency, Fitch Group had the following key requirements:

Multi-Region deployment – Deploy MSK clusters across multiple AWS Regions to provide business continuity and maintain service availability during Regional or service events
Automated replication – Replicate Kafka data across Regions in near real time with minimal latency and data loss
Consistent topic namespaces – Maintain the same Kafka topic names and structures across source and destination clusters to minimize application changes
Rapid recovery – In the event of a failover, enable applications to seamlessly start consuming from the replicated cluster with minimal Recovery Time Objective (RTO) and Recovery Point Objective (RPO)

Solution overview

Fitch Group chose to implement their multi-Region Kafka deployment using Amazon MSK and MSK Replicator. MSK Replicator is a fully managed replication service that enables continuous, automated data replication between MSK clusters within the same Region or across different Regions. It supports replicating data between clusters with different configurations, including varying broker counts, storage volumes, and Kafka versions. Here’s how Fitch Group used MSK Replicator to achieve their multi-Region resiliency goals:

Deployed MSK clusters in two separate Regions, with the primary cluster in the main Region and the secondary cluster in a different Region for disaster recovery
Configured MSK Replicator to continuously replicate data from the primary cluster to the secondary cluster, maintaining the same topic names and structures across both clusters
Implemented application failover logic to automatically switch to consuming from the secondary cluster in case of a primary cluster unavailability, with minimal recovery time and data loss

The following diagram illustrates this architecture

Benefits achieved

By implementing Amazon MSK and MSK Replicator, Fitch Group realized several key benefits:

Enhanced disaster recovery – The multi-Region deployment provides business continuity even in the face of Regional or service events.
Simplified operations – The managed capability of MSK Replicator offloads the operational complexity of self-managing custom replication solutions, reducing the burden on Fitch Group’s IT team
Scalability – The solution can scale to handle varying data loads, making sure that DR capabilities grow alongside business needs
Minimal application changes – MSK Replicator supports replicating topics with the same name, which eliminates the need for consumer application modifications, reducing development effort and potential errors
Seamless failover and failback – Bidirectional replication capabilities enable quick switching of operations to the standby Region with minimal disruption, and straightforward reversion after the primary Region is restored
Improved testing capabilities – The setup facilitates regular DR exercises without impacting production systems, allowing Fitch Group to validate their DR plans consistently

Conclusion

By using Amazon MSK and MSK Replicator, Fitch Group has successfully implemented a highly resilient and scalable Kafka infrastructure that meets their stringent business continuity and disaster recovery requirements. This multi-Region deployment enables them to process mission-critical financial data at scale while providing minimal downtime and data loss in the event of service events or disasters. As Fitch Group continues to innovate and grow, their robust Kafka infrastructure provides a solid foundation for future expansion and the development of new data-driven services, ultimately enhancing their ability to deliver timely and accurate financial insights to their clients.

About the authors

Kalyan Janaki is Senior Big Data & Analytics Specialist with Amazon Web Services. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS.

Venu Nemallikanti is the Enterprise Architect and Lead for Event Streaming at Fitch Group, a globally recognized financial information services provider operating in over 30 countries. His primary responsibilities include overseeing the architecture and implementation of event streaming solutions, ensuring the seamless integration and performance of systems that deliver credit ratings, research, data, and analytics to a worldwide clientele.

Chaitanya Shah is a Principal Technical Account Manager with AWS, based out of New York. He loves to code and actively contributes to the AWS solutions labs to help customers solve complex problems. He provides guidance to AWS customers on best practices for their Cloud migrations. He is also specialized in AWS data transfer and the data and analytics domain.

Oleg Chugaev is a Principal Solutions Architect and Serverless evangelist with 20+ years in IT, holding multiple AWS certifications. At AWS, he drives customers through their cloud transformation journeys by converting complex challenges into actionable roadmaps for both technical and business audiences.

Top 6 game changers from AWS that redefine streaming data

2024-12-17 Sai Maddali

Post Syndicated from Sai Maddali original https://aws.amazon.com/blogs/big-data/top-6-game-changers-from-aws-that-redefine-streaming-data/

Recently, AWS introduced over 50 new capabilities across its streaming services, significantly enhancing performance, scale, and cost-efficiency. Some of these innovations have tripled performance, provided 20 times faster scaling, and reduced failure recovery times by up to 90%. We have made it nearly effortless for customers to bring real-time context to AI applications and lakehouses.

In this post, we discuss the top six game changers that will redefine AWS streaming data.

Amazon MSK Express brokers: Kafka reimagined for AWS

AWS offers Express brokers for Amazon Managed Streaming for Apache Kafka (Amazon MSK)—a transformative breakthrough for customers needing high-throughput Kafka clusters that scale faster and cost less. With Express brokers, we are reimagining Kafka’s compute and storage decoupling to unlock performance and elasticity benefits. Express brokers offer up to three times more throughput than a comparable standard Apache Kafka broker, virtually unlimited storage, instant storage scaling, compute scaling in minutes vs. hours, and 90% faster recovery from failures compared to standard Kafka brokers. Customers can provision capacity in minutes without complex calculations, benefit from preset Kafka configurations, and scale capacity in a few clicks. Express brokers provide the same low-latency performance as standard Kafka, are 100% native Kafka, and offer key Amazon MSK features. There are no storage limits per broker and you only pay for the storage you use. With Express brokers for Amazon MSK, enterprises can expand their Kafka usage to support even more mission-critical use cases, while keeping both operational overhead and overall infrastructure costs low.

Amazon Kinesis Data Streams On-Demand: Scaling new heights

Amazon Kinesis Data Streams On-Demand makes it uncomplicated for developers to stream gigabytes per second of data without managing capacity or servers. Developers can create a new on-demand data stream or convert an existing data stream to on-demand mode with a single click. Kinesis Data Streams On-Demand now automatically scales to 10 GBps of write throughput and 200 GBps of read throughput per stream, a fivefold increase. Customers will automatically get this fivefold increase in scale without the need to take any action.

Streaming data to Iceberg tables in lakehouses

Enterprises are embracing lakehouses and open table formats such as Apache Iceberg to unlock value from their data. Amazon Data Firehose now supports seamless integration with Iceberg tables on Amazon Simple Storage Service (Amazon S3). Customers can stream data into Iceberg tables in Amazon S3 without any management overhead. Data Firehose compacts small files, minimizing storage inefficiencies and enhancing read performance. Data Firehose also handles schema changes while in flight, to provide consistency across evolving datasets. Because Data Firehose is fully managed and serverless, it scales seamlessly to handle high throughput streaming workloads, providing reliable and fast delivery of data. This capability also makes it straightforward to stream data stored in MSK topics and Kinesis data streams into Iceberg tables, potentially eliminating the need for custom extract, transform, and load (ETL) pipelines. Customers can now bring the power of real-time data to Iceberg tables without any additional effort—a paradigm shift for businesses. Additionally, Kinesis Data Firehose serves as a versatile bridge to stream real-time data from MSK clusters and Kinesis Data Streams into the newly launched Amazon S3 Tables and Amazon SageMaker Lakehouse. This unified approach facilitates more effective data management and analysis, supporting data-driven decision-making across the enterprise.

Unlocking the value of data stored in databases with change replication to Iceberg tables

Delivering database changes into Iceberg tables is emerging as a common pattern. Now in public preview, Data Firehose supports capturing changes made in databases such as PostgreSQL and MySQL and replicating the updates to Iceberg tables on Amazon S3. The integration uses change data capture (CDC) to continuously deliver database updates, eliminating manual processes and reducing operational overhead. Data Firehose automates tasks such as schema alignment and partitioning, making sure tables are optimized for analytics. With this new capability, customers can streamline their end-to-end data pipeline, allowing them to continually feed fresh data into an Iceberg table without needing to build a custom data pipeline.

Real-time context to generative AI applications

Customers tell us how they want to gain insights from generative AI by being able to bring their data to large language models (LLMs). They want to bring data as it’s generated to pre-trained models for more accurate and up-to-date responses. Amazon MSK provides a blueprint that allows customers to combine the context from real-time data with the powerful LLMs on Amazon Bedrock to generate accurate, up-to-date AI responses without writing custom code. Developers can configure the blueprint to generate vector embeddings using Amazon Bedrock embedding models, then index those embeddings in Amazon OpenSearch Service for data captured and stored in MSK topics. Customers can also improve the efficiency of data retrieval using built-in support for data chunking techniques from LangChain, an open source library, supporting high-quality inputs for model ingestion.

More cost-effective and reliable stream processing

AWS offers the Kinesis Client Library (KCL), an open source library, that simplifies the development of stream processing applications with Kinesis Data Streams. With KCL 3.0, customers can reduce compute costs to process streaming data by up to 33% compared to previous KCL versions. KCL 3.0 introduces an enhanced load balancing algorithm that continuously monitors the resource utilization of the stream processing workers and automatically redistributes the load from over-utilized workers to underutilized workers. These changes also enhance scalability and the overall efficiency of processing large volumes of streaming data. We have also made improvements to our Amazon Managed Service for Apache Flink. We offer the latest Flink versions on Amazon Managed Service for Apache Flink for customers to benefit from the latest innovations. Customers can also upgrade their existing applications to use new Flink versions with a new in-place version upgrade feature. Amazon Managed Service for Apache Flink now offers per-second billing, so customers can run their Flink applications for a short period and only pay for what they use, down to the nearest second.

Conclusion

AWS has made new innovations in data streaming services, bringing compelling value to customers on performance, scalability, elasticity, and ease of use. These advancements empower businesses to use real-time data more effectively, which modernizes the way for the next generation of data-driven applications and analytics. It is still Day 1!

About the authors

Sai Maddali is a Senior Manager Product Management at AWS who leads the product team for Amazon MSK. He is passionate about understanding customer needs, and using technology to deliver services that empowers customers to build innovative applications. Besides work, he enjoys traveling, cooking, and running.

Bill Crew is a Senior Product Marketing Manager. He is the lead marketer for Streaming and Messaging Services at AWS. Including Amazon Managed Streaming for Apache Kafka (Amazon MSK), Amazon Managed Service for Apache Flink, Amazon Data Firehose, Amazon Kinesis Data Streams, Amazon Message Broker (Amazon MQ), Amazon Simple Queue Service (Amazon SQS), and Amazon Simple Notification Services (Amazon SNS). Besides work, he enjoys collecting vintage vinyl records.

How REA Group approaches Amazon MSK cluster capacity planning

2024-12-05 Eunice Aguilar

Post Syndicated from Eunice Aguilar original https://aws.amazon.com/blogs/big-data/how-rea-group-approaches-amazon-msk-cluster-capacity-planning/

This post was written by Eunice Aguilar and Francisco Rodera from REA Group.

Enterprises that need to share and access large amounts of data across multiple domains and services need to build a cloud infrastructure that scales as need changes. REA Group, a digital business that specializes in real estate property, solved this problem using Amazon Managed Streaming for Apache Kafka (Amazon MSK) and a data streaming platform called Hydro.

REA Group’s team of more than 3,000 people is guided by our purpose: to change the way the world experiences property. We help people with all aspects of their property experience—not just buying, selling, and renting—through the richest content, data and insights, valuation estimates, and home financing solutions. We deliver unparalleled value to our customers, Australia’s real estate agents, by providing access to the largest and most engaged audience of property seekers.

To achieve this, the different technical products within the company regularly need to move data across domains and services efficiently and reliably.

Within the Data Platform team, we have built a data streaming platform called Hydro to provide this capability across the whole organization. Hydro is powered by Amazon MSK and other tools with which teams can move, transform, and publish data at low latency using event-driven architectures. This type of structure is foundational at REA for building microservices and timely data processing for real-time and batch use cases like time-sensitive outbound messaging, personalization, and machine learning (ML).

In this post, we share our approach to MSK cluster capacity planning.

The problem

Hydro manages a large-scale Amazon MSK infrastructure by providing configuration abstractions, allowing users to focus on delivering value to REA without the cognitive overhead of infrastructure management. As the use of Hydro grows within REA, it’s crucial to perform capacity planning to meet user demands while maintaining optimal performance and cost-efficiency.

Hydro uses provisioned MSK clusters in development and production environments. In each environment, Hydro manages a single MSK cluster that hosts multiple tenants with differing workload requirements. Proper capacity planning makes sure the clusters can handle high traffic and provide all users with the desired level of service.

Real-time streaming is a relatively new technology at REA. Many users aren’t yet familiar with Apache Kafka, and accurately assessing their workload requirements can be challenging. As the custodians of the Hydro platform, it’s our responsibility to find a way to perform capacity planning to proactively assess the impact of the user workloads on our clusters.

Goals

Capacity planning involves determining the appropriate size and configuration of the cluster based on current and projected workloads, as well as considering factors such as data replication, network bandwidth, and storage capacity.

Without proper capacity planning, Hydro clusters can become overwhelmed by high traffic and fail to provide users with the desired level of service. Therefore, it’s very important to us to invest time and resources into capacity planning to make sure Hydro clusters can deliver the performance and availability that modern applications require.

The capacity planning approach we follow for Hydro covers three main areas:

The models used for the calculation of current and estimated future capacity needs, including the attributes used as variables in them
The models used to assess the approximate expected capacity required for a new Hydro workload joining the platform
The tooling available to operators and custodians to assess the historical and current capacity consumption of the platform and, based on them, the available headroom

The following diagram shows the interaction of capacity usage and the precalculated maximum usage.

Although we don’t have this capability yet, the goal is to take this approach one step further in the future and predict the approximate resource depletion time, as shown in the following diagram.

To make sure our digital operations are resilient and efficient, we must maintain a comprehensive observability of our current capacity usage. This detailed oversight allows us not only to understand the performance limits of our existing infrastructure, but also to identify potential bottlenecks before they impact our services and users.

By proactively setting and monitoring well-understood thresholds, we can receive timely alerts and take necessary scaling actions. This approach makes sure our infrastructure can meet demand spikes without compromising on performance, ultimately supporting a seamless user experience and maintaining the integrity of our system.

Solution overview

The MSK clusters in Hydro are configured with a PER_TOPIC_PER_BROKER level of monitoring, which provides metrics at the broker and topic levels. These metrics help us determine the attributes of the cluster usage effectively.

However, it wouldn’t be wise to display an excessive number of metrics on our monitoring dashboards because that could lead to less clarity and slower insights on the cluster. It’s more valuable to choose the most relevant metrics for capacity planning rather than displaying numerous metrics.

Cluster usage attributes

Based on the Amazon MSK best practices guidelines, we have identified several key attributes to assess the health of the MSK cluster. These attributes include the following:

In/out throughput
CPU usage
Disk space usage
Memory usage
Producer and consumer latency
Producer and consumer throttling

For more information on right-sizing your clusters, see Best practices for right-sizing your Apache Kafka clusters to optimize performance and cost, Best practices for Standard brokers, Monitor CPU usage, Monitor disk space, and Monitor Apache Kafka memory.

The following table contains the detailed list of all the attributes we use for MSK cluster capacity planning in Hydro.

Attribute Name	Attribute Type	Units	Comments
Bytes in	Throughput	Bytes per second	Relies on the aggregate Amazon EC2 network, Amazon EBS network, and Amazon EBS storage throughput
Bytes out	Throughput	Bytes per second	Relies on the aggregate Amazon EC2 network, Amazon EBS network, and Amazon EBS storage throughput
Consumer latency	Latency	Milliseconds	High or unacceptable latency values usually indicate user experience degradation before reaching actual resource (for example, CPU and memory) depletion
CPU usage	Capacity limits	% CPU user + CPU system	Should stay under 60%
Disk space usage	Persistent storage	Bytes	Should stay under 85%
Memory usage	Capacity limits	% Memory in use	Should stay under 60%
Producer latency	Latency	Milliseconds	High or unacceptable sustained latency values usually indicate user experience degradation before reaching actual capacity limits or actual resource (for example, CPU or memory) depletion
Throttling	Capacity limits	Milliseconds, bytes, or messages	High or unacceptable sustained throttling values indicate capacity limits are being reached before actual resource (for example, CPU or memory) depletion

By monitoring these attributes, we can quickly evaluate the performance of the clusters as we add more workloads to the platform. We then match these attributes to the relevant MSK metrics available.

Cluster capacity limits

During the initial capacity planning, our MSK clusters weren’t receiving enough traffic to provide us with a clear idea of their capacity limits. To address this, we used the AWS performance testing framework for Apache Kafka to evaluate the theoretical performance limits. We conducted performance and capacity tests on the test MSK clusters that had the same cluster configurations as our development and production clusters. We obtained a more comprehensive understanding of the cluster’s performance by conducting these various test scenarios. The following figure shows an example of a test cluster’s performance metrics.

To perform the tests within a specific time frame and budget, we focused on the test scenarios that could efficiently measure the cluster’s capacity. For instance, we conducted tests that involved sending high-throughput traffic to the cluster and creating topics with many partitions.

After every test, we collected the metrics of the test cluster and extracted the maximum values of the key cluster usage attributes. We then consolidated the results and determined the most appropriate limits of each attribute. The following screenshot shows an example of the exported test cluster’s performance metrics.

Capacity monitoring dashboards

As part of our platform management process, we conduct monthly operational reviews to maintain optimal performance. This involves analyzing an automated operational report that covers all the systems on the platform. During the review, we evaluate the service level objectives (SLOs) based on select service level indicators (SLIs) and assess the monitoring alerts triggered from the previous month. By doing so, we can identify any issues and take corrective actions.

To assist us in conducting the operational reviews and to provide us with an overview of the cluster’s usage, we developed a capacity monitoring dashboard, as shown in the following screenshot, for each environment. We built the dashboard as infrastructure as code (IaC) using the AWS Cloud Development Kit (AWS CDK). The dashboard is generated and managed automatically as a component of the platform infrastructure, along with the MSK cluster.

By defining the maximum capacity limits of the MSK cluster in a configuration file, the limits are automatically loaded into the capacity dashboard as annotations in the Amazon CloudWatch graph widgets. The capacity limits annotations are clearly visible and provide us with a view of the cluster’s capacity headroom based on usage.

We determined the capacity limits for throughput, latency, and throttling through the performance testing. Capacity limits of the other metrics, such as CPU, disk space, and memory, are based on the Amazon MSK best practices guidelines.

During the operational reviews, we proactively assess the capacity monitoring dashboards to determine if more capacity needs to be added to the cluster. This approach allows us to identify and address potential performance issues before they have a significant impact on user workloads. It’s a preventative measure rather than a reactive response to a performance degradation.

Preemptive CloudWatch alarms

We have implemented preemptive CloudWatch alarms in addition to the capacity monitoring dashboards. These alarms are configured to alert us before a specific capacity metric reaches its threshold, notifying us when the sustained value reaches 80% of the capacity limit. This method of monitoring enables us to take immediate action instead of waiting for our monthly review cadence.

Value added by our capacity planning approach

As operators of the Hydro platform, our approach to capacity planning has provided a consistent way to assess how far we are from the theoretical capacity limits of all our clusters, regardless of their configuration. Our capacity monitoring dashboards are a key observability instrument that we review on a regular basis; they’re also useful while troubleshooting performance issues. They help us quickly tell if capacity constraints could be a potential root cause of any ongoing issues. This means that we can use our current capacity planning approach and tooling both proactively or reactively, depending on the situation and need.

Another benefit of this approach is that we calculate the theoretical maximum usage values that a given cluster with a specific configuration can withstand from a separate cluster without impacting any actual users of the platform. We spin up short-lived MSK clusters through our AWS CDK based automation and perform capacity tests on them. We do this quite often to assess the impact, if any, that changes made to the cluster’s configurations have on the known capacity limits. According to our current feedback loop, if these newly calculated limits change from the previously known ones, they are used to automatically update our capacity dashboards and alarms in CloudWatch.

Future evolution

Hydro is a platform that is constantly improving with the introduction of new features. One of these features includes the ability to conveniently create Kafka client applications. To meet the increasing demand, it’s essential to stay ahead of capacity planning. Although the approach discussed here has served us well so far, it’s by no means the final stage , and there are capabilities that we need to extend and areas we need to improve on.

Multi-cluster architecture

To support critical workloads, we’re considering using a multi-cluster architecture using Amazon MSK, which would also affect our capacity planning. In the future, we plan to profile workloads based on metadata, cross-check them with capacity metrics, and place them in the appropriate MSK cluster. In addition to the existing provisioned MSK clusters, we will evaluate how the Amazon MSK Serverless cluster type can complement our platform architecture.

Usage trends

We have added CloudWatch anomaly detection graphs to our capacity monitoring dashboards to track any unusual trends. However, because the CloudWatch anomaly detection algorithm only evaluates up to 2 weeks of metric data, we will reassess its usefulness as we onboard more workloads. Aside from identifying usage trends, we will explore options to implement an algorithm with predictive capabilities to detect when MSK cluster resources degrade and deplete.

Conclusion

Initial capacity planning lays a solid foundation for future improvements and provides a safe onboarding process for workloads. To achieve optimal performance of our platform, we must make sure that our capacity planning strategy evolves in line with the platform’s growth. As a result, we maintain a close collaboration with AWS to continually develop additional features that meet our business needs and are in sync with the Amazon MSK roadmap. This makes sure we stay ahead of the curve and can deliver the best possible experience to our users.

We recommend all Amazon MSK users not miss out on maximizing their cluster’s potential and to start planning their capacity. Implementing the strategies listed in this post is a great first step and will lead to smoother operations and significant savings in the long run.

About the Authors

Eunice Aguilar is a Staff Data Engineer at REA. She has worked in software engineering in various industries throughout the years and recently for property data. She’s also an advocate for women interested in transitioning into tech, along with the well-versed who she takes inspiration from.

Francisco Rodera is a Staff Systems Engineer at REA. He has extensive experience building and operating large-scale distributed systems. His interests are automation, observability, and applying SRE practices to business-critical services and platforms.

Khizer Naeem is a Technical Account Manager at AWS. He specializes in Efficient Compute and has a deep passion for Linux and open-source technologies, which he leverages to help enterprise customers modernize and optimize their cloud workloads.

Introducing Provisioned Mode for Kafka Event Source Mappings with AWS Lambda

2024-11-26 Chris McPeek

Post Syndicated from Chris McPeek original https://aws.amazon.com/blogs/compute/introducing-provisioned-mode-for-kafka-event-source-mappings-with-aws-lambda/

This post is written by Tarun Rai Madan, Principal Product Manager, Serverless Compute and Rajesh Kumar Pandey, Principal Software Engineer, Serverless Compute

AWS is announcing the general availability of Provisioned Mode for AWS Lambda Event Source Mappings (ESMs) that subscribe to Apache Kafka event sources including Amazon MSK and self-managed Kafka. Provisioned Mode allows you to optimize the throughput of your Kafka ESM by provisioning event polling resources that remain ready to handle sudden spikes in traffic. Controlling the throughput of your ESM helps you build highly responsive and scalable event-driven Kafka applications with stringent performance requirements.

Overview

When you build modern applications using Event-Driven Architectures (EDAs), your event producers publish events, which are then processed by event source connectors like an ESM, and routed to serverless compute consumers like Lambda functions. Apache Kafka is a popular open-source platform for building real-time streaming data applications using Lambda functions as consumers. AWS Lambda’s fully-managed MSK ESM or self-managed Kafka ESM reads events from Kafka as an event source, performs operations like filtering and batching, and invokes Lambda functions. Both ESMs offer built-in integrations with event sources, auto-scaling, and features like batching and filtering. When a Kafka ESM is created, Lambda ESM allocates one event poller to start polling for messages in a Kafka topic. The ESM then evaluates the message backlog – using the OffsetLag metric – for all partitions in the topic, and auto-scales event pollers to process messages efficiently.

Many real-time applications using Kafka are sensitive to sudden spikes in traffic, which could lead to noticeable delays in your end users’ experience. Previously, there were no controls to optimize the throughput for performance-sensitive workloads when using Kafka ESMs. This forced you to explore alternative solutions for workloads with strict performance requirements, which added architectural complexity. To harness the power of Lambda for such performance-sensitive applications, you need to be able to control your Kafka ESM’s throughput and ensure responsive auto-scaling behavior.

What’s new

Provisioned Mode for ESM is a feature that helps you control the throughput of your ESM, and achieve an enhanced performance profile for performance-sensitive applications, particularly ones that see sudden spikes in traffic. You can use Provisioned Mode for Kafka ESM with a range of Kafka or Kafka-compatible streaming data platform providers like Amazon MSK, Confluent, Redpanda, and self-managed Kafka. Key benefits include:

Controls to optimize throughput: You can now fine-tune the throughput of your ESM by configuring a minimum and maximum number of resources called “event pollers”. An event poller (or a “poller”) represents a compute resource that underpins an ESM in the Provisioned Mode, and allocates up to 5 MB/s throughput.
Responsive auto-scaling: With Provisioned Mode, your Kafka ESM detects the increase in OffsetLag metric for all partitions in your Kafka topic, and auto-scales event pollers in a responsive manner. During idle periods, your ESM automatically scales down to the minimum event pollers set by you.
Simplified networking experience and charges: Previously, you were required to configure AWS PrivateLink or NAT Gateway to enable Lambda to poll messages from Kafka clusters in your VPC and invoke Lambda functions. With Provisioned Mode, you are no longer required to configure PrivateLink or NAT Gateway. This approach reduces overhead and improves the developer experience, allowing you to focus on building applications rather than managing networking setup. Consequently, you are not charged for PrivateLink VPC endpoints when using Kafka as an event source with Lambda in the Provisioned Mode for ESM, which reduces your networking charges.

Activating Provisioned Mode for ESM

To activate Provisioned Mode for a new or existing Kafka ESM, you can configure the minimum event pollers, the maximum event pollers, or both for your ESM. The allowed values range from 1 to 200 for minimum event pollers, and from 1 to 2000 for maximum event pollers.

Note that you must configure at least one of minimum or maximum event pollers to activate Provisioned Mode. When you configure only the minimum number of event pollers (‘Min-only’), your ESM allocates this minimum quantity and can dynamically scale up to a maximum. This maximum is determined by the OffsetLag and is limited by either the number of partitions or the default maximum event pollers, whichever is lower. When you configure only the maximum number of event pollers (“Max-only”), your ESM starts with one minimum poller by default, and can scale up to the maximum event pollers or number of partitions, whichever is lower. When you configure both the minimum and maximum number of event pollers (“Min and Max”), your ESM can auto-scale between this range of minimum and maximum event pollers configured.

Activating using AWS CLI

You can activate Provisioned Mode for ESM during creation of a new ESM, or by updating an existing ESM. Specify the –provisioned-poller-config parameter.

aws lambda create-event-source-mapping \
    --region <region-name> \
    --function-name <function-name> \
    --event-source-arn <event-source-arn> \
    --provisioned-poller-config '{"MinimumPollers":<number>, "MaximumPollers":<number>}'

Activating using AWS Lambda Console

Select Configure provisioned mode to activate Provisioned Mode when creating a new ESM, or updating an existing one.

Figure 1: Activating Provisioned Mode for ESM in Console

Provisioned Mode for Kafka ESM in action

To see the performance profile with Provisioned Mode for Kafka ESM, deploy a Lambda function that subscribes to an Amazon MSK topic. Use the reference pattern on Serverless Land and see this blog post outlining steps to configure MSK ESM for a Lambda function. In this case, a producer writes 20 million messages, each with 1KB payload size to an MSK topic – distributed evenly across 100 partitions. Use a batch size of 100, with function duration at 100ms, and set the StartingPosition to TRIM_HORIZON to process from the beginning of the stream.

Note the baseline performance profile observed with the default On-Demand mode. Then analyze two configurations with the Provisioned Mode activated.

Scenario 1 uses different configurations for minimum event pollers
Scenario 2 uses the default minimum event pollers and lets Lambda manage the event pollers through autoscaling.

Baseline performance profile for Kafka ESM On-demand

With Provisioned Mode disabled, Lambda takes approximately 20 minutes to drain the backlog of 20 million messages. It takes 4 minutes to reach the maximum concurrent executions. Use this result as a baseline to compare against Provisioned Mode for ESM.

Figure 2: Baseline performance without Provisioned Mode for ESM

Scenario 1: Configuring minimum event pollers, and auto-scaling

To optimize the ESM throughput for this workload and reduce the time to drain the message backlog, configure the minimum event pollers. Select values of 10 and 100 for minimum event pollers, and observe the results.

Configuring 10 minimum event pollers

Lambda drains the backlog of 20 million messages in approximately 11 minutes with minimum pollers set to 10. This is 45% faster than the baseline without Provisioned Mode. It takes approximately 6 minutes to reach maximum concurrent executions.

Figure 3: Performance profile with minimum event pollers set to 10

Configuring 100 minimum event pollers

To further improve the processing performance, configure the minimum event pollers to 100. Lambda now takes 6 minutes to drain the backlog of 20 million messages, which is 70% faster than the baseline. It instantly reaches the maximum concurrent executions.

Figure 4: Performance profile with minimum event pollers set to 100

Scenario 2: Default minimum event pollers, and auto-scaling

In some cases, the workload may not be as performance-sensitive. With the same volume of 20M messages in your Kafka topic, activate Provisioned Mode for ESM. Start with the default minimum event pollers (set to 1) and let Lambda auto-scale the event pollers based on incoming traffic.

Lambda automatically scales up your event pollers to process the incoming messages, and scales them down as the backlog is cleared. With the default minimum and maximum event pollers, Lambda takes approximately 12 minutes to clear the backlog of 20 million messages, which is 40% faster than the baseline. Lambda takes 7 minutes to reach maximum concurrent executions.

Figure 5: Performance profile with minimum event pollers set to 1

The following table summarizes the performance improvement for the analyzed workload using Provisioned Mode for ESM.

ESM Mode	Time to drain message backlog	Percentage improvement
On-demand Mode	20 minutes	Baseline
Provisioned Mode: Scenario 1 (fine-tuned minimum event pollers)
Minimum event pollers = 10	11 minutes	45%
Minimum event pollers = 100	6 minutes	70%
Provisioned Mode: Scenario 2 (default minimum event pollers)
Minimum event pollers = 1	12 minutes	40%

Table: Performance profile for reference test case before and after activating Provisioned Mode for ESM

Observability and Pricing

You can observe the usage of event pollers by monitoring the ProvisionedPollers Amazon CloudWatch metric, which measures the number of event pollers that actively processed at least one event in the last 5-minute window.

Pricing is based on the provisioned minimum event pollers and the number of event pollers consumed during automatic scaling. Provisioned Mode introduces a billing unit called Event Poller Unit (EPU). Each EPU supports up to 20 MB/s of throughput for event polling. The number of event pollers allocated on an EPU depends on the throughput consumed by each event poller. You pay for the number of EPUs used and the duration they run for, measured in Event Poller Unit hours. For details, refer to AWS Lambda pricing.

Best practices and considerations

The optimal configuration of minimum and maximum event pollers for your Kafka Event Source Mapping (ESM) depends on your application’s performance requirements. Start with the default minimum event pollers to baseline the performance profile, and adjust event pollers based on observed message processing patterns and your application’s performance requirements. For workloads with spiky traffic and strict performance needs, increase the minimum event pollers to handle sudden surges. You can fine-tune the minimum required event pollers by evaluating your desired throughput, your observed throughput – which depends on factors like the ingested messages per second and average payload size, and using the throughput capacity of one event poller (up to 5 MB/s) as reference. Note that to maintain ordered processing within a partition, Lambda caps the maximum event pollers at the number of partitions in the topic.

Update your network settings to remove PrivateLink VPC endpoints and associated permissions for existing ESMs when you activate Provisioned Mode.

Conclusion

Provisioned Mode for Lambda ESM allows you to fine-tune the throughput for your Kafka ESMs by configuring a minimum and maximum number of event pollers. This provides a responsive auto-scaling behavior for Kafka applications that have stringent performance requirements and see unpredictable and spiky traffic. You can fine-tune your configured event pollers based on your requirements and monitor usage via CloudWatch metrics. Provisioned Mode also simplifies network configuration by removing the requirement to configure PrivateLink.

For more serverless learning resources, visit Serverless Land.

The serverless attendee’s guide to AWS re:Invent 2024

2024-11-19 Julian Wood

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/the-serverless-attendees-guide-to-aws-reinvent-2024/

AWS re:Invent 2024 offers an extensive selection of serverless and application integration content.

AWS re:Invent Banner

For detailed descriptions and schedule, visit the AWS re:Invent Session Catalog.

Join AWS serverless experts and community members at the AWS Modern Apps and Open Source Zone in the AWS Expo Village. This serves as a hub for serverless discussions at re:Invent. While you are there, enjoy a free coffee and learn about serverless architectures at the Serverlesspresso booth. There are two this year, another one at the Certificate Lounge. The AWS Expo Village also includes Serverless and Serverless Containers booths.

Don’t have a ticket yet? Join us in Las Vegas from November 28-December 2, 2022 by registering for re:Invent 2024.

This guide organizes the sessions into categories to help you find the content this is most relevant to you.

Session Types

Breakout Sessions are lecture-style presentations covering architecture, best practices, and deep dives into AWS services.
Workshops are 2-hour hands-on sessions where you work through tasks in AWS accounts using AWS services. Laptops are required and AWS credits are provided.
Chalk Talks are highly interactive 60-minute sessions with smaller audiences, focused on technical deep dives with whiteboards for architectural discussions.
Builders’ Sessions are 60-minute small-group sessions led by an AWS expert who guides you through a technical problem using AWS services.
Code Talks are 60-minute live coding sessions where AWS experts show how to build solutions using AWS services.

Leadership session: Nick Coult, Usman Khalid, Kathleen deValk

SVS211: Celebrating 10 years of pioneering serverless and containers – Breakout.
- Explore how serverless has evolved to help organizations drive the highest performance, availability, and security at low costs.

Getting started sessions

Are you new to serverless or taking your first steps? Hear from AWS experts and customers on best practices and strategies for building serverless workloads. Get hands-on with services by attending a workshop or builders session. Create the next great “to do” app or add a new customer experience for a theme park.

SVS202: Thinking serverless – Chalk Talk
- Learn how to approach building solutions with a serverless mindset by breaking down business problems into serverless building blocks.
SVS205: Building a serverless web application for a theme park – Workshop
- Learn how to build a complete serverless web application for a theme park called Innovator Island.
SVS201: Getting started with serverless patterns – Workshop
- Learn how to recognize and apply common serverless patterns by building production-ready code for a serverless application.
SVS204: Write less code: Building applications with a serverless mindset – Builders Session
- Get more value by using built-in integrations between AWS services through configuration rather than writing glue code.
SVS207: Effectively model costs for your serverless applications – Chalk Talk
- Gain insights into modeling the cost of serverless applications on AWS by considering request loads, payload sizes, and service pricing.
API201: The AWS Step Functions workshop – Workshop
- Learn about the features of AWS Step Functions through hands-on interactive modules.
API204: Building event-driven architectures – Workshop
- Learn about the basics of event-driven design using examples involving Amazon SNS, Amazon SQS, AWS Lambda, Amazon EventBridge, and more.
API205: Unlock the power of an exceptional serverless developer experience – Code Talk
- Learn how to accelerate your serverless development with AWS tools, including Amazon Q Developer integrated into IDEs.
SEG209: Getting started building serverless SaaS architectures
- Discover how to build your first serverless application, and learn how to handle multi-tenant architectures for SaaS applications.

Understanding serverless architectures

SVS208: Balance consistency and developer freedom with platform engineering – Breakout
- Learn how platform teams can provide opinionated security, cost, observability, reliability, and sustainability patterns while maintaining developer flexibility.
SVS209: Containers or serverless functions: A path for cloud-native success – Breakout
- Explore the fundamental differences between containers and serverless functions through real-world scenarios and insights into choosing the right approach.
OPN301: Level up your serverless applications with Powertools for AWS Lambda – Workshop
- Learn why Powertools for AWS Lambda can be the developer toolkit of choice for serverless workloads.
DEV341: From single to multi-tenant: Scaling a mission-critical serverless app
- Explore how to transition a mission-critical application from a single-tenant to a multi-tenant architecture
DEV337: Zero to production serverless in 8 weeks
- Hear about a real-world project journey, from concept to production in only eight weeks. Expect practical insights, mistakes, tips, and how using the right technologies and development process can deliver results fast.

Building event-driven applications

API204: Building event-driven architectures – Workshop
- Learn about the basics of event-driven design using examples involving Amazon SNS, Amazon SQS, AWS Lambda, Amazon EventBridge, and more.
API206: How event-driven architectures can go wrong and how to fix them – Chalk Talk
- Explore common event-driven pitfalls including YOLO events, god events, observability soup, event loops, and surprise bills.
DEV321: Choosing the right serverless compute services
- Learn when to use AWS serverless compute services like AWS Lambda and Amazon ECS on AWS Fargate and how to integrate them into your application architectures.
API307: Event-driven architectures at scale: Manage millions of events – Breakout
- Discover proven patterns for building high-scale event-driven systems that can be effectively managed across a distributed organization with Amazon EventBridge.
SVS206: Building an event sourcing system using AWS serverless technologies – Chalk Talk
- Explore strategies for building effective event sourcing architectures using AWS serverless technologies to store application state as an append-only event log.
COP408: Coding for serverless observability
- Join this code talk to learn best practices for collecting signals from your serverless applications. Dive deep into techniques to effectively instrument your applications to provide you with optimal observability.

Incorporating orchestration

API201: The AWS Step Functions workshop – Workshop
- Learn about the features of AWS Step Functions through hands-on interactive modules.
API203: Building common orchestrated workflows with AWS Step Functions – Builders Session
- Build three orchestrated workflows, including streamlined data processing with Distributed Map state, external system integration using callback, and implementing the saga pattern.
API207: Optimize data processing with built-in AWS Step Functions features – Chalk Talk
- Learn to optimize your serverless data processing workflows at scale using AWS Step Functions features, including intrinsic functions and Distributed Map state.
API402: Building advanced workflows with AWS Step Functions – Breakout
- Learn how you can use generative AI to generate state machines automatically from textual descriptions and chat with your workflow to optimize it.

Understanding integration patterns

API208: Building an integration strategy for the future – Breakout
- Boost productivity and create better customer experiences by building a modern integration strategy using AWS application, data, and file integration services.
API306: Integration patterns for distributed systems – Breakout
- Learn about common design trade-offs for distributed systems and how to navigate them with design patterns, illustrated with real-world examples.
API311: Application integration for platform builders – Breakout
- Explore the implementation of application integration using serverless components in enterprise environments.

Building APIs and frontends

SVS203: Create your first API from scratch with OpenAPI and Amazon API Gateway – Builders Session
- Learn how to design and provision complete APIs using infrastructure as code following the OpenAPI specification.
API303: Building modern API architectures: Which front door should I use? – Chalk Talk
- Explore options for building modern APIs including REST, GraphQL, and real-time APIs along with their benefits and drawbacks.
API304: Building rate-limited solutions on AWS – Chalk Talk
- Learn some of the best ways to build rate limiting into your systems for improved reliability.
API305: Asynchronous frontends: Building seamless event-driven experiences – Breakout
- Explore patterns to enable asynchronous, event-driven integrations with the frontend designed for architects and frontend, backend, and full-stack engineers.

Diving deep into advanced topics

SVS401: Best practices for serverless developers – Breakout
- Discover architectural best practices, optimizations, and useful shortcuts for building production-ready serverless workloads.
SVS403: From serverful to serverless Java – Workshop
- Learn how to bring your traditional Java Spring application to AWS Lambda with minimal effort and iteratively apply optimizations.
SVS406: Scale streaming workloads with AWS Lambda – Chalk Talk
- Learn how to implement parallel processing techniques for ordered and unordered use cases to address throughput limitations in streaming data processing.

Processing data

SVS404: Building serverless distributed data processing workloads – Workshop
- Learn how serverless technologies like AWS Step Functions and AWS Lambda can help you simplify management and scaling of distributed data processing.
API401: Multi-tenant Amazon SQS queues: Mitigating noisy neighbors – Chalk Talk
- Explore advanced strategies for managing multi-tenant Amazon SQS queues and effective mitigation techniques, including shuffle sharding and overflow queues.
SVS321: AWS Lambda and Apache Kafka for real-time data processing applications – Breakout
- Gain practical insights into building scalable, serverless data processing applications by integrating AWS Lambda with Apache Kafka.

Incorporating generative AI

API209: Generative AI at scale: Serverless workflows for enterprise-ready apps – Workshop
- Learn to build enterprise-ready, scalable generative AI applications that can scale from serving 100 to 100,000 users.
API310: Build a meeting summarization solution with generative AI & serverless – Code Talk
- See live coding of a serverless application for producing meeting summaries with generative AI using Amazon Transcribe and Amazon Bedrock, orchestrated with AWS Step Functions.
SVS319: Unlock the power of generative AI with AWS Serverless – Breakout
- Learn to harness AWS Serverless to build robust, cost-effective generative AI applications. Explore using AWS Step Functions to orchestrate complex AI workflows.
SVS325: Secure access to enterprise generative AI with serverless AI gateway – Chalk Talk
- Explore how to architect a serverless AI gateway on AWS to securely integrate and consume large language models from multiple providers.

Additional resources

For social activities see the Unofficial list of AWS re:Invent Conference and Vendor Parties.

If you are attending re:Invent, connect at our AWS Modern Apps and Open Source Zone in the AWS Expo Village. The AWS Expo Village also includes Serverless and Serverless Containers booths.

If you can not join us in-person, breakout sessions will be available via our YouTube channel after the event.

We look forward to seeing you at re:Invent 2024! For more serverless learning resources, visit Serverless Land.

AWS Weekly Roundup: 20 years of AWS News Blog, Express brokers for Amazon MSK, Windows Server 2025 images on EC2, and more (Nov 11, 2024)

2024-11-11 Channy Yun (윤석찬)

Post Syndicated from Channy Yun (윤석찬) original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-20-years-of-aws-news-blog-express-brokers-for-amazon-msk-windows-server-2025-images-on-ec2-and-more-nov-11-2024/

Happy 20th Anniversary of the AWS News Blog! On November 9, 2004, Jeff Barr published his first blog post. At the time, he started a personal blog site using TypePad. He wanted to speak to his readers with his personal voice, not the company or team.

On April 29, 2014, we created a new AWS blog site and migrated all posts to that page. There are currently over 4,300 posts on the AWS News Blog, with Jeff contributing over 3,200 of them.

Since December 2016, the AWS News Blog has added new writers, but we are still following Jeff’s leadership principals for AWS News Bloggers in accordance with Day One. What’s unique about the AWS News Blog is that the blog writers get to use the features of the product team in advance, following the Customer Obsession leadership principle, and focus on walk-throughs of how customers can quickly use them to save time, with the Frugality principle.

I am very grateful for Jeff’s fundamental and pivotal role over the past 20 years, and I look forward to the next 20 years!

Last week’s launches
Here are some launches that got my attention:

New Express brokers for Amazon MSK – Express brokers are a new broker type for Amazon MSK Provisioned designed to deliver up to three times more throughput per broker, scale up to 20 times faster, and reduce recovery time by 90 percent as compared to standard Apache Kafka brokers. Express brokers come preconfigured with Kafka best practices by default, support all Kafka APIs, and provide the same low-latency performance, so you can continue using existing client applications without any changes.

New Amazon Kinesis Client Library 3.0 – You can now reduce compute costs to process streaming data by up to 33 percent with Kinesis Client Library (KCL) 3.0, compared to previous KCL versions. KCL 3.0 introduces an enhanced load balancing algorithm that continuously monitors resource utilization of the stream processing workers and automatically redistributes the load from overutilized workers to other underutilized workers. To learn more, read the AWS Big Data Blog post.

Microsoft Windows Server 2025 images on Amazon EC2 – We now support Microsoft Windows Server 2025 with License Included (LI) Amazon Machine Images (AMIs), providing customers with an easy and flexible way to launch the latest version of Windows Server. By running Windows Server 2025 on Amazon EC2, customers can take advantage of the security, performance, and reliability of AWS with the latest Windows Server features. To learn more about running Windows Server 2025 on Amazon EC2, visit Windows Workloads on AWS.

Anthropic’s Claude 3.5 Haiku model in Amazon Bedrock – Claude 3.5 Haiku is the next generation of Anthropic’s fastest model, combining rapid response times with improved reasoning capabilities, making it ideal for tasks that require both speed and intelligence. Claude 3.5 Haiku improves across every skill set and surpasses even Claude 3 Opus, the largest model in Anthropic’s previous generation, on many intelligence benchmarks—including coding. To learn more, read the AWS News Blog post.

Amazon Bedrock Prompt Management GA – You can simplify the creation, testing, versioning, and sharing of prompts in Amazon Bedrock Prompt Management. At general availability, we added new features that provide enhanced options for configuring your prompts and enabling seamless integration for invoking them in your generative AI applications, such as structured prompts and Converse and InvokeModel API integration. To learn more, read the AWS Machine Learning blog post.

Six new synthetic generative voices for Amazon Polly – The generative engine is Amazon Polly’s most advanced text-to-speech (TTS) model leveraging the generative AI technology. We added six new synthetic female-sounding generative voices: Ayanda (South African English), Léa (French), Lucia (European Spanish), Lupe (American Spanish), Mía (Mexican Spanish), and Vicki (German). This extends thirteen voices and nine locales to provide you with more options of highly expressive and engaging voices.

Amazon OpenSearch Service Extended Support – We announce the end of Standard Support and Extended Support timelines for legacy Elasticsearch versions and OpenSearch Versions. Standard Support ends on Nov 7, 2025, for legacy Elasticsearch versions up to 6.7, Elasticsearch versions 7.1 through 7.8, OpenSearch versions from 1.0 through 1.2, and OpenSearch versions 2.3 through 2.9. With Extended Support, for an incremental flat fee over regular instance pricing, you continue to get critical security updates beyond the end of Standard Support. To learn more, read the AWS Big Data Blog post.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS news
Here are some additional news items that you might find interesting:

CEO’s visiting at AWS data center – Matt Garman, CEO of AWS, had a great time visiting one of our AWS data centers recently, and was able to get a look at the continuous innovation delivered by the team. Of course, it’s no surprise that Amazon’s senior executives visit fulfillment centers, contact centers, and data centers, to do real work for customers. AWS data centers are designed for customers in every aspect, for maximum resilience, performance, and energy efficiency.

AWS supports small businesses, creates jobs, sets up sustainability initiatives, and develops educational programs near AWS data centers. Get the latest updates – AWS in your community: Here’s what’s happening near data centers across the US on About Amazon News.

Amazon Q Business at Amazon – I introduced an Amazon story to use Code transformation in Amazon Q Developer to migrate more than old 30,000 Java applications to Java 17 version. It saved over 4,500 developer years of effort compared to previous manual jobs and saved the company $260 million in annual by moving to the latest Java version.

Here is another dogfooding story of Amazon Q Business at Amazon. Amazon built an internal chatbot with Amazon Q Business and it has resolved over 1 million internal Amazon developer questions, reducing time spent churning on manual technical investigations by more than 450,000 hours.

Our team onboarded Amazon Q Business with millions of internal documents and integrated Q Business into the tools our team use every day. Now, instead of waiting hours for responses to complex technical questions on Q&A boards or Slack channels, developers can get answers in seconds.

TOURCast at PGA TOUR – If you enjoy golf, this news will be of interest to you. The PGA TOUR debuted TOURCast in Japan at the 2024 ZOZO Championship to capture and disseminate better statistical data and bring fans closer to the game based on new scoring system called ShotLink, powered by CDW. This marks the first time the PGA TOUR has been able to bring this technology to Asia, leveraging the flexibility and scalability of AWS to overcome unique challenges.

PGA TOUR volunteer setting up GPS equipment on the fairway at ZOZO championship that will input specific shot data and feed back to Shotlink Select Plus. [IMAGE: PGA TOUR]

They’ve completely rebuilt their scoring system over the past two years on a new cloud stack. With AWS cloud, whether data comes from high-tech radar systems, cameras, or manual input, the system processes it all seamlessly.

Upcoming AWS events
Check your calendars and sign up for these AWS events:

AWS GenAI Lofts – AWS GenAI Lofts are about more than just the tech, they bring together startups, developers, investors, and industry experts. Whether you’re looking to gain deep insights, or get your questions answered by generative AI pros, our GenAI Lofts have you covered, and provide everything you need to start building your next innovation. Join events in São Paulo (through November 20), and Paris (through November 25).

AWS Community Days – Join community-led conferences that feature technical discussions, workshops, and hands-on labs led by expert AWS users and industry leaders from around the world: Jakarta, Indonesia (November 23), Kochi, India (December 14).

AWS re:Invent – You can still register for the annual learning event, taking place December 2–6 in Las Vegas. Surprisingly Andy Jassy, CEO of Amazon said he will come back and participate in AWS re:Invent this year. He said “As always, the priority is to make this a learning event so customers can take nuggets back and change their own customer experiences and businesses. We’ll also have a bunch of goodies for you that we’ll announce and that we think folks will like.” Let’s meet there!

You can browse all upcoming in-person and virtual events.

That’s all for this week. Check back next Monday for another Weekly Roundup!

— Channy

This post is part of our Weekly Roundup series. Check back each week for a quick roundup of interesting news and announcements from AWS!

Introducing Express brokers for Amazon MSK to deliver high throughput and faster scaling for your Kafka clusters

2024-11-07 Channy Yun (윤석찬)

Post Syndicated from Channy Yun (윤석찬) original https://aws.amazon.com/blogs/aws/introducing-express-brokers-for-amazon-msk-to-deliver-high-throughput-and-faster-scaling-for-your-kafka-clusters/

Today, we’re announcing the general availability of Express brokers, a new broker type for Amazon Managed Streaming for Apache Kafka (Amazon MSK). It’s designed to deliver up to three times more throughput per-broker, scale up to 20 times faster, and reduce recovery time by 90 percent as compared to Standard brokers running Apache Kafka. Express brokers come preconfigured with Kafka best practices by default, support Kafka APIs, and provide the same low latency performance that Amazon MSK customers expect, so they can continue using existing client applications without any changes.

Express brokers provide improved compute and storage elasticity for Kafka applications when using Amazon MSK provisioned clusters. Amazon MSK is a fully-managed AWS service that makes it easier for you to build and run highly available and scalable applications based on Apache Kafka.

Let’s dive deeper into some of the key features that Express brokers have and the benefits they provide:

Easier operations with hands-free storage management – Express brokers offer unlimited storage without preprovisioning, eliminating disk-related bottlenecks. Cluster sizing is simpler, requiring only ingress and egress throughput divided by recommended per-broker throughput. This removes the need for proactive disk capacity monitoring and scaling, simplifying cluster management and improving resilience by eliminating a potential failure source.
Fewer brokers with up to three times throughput per broker – Higher throughput per broker allows for smaller clusters for the same workload. Standard brokers’ throughput must account for client traffic and background operations, with m7g.16xl Standard brokers safely handling 154 MBps ingress. Express brokers use opinionated settings and resource isolation, enabling m7g.16xl size instances to safely manage up to 500 MBps ingress without compromising performance or availability during cluster events.
Higher utilization with 20 times faster scaling – Express brokers reduce data movement during scaling, making them up to 20 times faster than Standard brokers. This allows for more quicker and reliable cluster resizing. You can monitor each broker’s ingress throughput capacity and add brokers within minutes, eliminating the need for over-provisioning in anticipation of traffic spikes.
Higher resilience with 90 percent faster recovery – Express brokers are designed for mission-critical applications requiring high resilience. They come preconfigured with best-practice defaults, including 3-way replication (RF=3), which reduce failures due to misconfiguration. Express brokers also recover 90 percent faster from transient failures compared to standard Apache Kafka brokers. Express brokers’ rebalancing and recovery use minimal cluster resources, simplifying capacity planning. This eliminates the risk of increased resource utilization and the need for continuous monitoring when right-sizing clusters.

You have choice options in Amazon MSK depending on your workload and preference:

	MSK provisioned		MSK Serverless
	Standard brokers	Express brokers	MSK Serverless
Configuration range	Most flexible	Flexible	Least flexible
Cluster rebalancing	Customer managed	Customer managed but up to 20x faster	MSK managed
Capacity management	Yes	Yes (compute only)	No
Storage management	Yes	No	No

Express brokers lower costs, provide higher resiliency, and lower operational overhead, making them the best choice for all Kafka workloads. If you prefer to use Kafka without managing any aspect of its capacity, its configuration, or how it scales, then you can choose Amazon MSK Serverless. This provides a fully abstracted Apache Kafka experience that eliminates the need for any infrastructure management, scales automatically, and charges you on a pay-per-use consumption model that doesn’t require you to optimize resource utilization.

Getting started with Express brokers in Amazon MSK
To get started with Express brokers, you can use the Sizing and Pricing worksheet that Amazon MSK provides. This worksheet helps you estimate the cluster size you’ll need to accommodate your workload and also gives you a rough estimate of the total monthly cost you’ll incur.

The throughput requirements of your workload are the primary factor in the size of your cluster. You should also consider other factors, such as partition and connection count to arrive at the size and number of brokers you’ll need for your cluster. For example, if your streaming application needs 30 MBps of data ingress (write) and 80 MBps data egress (read) capacity, you can use three express.m7g.large brokers to meet your throughput needs (assuming the partition count for your workload is within the maximum number of partitions that Amazon MSK recommends for an m7g.large instance).

The following table shows the recommended maximum ingress, egress, and partition counts per instance size for sustainable and safe operations. You can learn more about these recommendations in the Best practices section of Amazon MSK Developer Guide.

Instance size	Ingress (MBps)	Egress (MBps)
`express.m7g.large`	15.6	31.2
`express.m7g.4xlarge`	124.9	249.8
`express.m7g.16xlarge`	500.0	1000.0

Once you have decided the number and size of Express brokers you’ll need for your workload, go to the AWS Management Console or use the CreateCluster API to create an Amazon MSK provisioned cluster.

When you create a new cluster on the Amazon MSK console, in the Broker type option, choose Express brokers and then select the mount of compute capacity that you want to provision for the broker. As you can see in the screen shot, you can use Apache Kafka 3.6.0 version and Graviton-based instances for Express brokers. You don’t need to preprovision storage for Express brokers.

You can also customize some of these configurations to further fine-tune the performance of your clusters according to your own preferences. To learn more, visit Express broker configurations in the Amazon MSK developer guide.

To create an MSK cluster in the AWS Command Line Interface (AWS CLI), use the create-cluster command.

aws kafka create-cluster \
    --cluster-name "channy-express-cluster" \
    --kafka-version "3.6.0" \
    --number-of-broker-nodes 3 \
    --broker-node-group-info file://brokernodegroupinfo.json

A JSON file named brokernodegroupinfo.json specifies the three subnets over which you want Amazon MSK to distribute the broker nodes.

{
    "InstanceType": "express.m7g.large",
    "BrokerAZDistribution": "DEFAULT",
    "ClientSubnets": [
        "subnet-0123456789111abcd",
        "subnet-0123456789222abcd",
        "subnet-0123456789333abcd"
    ]
}

Once the cluster is created, you can use the bootstrap connection string to connect your clients to the cluster endpoints.

With Express brokers, you can scale vertically (changing instance size) or horizontally (adding brokers). Vertical scaling doubles throughput without requiring partition reassignment. Horizontal scaling adds brokers in sets of three and and allows you to create more partitions, but it requires partition reassignment for new brokers to serve traffic.

A major benefit of Express brokers is that you can add brokers and rebalance partitions within minutes. On the other hand, rebalancing partitions after adding Standard brokers can take several hours. The graph below shows the time it took to rebalance partitions after adding 3 Express brokers to a cluster and reassigning 2000 partitions to each of the new brokers.

As you can see, it took approximately 10 minutes to reassign these partitions and utilize the additional capacity of the new brokers. When we ran the same experiment on an equivalent cluster comprising of Standard brokers, partition reassignment took over 24hours.

To learn more about the partition reassignment, visit Expanding your cluster in the Apache Kafka documentation.

Things to know
Here are some things you should know about Express brokers:

Data migration – You can migrate the data in your existing Kafka or MSK cluster to a cluster composed of Express brokers using Amazon MSK Replicator, which copies both the data and the metadata of your cluster to a new cluster.
Monitoring – You can monitor your cluster composed of Express brokers in the cluster and at the broker level with Amazon CloudWatch metrics and enable open monitoring with Prometheus to expose metrics using the JMX Exporter and the Node Exporter.
Security – Just like with other broker types, Amazon MSK integrates with AWS Key Management Service (AWS KMS) to offer transparent server-side encryption for the storage in Express brokers. When you create an MSK cluster with Express brokers, you can specify the AWS KMS key that you want Amazon MSK to use to encrypt your data at rest. If you don’t specify a KMS key, Amazon MSK creates an AWS managed key for you and uses it on your behalf.

Now available
The Express broker type is available today in the US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Europe (Frankfurt), and Europe (Ireland), and Europe (Stockholm) Regions.

You pay an hourly rate for Apache Kafka broker instance usage (billed at one-second resolution) for Express brokers, with varying fees depending on the size of the broker instance and active brokers in your MSK clusters. You also pay a per-GB rate for data written to an Express broker (billed at per-byte resolution). To learn more, visit the Amazon MSK pricing page.

Give Express brokers for Amazon MSK a try in the Amazon MSK console. For more information, visit the Amazon MSK Developer Guide and send feedback to AWS re:Post for Amazon MSK or through your usual AWS support contacts.

— Channy

Build up-to-date generative AI applications with real-time vector embedding blueprints for Amazon MSK

2024-11-07 Francisco Morillo

Post Syndicated from Francisco Morillo original https://aws.amazon.com/blogs/big-data/build-up-to-date-generative-ai-applications-with-real-time-vector-embedding-blueprints-for-amazon-msk/

Businesses today heavily rely on advanced technology to boost customer engagement and streamline operations. Generative AI, particularly through the use of large language models (LLMs), has become a focal point for creating intelligent applications that deliver personalized experiences. However, static pre-trained models often struggle to provide accurate and up-to-date responses without real-time data.

To help address this, we’re introducing a real-time vector embedding blueprint, which simplifies building real-time AI applications by automatically generating vector embeddings using Amazon Bedrock from streaming data in Amazon Managed Streaming for Apache Kafka (Amazon MSK) and indexing them in Amazon OpenSearch Service.

In this post, we discuss the importance of real-time data for generative AI applications, typical architectural patterns for building Retrieval Augmented Generation (RAG) capabilities, and how to use real-time vector embedding blueprints for Amazon MSK to simplify your RAG architecture. We cover the key components required to ingest streaming data, generate vector embeddings, and store them in a vector database. This will enable RAG capabilities for your generative AI models.

The importance of real-time data with generative AI

The potential applications of generative AI extend well beyond chatbots, encompassing various scenarios such as content generation, personalized marketing, and data analysis. For example, businesses can use generative AI for sentiment analysis of customer reviews, transforming vast amounts of feedback into actionable insights. In a world where businesses continuously generate data—from Internet of Things (IoT) devices to application logs—the ability to process this data swiftly and accurately is paramount.

Traditional large language models (LLMs) are trained on vast datasets but are often limited by their reliance on static information. As a result, they can generate outdated or irrelevant responses, leading to user frustration. This limitation highlights the importance of integrating real-time data streams into AI applications. Generative AI applications need contextually rich, up-to-date information to make sure they provide accurate, reliable, and meaningful responses to end users. Without access to the latest data, these models risk delivering suboptimal outputs that fail to meet user needs. Using real-time data streams is crucial for powering next-generation generative AI applications.

Retrieval Augmented Generation

Retrieval Augmented Generation (RAG) is the process of optimizing the output of an LLM so it references an authoritative knowledge base outside of its training data sources before generating a response. LLMs are trained on vast volumes of data and use billions of parameters to generate original output for tasks such as answering questions, translating languages, and completing sentences. RAG extends the already powerful capabilities of LLMs to specific domains or an organization’s internal knowledge base, all without the need to retrain the model. It’s a cost-effective approach to improving LLM output so it remains relevant, accurate, and useful in various contexts.

At the core of RAG is the ability to fetch the most relevant information from a continuously updated vector database. Vector embeddings are numerical representations that capture the relationships and meanings of words, sentences, and other data types. They enable more nuanced and effective semantic searches than traditional keyword-based systems. By converting data into vector embeddings, organizations can build robust retrieval mechanisms that enhance the output of LLMs.

At the time of writing, many processes for creating and managing vector embeddings occur in batch mode. This approach can lead to stale data in the vector database, diminishing the effectiveness of RAG applications and the responses that AI applications generate. A streaming engine capable of invoking embedding models and writing directly to a vector database can help maintain an up-to-date RAG vector database. This helps make sure generative AI models can fetch the more relevant information in real time, providing timely and more contextually accurate outputs.

Solution overview

To build an efficient real-time generative AI application, we can divide the flow of the application into two main parts:

Data ingestion – This involves ingesting data from streaming sources, converting it to vector embeddings, and storing them in a vector database
Insights retrieval – This involves invoking an LLM with user queries to retrieve insights, employing the RAG technique

Data ingestion

The following diagram outlines the data ingestion flow.

The workflow includes the following steps:

The application processes feeds from streaming sources such as social media platforms, Amazon Kinesis Data Streams, or Amazon MSK.
The incoming data is converted to vector embeddings in real time.
The vector embeddings are stored in a vector database for subsequent retrieval.

Data is ingested from a streaming source (for example, social media feeds) and processed using an Amazon Managed Service for Apache Flink application. Apache Flink is an open source stream processing framework that provides powerful streaming capabilities, enabling real-time processing, stateful computations, fault tolerance, high throughput, and low latency. It processes the streaming data, performs deduplication, and invokes an embedding model to create vector embeddings.

After the text data is converted into vectors, these embeddings are persisted in an OpenSearch Service domain, serving as a vector database. Unlike traditional relational databases, where data is organized in rows and columns, vector databases represent data points as vectors with a fixed number of dimensions. These vectors are clustered based on similarity, allowing for efficient retrieval.

OpenSearch Service offers scalable and efficient similarity search capabilities tailored for handling large volumes of dense vector data. With features like approximate k-Nearest Neighbor (k-NN) search algorithms, dense vector support, and robust monitoring through Amazon CloudWatch, OpenSearch Service alleviates the operational overhead of managing infrastructure. This makes it a suitable solution for applications requiring fast and accurate similarity-based retrieval tasks using vector embeddings.

Insights retrieval

The following diagram illustrates the flow from the user side, where the user submits a query through the frontend and receives a response from the LLM model using the retrieved vector database documents as context.

The workflow includes the following steps:

A user submits a text query.
The text query is converted into vector embeddings using the same model used for data ingestion.
The vector embeddings are used to perform a semantic search in the vector database, retrieving related vectors and associated text.
The retrieved information, along with any previous conversation history, and the user prompt are compiled into a single prompt for the LLM.
The LLM is invoked to generate a response based on the enriched prompt.

This process helps make sure the generative AI application can use the most up-to-date context when responding to user queries, providing relevant and timely insights.

Real-time vector embedding blueprints for generative applications

To facilitate the adoption of real-time generative AI applications, we are excited to introduce real-time vector embedding blueprints. This new blueprint includes a Managed Service for Apache Flink application that receives events from an MSK cluster, processes the events, and calls Amazon Bedrock using your embedding model of choice, while storing the vectors in an OpenSearch Service cluster. This new blueprint simplifies the data ingestion piece of the architecture with a low-code approach to integrate MSK streams with OpenSearch Service and Amazon Bedrock.

Implement the solution

To use real-time data from Amazon MSK as an input for generative AI applications, you need to set up several components:

An MSK stream to provide the real-time data source
An Amazon Bedrock vector embedding model to generate embeddings from the data
An OpenSearch Service vector data store to store the generated embeddings
An application to orchestrate the data flow between these components

The real-time vector embedding blueprint packages all these components into a preconfigured solution that’s straightforward to deploy. This blueprint will generate embeddings for your real-time data, store the embeddings in an OpenSearch Service vector index, and make the data available for your generative AI applications to query and process. You can access this blueprint using either the Managed Service for Apache Flink or Amazon MSK console. To get started with this blueprint, complete the following steps:

Use an existing MSK cluster or create a new one.
Choose your preferred Amazon Bedrock embedding model and make sure you have access to the model.
Choose an existing OpenSearch Service vector index to store all embeddings or create a new vector index.
Choose Deploy blueprint.

After the Managed Service for Apache Flink blueprint is up and running, all real-time data is automatically vectorized and available for generative AI applications to process.

For the detailed setup steps, see real-time vector embedding blueprint documentation

If you want to include additional data processing steps before the creation of vector embeddings, you can use the GitHub source code for this blueprint.

The real-time vector embedding blueprint reduces the time required and the level of expertise needed to set up this data integration, so you can focus on building and improving your generative AI application.

Conclusion

By integrating streaming data ingestion, vector embeddings, and RAG techniques, organizations can enhance the capabilities of their generative AI applications. Using Amazon MSK, Managed Service for Apache Flink, and Amazon Bedrock provides a solid foundation for building applications that deliver real-time insights. The introduction of the real-time vector embedding blueprint further simplifies the development process, allowing teams to focus on innovation rather than writing custom code for integration. With just a few clicks, you can configure the blueprint to continuously generate vector embeddings using Amazon Bedrock embedding models, then index those embeddings in OpenSearch Service for your MSK data streams. This allows you to combine the context from real-time data with the powerful LLMs on Amazon Bedrock to generate accurate, up-to-date AI responses without writing custom code. You can also improve the efficiency of data retrieval using built-in support for data chunking techniques from LangChain, an open source library, supporting high-quality inputs for model ingestion.

As businesses continue to generate vast amounts of data, the ability to process this information in real time will be a crucial differentiator in today’s competitive landscape. Embracing this technology allows organizations to stay agile, responsive, and innovative, ultimately driving better customer engagement and operational efficiency. Real-time vector embedding blueprint is generally available in the US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Paris), Europe (London), Europe (Ireland) and South America (Sao Paulo) AWS Regions. Visit the Amazon MSK documentation for the list of additional Regions, which will be supported over the next few weeks.

About the authors

Anusha Dasarakothapalli is a Principal Software Engineer for Amazon Managed Streaming for Apache Kafka (Amazon MSK) at AWS. She started her software engineering career with Amazon in 2015 and worked on products such as S3-Glacier and S3 Glacier Deep Archive, before transitioning to MSK in 2022. Her primary areas of focus lie in streaming technology, distributed systems, and storage.

Shakhi Hali is a Principal Product Manager for Amazon Managed Streaming for Apache Kafka (Amazon MSK) at AWS. She is passionate about helping customers generate business value from real-time data. Before joining MSK, Shakhi was a PM with Amazon S3. In her free time, Shakhi enjoys traveling, cooking, and spending time with family.

Digish Reshamwala is a Software Development Manager for Amazon Managed Streaming for Apache Kafka (Amazon MSK) at AWS. He started his career with Amazon in 2022 and worked on product such as AWS Fargate, before transitioning to MSK in 2024. Before joining AWS, Digish worked at NortonLifelLock and Symantec in engineering roles. He holds an MS degree from University of Southern California. His primary areas of focus lie in streaming technology and distributed computing.

Developer guidance on how to do local testing with Amazon MSK Serverless

2024-09-11 Simon Peyer

Post Syndicated from Simon Peyer original https://aws.amazon.com/blogs/big-data/developer-guidance-on-how-to-do-local-testing-with-amazon-msk-serverless/

Amazon Managed Streaming for Apache Kafka (Amazon MSK) is a fully managed service that makes it easy to build and run Kafka clusters on Amazon Web Services (AWS). When working with Amazon MSK, developers are interested in accessing the service locally. This allows developers to test their application with a Kafka cluster that has the same configuration as production and provides an identical infrastructure to the actual environment without needing to run Kafka locally.

An Amazon MSK Serverless private DNS endpoint is only accessible from Amazon Virtual Private Cloud (Amazon VPC) connections that have been configured to connect. It isn’t directly resolvable from your local development environment. One option is to use AWS Direct Connect or AWS VPN to be able to Connect to Amazon MSK Serverless from your on-premises network. However, building such a solution may incur cost and complexity, and it needs to be set up by a platform team.

This post presents a practical approach to accessing your Amazon MSK environment for development purposes through a bastion host using a Secure Shell (SSH) tunnel (a commonly used secure connection method). Whether you’re working with Amazon MSK Serverless, where public access is unavailable, or with provisioned MSK clusters that are intentionally kept private, this post guides you through the steps to establish a secure connection and seamlessly integrate your local development environment with your MSK resources.

Solution overview

The solution allows you to directly connect to the Amazon MSK Serverless service from your local development environment without using Direct Connect or a VPN. The service is accessed with the bootstrap server DNS endpoint boot-<<xxxxxx>>.c<<x>>.kafka-serverless.<<region-name>>.amazonaws.com on port 9098, then routed through an SSH tunnel to a bastion host, which connects to the MSK Serverless cluster. In the next step, let’s explore how to set up this connection.

The flow of the solution is as follows:

The Kafka client sends a request to connect to the bootstrap server
The DNS query for your MSK Serverless endpoint is routed to a locally configured DNS server
The locally configured DNS server routes the DNS query to localhost.
The SSH tunnel forwards all the traffic on port 9098 from the localhost to the MSK Serverless server through the Amazon Elastic Compute Cloud (Amazon EC2) bastion host.

The following image shows the architecture diagram.

Prerequisites

Before deploying the solution, you need to have the following resources deployed in your account:

An MSK Serverless cluster configured with AWS Identity and Access Management (IAM) authentication.
A bastion host instance with network access to the MSK Serverless cluster and SSH public key authentication.
AWS CLI configured with an IAM user and able to read and create topics on Amazon MSK. Use the IAM policy from Step 2: Create an IAM role in the Getting started using MSK Serverless clusters
For Windows users, install Linux on Windows with Windows Subsystem for Linux 2 (WSL 2) using Ubuntu 24.04. For guidance, refer to How to install Linux on Windows with WSL.

This guide assumes an MSK Serverless deployment in us-east-1, but it can be used in every AWS Region where MSK Serverless is available. Furthermore, we are using OS X as operating system. In the following steps replace msk-endpoint-url with your MSK Serverless endpoint URL with IAM authentication. The MSK endpoint URL has a format like boot-<<xxxxxx>>.c<<x>>.kafka-serverless.<<region-name>>.amazonaws.com.

Solution walkthrough

To access your Amazon MSK environment for development purposes, use the following walkthrough.

Configure local DNS server OSX

Install Dnsmasq as a local DNS server and configure the resolver to resolve the Amazon MSK. The solution uses Dnsmasq because it can compare DNS requests against a database of patterns and use these to determine the correct response. This functionality can match any request that ends in kafka-serverless.us-east-1.amazonaws.com and send 127.0.0.1 in response. Follow these steps to install Dnsmasq:

Update brew and install Dnsmasq using brew
```
brew up
brew install dnsmasq
```
Start the Dnsmasq service
```
sudo brew services start dnsmasq
```

Reroute all traffic for Serverless MSK (kafka-serverless.us-east-1.amazonaws.com) to 127.0.0.1

echo address=/kafka-serverless.us-east-1.amazonaws.com/127.0.0.1 >> $(brew --prefix)/etc/dnsmasq.conf

Reload Dnsmasq configuration and clear cache

sudo launchctl unload /Library/LaunchDaemons/homebrew.mxcl.dnsmasq.plist
sudo launchctl load /Library/LaunchDaemons/homebrew.mxcl.dnsmasq.plist
dscacheutil -flushcache

Configure OS X resolver

Now that you have a working DNS server, you can configure your operating system to use it. Configure the server to send only .kafka-serverless.us-east-1.amazonaws.com queries to Dnsmasq. Most operating systems that are similar to UNIX have a configuration file called /etc/resolv.conf that controls the way DNS queries are performed, including the default server to use for DNS queries. Use the following steps to configure the OS X resolver:

OS X also allows you to configure additional resolvers by creating configuration files in the /etc/resolver/ This directory probably won’t exist on your system, so your first step should be to create it:
```
sudo mkdir -p /etc/resolver
```
Create a new file with the same name as your new top-level domain (kafka-serverless.us-east-1.amazonaws.com) in the /etc/resolver/ directory and add 127.0.0.1 as a nameserver to it by entering the following command.
```
sudo tee /etc/resolver/kafka-serverless.us-east-1.amazonaws.com >/dev/null <<EOF
nameserver 127.0.0.1
EOF
```

Configure local DNS server Windows

In Windows Subsystem for Linux, first install Dnsmasq, then configure the resolver to resolve the Amazon MSK and finally add localhost as the first nameserver.

Update apt and install Dnsmasq using apt. Install the telnet utility for later tests:
```
sudo apt update
sudo apt install dnsmasq
sudo apt install telnet
```

Reroute all traffic for Serverless MSK (kafka-serverless.us-east-1.amazonaws.com) to 127.0.0.1.

echo "address=/kafka-serverless.us-east-1.amazonaws.com/127.0.0.1" | sudo tee -a /etc/dnsmasq.conf

Reload Dnsmasq configuration and clear cache.
```
sudo /etc/init.d/dnsmasq restart
```
Open /etc/resolv.conf and add the following code in the first line.
```
nameserver 127.0.0.1
```
The output should look like the following code.
```
#Some comments
nameserver 127.0.0.1
nameserver <<your_nameservers>>
..
```

Create SSH tunnel

The next step is to create the SSH tunnel, which will allow any connections made to localhost:9098 on your local machine to be forwarded over the SSH tunnel to the target Kafka broker. Use the following steps to create the SSH tunnel:

Replace bastion-host-dns-endpoint with the public DNS endpoint of the bastion host, which comes in the style of <<xyz>>.compute-1.amazonaws.com, and replace ec2-key-pair.pem with the key pair of the bastion host. Then create the SSH tunnel by entering the following command.
```
ssh -i "~/<<ec2-key-pair.pem>>" ec2-user@<<bastion-host-dns-endpoint>> -L 127.0.0.1:9098:<<msk-endpoint-url>>:9098
```
Leave the SSH tunnel running and open a new terminal window.

Test the connection to the Amazon MSK server by entering the following command.

telnet <<msk-endpoint-url>> 9098

The output should look like the following example.

Trying 127.0.0.1...
Connected to boot-<<xxxxxxxx>>.c<<x>>.kafka-serverless.us-east-1.amazonaws.com.
Escape character is '^]'.

Testing

Now configure the Kafka client to use IAM Authentication and then test the setup. You find the latest Kafka installation at the Apache Kafka Download site. Then unzip and copy the content of the Dafka folder into ~/kafka.

Download the IAM authentication and unpack it

cd ~/kafka/libs
wget https://github.com/aws/aws-msk-iam-auth/releases/download/v2.2.0/aws-msk-iam-auth-2.2.0-all.jar
cd ~

Configure Kafka properties to use IAM as the authentication mechanism

cat <<EOF > ~/kafka/config/client-config.properties

# Sets up TLS for encryption and SASL for authN.

security.protocol = SASL_SSL

# Identifies the SASL mechanism to use.

sasl.mechanism = AWS_MSK_IAM

# Binds SASL client implementation.

sasl.jaas.config = software.amazon.msk.auth.iam.IAMLoginModule required;


# Encapsulates constructing a SigV4 signature based on extracted credentials.

# The SASL client bound by "sasl.jaas.config" invokes this class.

sasl.client.callback.handler.class = software.amazon.msk.auth.iam.IAMClientCallbackHandler

EOF

Enter the following command in ~/kafka/bin to create an example topic. Make sure that the SSH tunnel created in the previous section is still open and running.

./kafka-topics.sh --bootstrap-server <<msk-endpoint-url>>:9098 --command-config ~/kafka/config/client-config.properties --create --topic ExampleTopic --partitions 10 --replication-factor 3 --config retention.ms=3600000

Cleanup

To remove the solution, complete the following steps for Mac users:

Delete the file /etc/resolver/kafka-serverless.us-east-1.amazonaws.com
Delete the entry address=/kafka-serverless.us-east-1.amazonaws.com/127.0.0.1 in the file $(brew --prefix)/etc/dnsmasq.conf
Stop the Dnsmasq service sudo brew services stop dnsmasq
Remove the Dnsmasq service sudo brew uninstall dnsmasq

To remove the solution, complete the following steps for WSL users:

Delete the file /etc/dnsmasq.conf
Delete the entry nameserver 127.0.0.1 in the file /etc/resolv.conf
Remove the Dnsmasq service sudo apt remove dnsmasq
Remove the telnet utility sudo apt remove telnet

Conclusion

In this post, I presented you with guidance on how developers can connect to Amazon MSK Serverless from local environments. The connection is done using an Amazon MSK endpoint through an SSH tunnel and a bastion host. This enables developers to experiment and test locally, without needing to setup a separate Kafka cluster.

About the Author

Simon Peyer is a Solutions Architect at Amazon Web Services (AWS) based in Switzerland. He is a practical doer and passionate about connecting technology and people using AWS Cloud services. A special focus for him is data streaming and automations. Besides work, Simon enjoys his family, the outdoors, and hiking in the mountains.

Publish and enrich real-time financial data feeds using Amazon MSK and Amazon Managed Service for Apache Flink

2024-09-09 Rana Dutt

Post Syndicated from Rana Dutt original https://aws.amazon.com/blogs/big-data/publish-and-enrich-real-time-financial-data-feeds-using-amazon-msk-and-amazon-managed-service-for-apache-flink/

Financial data feeds are real-time streams of stock quotes, commodity prices, options trades, or other real-time financial data. Companies involved with capital markets such as hedge funds, investment banks, and brokerages use these feeds to inform investment decisions.

Financial data feed providers are increasingly being asked by their customers to deliver the feed directly to them through the AWS Cloud. That’s because their customers already have infrastructure on AWS to store and process the data and want to consume it with minimal effort and latency. In addition, the AWS Cloud’s cost-effectiveness enables even small and mid-size companies to become financial data providers. They can deliver and monetize data feeds that they have enriched with their own valuable information.

An enriched data feed can combine data from multiple sources, including financial news feeds, to add information such as stock splits, corporate mergers, volume alerts, and moving average crossovers to a basic feed.

In this post, we demonstrate how you can publish an enriched real-time data feed on AWS using Amazon Managed Streaming for Kafka (Amazon MSK) and Amazon Managed Service for Apache Flink. You can apply this architecture pattern to various use cases within the capital markets industry; we discuss some of those use cases in this post.

Apache Kafka is a high-throughput, low-latency distributed event streaming platform. Financial exchanges such as Nasdaq and NYSE are increasingly turning to Kafka to deliver their data feeds because of its exceptional capabilities in handling high-volume, high-velocity data streams.

Amazon MSK is a fully managed service that makes it easy for you to build and run applications on AWS that use Kafka to process streaming data.

Apache Flink is an opensource distributed processing engine, offering powerful programming interfaces for both stream and batch processing, with first-class support for stateful processing, event time semantics, checkpointing, snapshots and rollback. Apache Flink supports multiple programming languages, Java, Python, Scala, SQL, and multiple APIs with different level of abstraction, which can be used interchangeably in the same application.

Amazon Managed Service for Apache Flink is a fully managed, serverless experience in running Apache Flink applications. Customers can easily build real time Flink applications using any of Flink’s languages and APIs.

In this post, we use a real-time stock quotes feed from financial data provider Alpaca and add an indicator when the price moves above or below a certain threshold. The code provided in the GitHub repo allows you to deploy the solution to your AWS account. This solution was built by AWS Partner NETSOL Technologies.

Solution overview

In this solution, we deploy an Apache Flink application that enriches the raw data feed, an MSK cluster that contains the messages streams for both the raw and enriched feeds, and an Amazon OpenSearch Service cluster that acts as a persistent data store for querying the data. In a separate virtual private cloud (VPC) that acts as the customer’s VPC, we also deploy an Amazon EC2 instance running a Kafka client that consumes the enriched data feed. The following diagram illustrates this architecture.

Figure 1 – Solution architecture

The following is a step-by-step breakdown of the solution:

The EC2 instance in your VPC is running a Python application that fetches stock quotes from your data provider through an API. In this case, we use Alpaca’s API.
The application sends these quotes using Kafka client library to your kafka topic on MSK cluster. The kafka topic stores the raw quotes.
The Apache Flink application takes the Kafka message stream and enriches it by adding an indicator whenever the stock price rises or declines 5% or more from the previous business day’s closing price.
The Apache Flink application then sends the enriched data to a separate Kafka topic on your MSK cluster.
The Apache Flink application also sends the enriched data stream to Amazon OpenSearch using a Flink connector for OpenSearch. Amazon Opensearch stores the data, and OpenSearch Dashboards allows applications to query the data at any point in the future.
Your customer is running a Kafka consumer application on an EC2 instance in a separate VPC in their own AWS account. This application uses AWS PrivateLink to consume the enriched data feed securely, in real time.
All Kafka user names and passwords are encrypted and stored in AWS Secrets Manager. The SASL/SCRAM authentication protocol used here makes sure all data to and from the MSK cluster is encrypted in transit. Amazon MSK encrypts all data at rest in the MSK cluster by default.

The deployment process consists of the following high-level steps:

Launch the Amazon MSK cluster, Apache Flink application, Amazon OpenSearch Service domain, and Kafka producer EC2 instance in the producer AWS account. This step usually completes within 45 minutes.
Set up multi-VPC connectivity and SASL/SCRAM authentication for the MSK cluster. This step can take up to 30 minutes.
Launch the VPC and Kafka consumer EC2 instance in the consumer account. This step takes about 10 minutes.

Prerequisites

To deploy this solution, complete the following prerequisite steps:

Create an AWS account if you don’t already have one and log in. We refer to this as the producer account.
Create an AWS Identity and Access Management (IAM) user with full admin permissions. For instructions, refer to Create an IAM user.
Sign out and sign back in to the AWS Management Console as this IAM admin user.
Create an EC2 key pair named my-ec2-keypair in the producer account. If you already have an EC2 key pair, you can skip this step.
Follow the instructions in ALPACA_README to sign up for a free Basic account at Alpaca to get your Alpaca API key and secret key. Alpaca will provide the real-time stock quotes for our input data feed.
Install the AWS Command Line Interface (AWS CLI) on your local development machine and create a profile for the admin user. For instructions, see Set up the AWS Command Line Interface (AWS CLI).
Install the latest version of the AWS Cloud Development Kit (AWS CDK) globally:

 npm install -g aws-cdk@latest

Deploy the Amazon MSK cluster

These steps create a new provider VPC and launch the Amazon MSK cluster there. You also deploy the Apache Flink application and launch a new EC2 instance to run the application that fetches the raw stock quotes.

On your development machine, clone the GitHub repo and install the Python packages:

git clone https://github.com/aws-samples/msk-powered-financial-data-feed.git
cd msk-powered-financial-data-feed
pip install -r requirements.txt

Set the following environment variables to specify your producer AWS account number and AWS Region:
```
export CDK_DEFAULT_ACCOUNT={your_AWS_account_no}
export CDK_DEFAULT_REGION=us-east-1
```

Run the following commands to create your config.py file:

echo "mskCrossAccountId = <Your producer AWS account ID>" > config.py
echo "producerEc2KeyPairName = '' " >> config.py
echo "consumerEc2KeyPairName = '' " >> config.py
echo "mskConsumerPwdParamStoreValue= '' " >> config.py
echo "mskClusterArn = '' " >> config.py

Run the following commands to create your alpaca.conf file:

echo [alpaca] > dataFeedMsk/alpaca.conf
echo ALPACA_API_KEY=your_api_key >> dataFeedMsk/alpaca.conf
echo ALPACA_SECRET_KEY=your_secret_key >> dataFeedMsk/alpaca.conf

Edit the alpaca.conf file and replace your_api_key and your_secret_key with your Alpaca API key.

Bootstrap the environment for the producer account:

cdk bootstrap aws://{your_AWS_account_no}/{your_aws_region}

Using your editor or integrated development environment (IDE), edit the config.py file:
1. Update the mskCrossAccountId parameter with your AWS producer account number.
2. If you have an existing EC2 key pair, update the producerEc2KeyPairName parameter with the name of your key pair.
View the dataFeedMsk/parameters.py file:
1. If you are deploying in a Region other than us-east-1, update the Availability Zone IDs az1 and az2 accordingly. For example, the Availability Zones for us-west-2 would us-west-2a and us-west-2b.
2. Make sure that the enableSaslScramClientAuth, enableClusterConfig, and enableClusterPolicy parameters in the parameters.py file are set to False.
Make sure you are in the directory where the app1.py file is located. Then deploy as follows:
```
cdk deploy --all --app "python app1.py" --profile {your_profile_name}
```
Check that you now have an Amazon Simple Storage Service (Amazon S3) bucket whose name starts with awsblog-dev-artifacts containing a folder with some Python scripts and the Apache Flink application JAR file.

Deploy multi-VPC connectivity and SASL/SCRAM

Complete the following steps to deploy multi-VPC connectivity and SASL/SCRAM authentication for the MSK cluster:

Set the enableSaslScramClientAuth, enableClusterConfig, and enableClusterPolicy parameters in the config.py file to True.
Make sure you’re in the directory where the config.py file is located and deploy the multi-VPC connectivity and SASL/SCRAM authentication for the MSK cluster:

cdk deploy --all --app "python app1.py" --profile {your_profile_name}

This step can take up to 30 minutes.

To check the results, navigate to your MSK cluster on the Amazon MSK console, and choose the Properties

You should see PrivateLink turned on, and SASL/SCRAM as the authentication type.

BDB-3696-multiVPC

Copy the MSK cluster ARN.
Edit your config.py file and enter the ARN as the value for the mskClusterArn parameter, then save the updated file.

Deploy the data feed consumer

Complete the steps in this section to create an EC2 instance in a new consumer account to run the Kafka consumer application. The application will connect to the MSK cluster through PrivateLink and SASL/SCRAM.

Navigate to Parameter Store, a capability of AWS Systems Manager, in your producer account.
Copy the value of the blogAws-dev-mskConsumerPwd-ssmParamStore parameter and update the mskConsumerPwdParamStoreValue parameter in the config.py file.
Check the value of the parameter named blogAws-dev-getAzIdsParamStore and make a note of these two values.
Create another AWS account for the Kafka consumer if you don’t already have one, and log in.
Create an IAM user with admin permissions.
Log out and log back in to the console using this IAM admin user.
Make sure you are in the same Region as the Region you used in the producer account. Then create a new EC2 key pair named, for example, my-ec2-consumer-keypair, in this consumer account.
Update the value of consumerEc2KeyPairName in your config.py file with the name of the key pair you just created.
Open the AWS Resource Access Manager (AWS RAM) console in your consumer account.
Compare the Availability Zone IDs from the Systems Manager parameter store with the Availability Zone IDs shown on the AWS RAM console.
Identify the corresponding Availability Zone names for the matching Availability Zone IDs.
Open the parameters.py file in the dataFeedMsk folder and insert these Availability Zone names into the variables crossAccountAz1 and crossAccountAz2. For example, in Parameter Store, if the values are “use1-az4” and “use1-az6”, then, when you switch to the consumer account’s AWS RAM console and compare, you may find that these values correspond to the Availability Zone names “us-east-1a” and “us-east-1b”. In that case, you need to update the parameters.py file with these Availability Zone names by setting crossAccountAz1 to “us-east-1a” and crossAccountAz2 to “us-east-1b”.
Set the following environment variables, specifying your consumer AWS account ID:

export CDK_DEFAULT_ACCOUNT={your_aws_account_id}
export CDK_DEFAULT_REGION=us-east-1

Bootstrap the consumer account environment. You need to add specific policies to the AWS CDK role in this case.

cdk bootstrap aws://{your_aws_account_id}/{your_aws_region} --cloudformation-execution-policies "arn:aws:iam::aws:policy/AmazonMSKFullAccess,arn:aws:iam::aws:policy/AdministratorAccess" –-profile <your-user-profile>

You now need to grant the consumer account access to the MSK cluster.

On the console, copy the consumer AWS account number to your clipboard.
Sign out and sign back in to your producer AWS account.
On the Amazon MSK console, navigate to your MSK cluster.
Choose Properties and scroll down to Security settings.
Choose Edit cluster policy and add the consumer account root to the Principal section as follows, then save the changes:
```
"Principal": {
    "AWS": ["arn:aws:iam::<producer-acct-no>:root", "arn:aws:iam::<consumer-acct-no>:root"]
},
```

Create the IAM role that needs to be attached to the EC2 consumer instance:

aws iam create-role --role-name awsblog-dev-app-consumerEc2Role --assume-role-policy-document file://dataFeedMsk/ec2ConsumerPolicy.json --profile <your-user-profile>

Deploy the consumer account infrastructure, including the VPC, consumer EC2 instance, security groups, and connectivity to the MSK cluster:
```
cdk deploy --all --app "python app2.py" --profile {your_profile_name}
```

Run the applications and view the data

Now that we have the infrastructure up, we can produce a raw stock quotes feed from the producer EC2 instance to the MSK cluster, enrich it using the Apache Flink application, and consume the enriched feed from the consumer application through PrivateLink. For this post, we use the Flink DataStream Java API for the stock data feed processing and enrichment. We also use Flink aggregations and windowing capabilities to identify insights in a certain time window.

Run the managed Flink application

Complete the following steps to run the managed Flink application:

In your producer account, open the Amazon Managed Service for Apache Flink console and navigate to your application.
To run the application, choose Run, select Run with latest snapshot, and choose Run.
When the application changes to the Running state, choose Open Apache Flink dashboard.

You should see your application under Running Jobs.

BDB-3696-FlinkDashboard

Run the Kafka producer application

Complete the following steps to run the Kafka producer application:

On the Amazon EC2 console, locate the IP address of the producer EC2 instance named awsblog-dev-app-kafkaProducerEC2Instance.

Connect to the instance using SSH and run the following commands:

sudo su
cd environment
source alpaca-script/bin/activate
python3 ec2-script-live.py AMZN NVDA

You need to start the script during market open hours. This will run the script that creates a connection to the Alpaca API. You should see lines of output showing that it is making the connection and subscribing to the given ticker symbols.

View the enriched data feed in OpenSearch Dashboards

Complete the following steps to create an index pattern to view the enriched data in your OpenSearch dashboard:

To find the master user name for OpenSearch, open the config.py file and locate the value assigned to the openSearchMasterUsername parameter.
Open Secrets Manager and click on awsblog-dev-app-openSearchSecrets secret to retrieve the password for OpenSearch.
Navigate to your OpenSearch console and find the URL to your OpenSearch dashboard by clicking on the domain name for your OpenSearch cluster. Click on the URL and sign in using your master user name and password.
In the OpenSearch navigation bar on the left, select Dashboards Management under the Management section.
Choose Index patterns, then choose Create index pattern.
Enter amzn* in the Index pattern name field to match the AMZN ticker, then choose Next step.
Select timestamp under Time field and choose Create index pattern.
Choose Discover in the OpenSearch Dashboards navigation pane.
With amzn selected on the index pattern dropdown, select the fields to view the enriched quotes data.

The indicator field has been added to the raw data by Amazon Managed Service for Apache Flink to indicate whether the current price direction is neutral, bullish, or bearish.

Run the Kafka consumer application

To run the consumer application to consume the data feed, you first need to get the multi-VPC brokers URL for the MSK cluster in the producer account.

On the Amazon MSK console, navigate to your MSK cluster and choose View client information.
Copy the value of the Private endpoint (multi-VPC).

SSH to your consumer EC2 instance and run the following commands:

sudo su
alias kafka-consumer=/kafka_2.13-3.5.1/bin/kafka-console-consumer.sh
kafka-consumer --bootstrap-server {$MULTI_VPC_BROKER_URL} --topic amznenhanced --from-beginning --consumer.config ./customer_sasl.properties

You should then see lines of output for the enriched data feed like the following:

{"symbol":"AMZN","close":194.64,"open":194.58,"low":194.58,"high":194.64,"volume":255.0,"timestamp":"2024-07-11 19:49:00","%change":-0.8784661217630548,"indicator":"Neutral"}
{"symbol":"AMZN","close":194.77,"open":194.615,"low":194.59,"high":194.78,"volume":1362.0,"timestamp":"2024-07-11 19:50:00","%change":-0.8122628778040887,"indicator":"Neutral"}
{"symbol":"AMZN","close":194.82,"open":194.79,"low":194.77,"high":194.82,"volume":1143.0,"timestamp":"2024-07-11 19:51:00","%change":-0.7868000916660381,"indicator":"Neutral"}

In the output above, no significant changes are happening to the stock prices, so the indicator shows “Neutral”. The Flink application determines the appropriate sentiment based on the stock price movement.

Additional financial services use cases

In this post, we demonstrated how to build a solution that enriches a raw stock quotes feed and identifies stock movement patterns using Amazon MSK and Amazon Managed Service for Apache Flink. Amazon Managed Service for Apache Flink offers various features such as snapshot, checkpointing, and a recently launched Rollback API. These features allow you to build resilient real-time streaming applications.

You can apply this approach to a variety of other use cases in the capital markets domain. In this section, we discuss other cases in which you can use the same architectural patterns.

Real-time data visualization

Using real-time feeds to create charts of stocks is the most common use case for real-time market data in the cloud. You can ingest raw stock prices from data providers or exchanges into an MSK topic and use Amazon Managed Service for Apache Flink to display the high price, low price, and volume over a period of time. This is known as aggregates and is the foundation for displaying candlestick bar graphs. You can also use Flink to determine stock price ranges over time.

BDB-3696-real-time-dv

Stock implied volatility

Implied volatility (IV) is a measure of the market’s expectation of how much a stock’s price is likely to fluctuate in the future. IV is forward-looking and derived from the current market price of an option. It is also used to price new options contracts and is sometimes referred to as the stock market’s fear gauge because it tends to spike higher during market stress or uncertainty. With Amazon Managed Service for Apache Flink, you can consume data from a securities feed that will provide current stock prices and combine this with an options feed that provides contract values and strike prices to calculate the implied volatility.

Technical indicator engine

Technical indicators are used to analyze stock price and volume behavior, provide trading signals, and identify market opportunities, which can help in the decision-making process of trading. Although implied volatility is a technical indicator, there are many other indicators. There can be simple indicators such as “Simple Moving Average” that represent a measure of trend in a specific stock price based on the average of price over a period of time. There are also more complex indicators such as Relative Strength Index (RSI) that measures the momentum of a stock’s price movement. RSI is a mathematical formula that uses the exponential moving average of upward movements and downward movements.

Market alert engine

Graphs and technical indicators aren’t the only tools that you can use to make investment decisions. Alternative data sources are important, such as ticker symbol changes, stock splits, dividend payments, and others. Investors also act on recent news about the company, its competitors, employees, and other potential company-related information. You can use the compute capacity provided by Amazon Managed Service for Apache Flink to ingest, filter, transform, and correlate the different data sources to the stock prices and create an alert engine that can recommend investment actions based on these alternate data sources. Examples can range from invoking an action if dividend prices increase or decrease to using generative artificial intelligence (AI) to summarize several correlated news items from different sources into a single alert about an event.

Market surveillance

Market surveillance is the monitoring and investigation of unfair or illegal trading practices in the stock markets to maintain fair and orderly markets. Both private companies and government agencies conduct market surveillance to uphold rules and protect investors.

You can use Amazon Managed Service for Apache Flink streaming analytics as a powerful surveillance tool. Streaming analytics can detect even subtle instances of market manipulation in real time. By integrating market data feeds with external data sources, such as company merger announcements, news feeds, and social media, streaming analytics can quickly identify potential attempts at market manipulation. This allows regulators to be alerted in real time, enabling them to take prompt action even before the manipulation can fully unfold.

Markets risk management

In fast-paced capital markets, end-of-day risk measurement is insufficient. Firms need real-time risk monitoring to stay competitive. Financial institutions can use Amazon Managed Service for Apache Flink to compute intraday value-at-risk (VaR) in real time. By ingesting market data and portfolio changes, Amazon Managed Service for Apache Flink provides a low-latency, high-performance solution for continuous VaR calculations.

This allows financial institutions to proactively manage risk by quickly identifying and mitigating intraday exposures, rather than reacting to past events. The ability to stream risk analytics empowers firms to optimize portfolios and stay resilient in volatile markets.

Clean up

It’s always a good practice to clean up all the resources you created as part of this post to avoid any additional cost. To clean up your resources, complete the following steps:

Delete the CloudFormation stacks from the consumer account.
Delete the CloudFormation stacks from the provider account.

Conclusion

In this post, we showed you how to provide a real-time financial data feed that can be consumed by your customers using Amazon MSK and Amazon Managed Service for Apache Flink. We used Amazon Managed Service for Apache Flink to enrich a raw data feed and deliver it to Amazon OpenSearch. Using this solution as a template, you can aggregate multiple source feeds, use Flink to calculate in real time any technical indicator, display data and volatility, or create an alert engine. You can add value for your customers by inserting additional financial information within your feed in real time.

We hope you found this post helpful and encourage you to try out this solution to solve interesting financial industry challenges.

About the Authors

Rana Dutt is a Principal Solutions Architect at Amazon Web Services. He has a background in architecting scalable software platforms for financial services, healthcare, and telecom companies, and is passionate about helping customers build on AWS.

Amar Surjit is a Senior Solutions Architect at Amazon Web Services (AWS), where he specializes in data analytics and streaming services. He advises AWS customers on architectural best practices, helping them design reliable, secure, efficient, and cost-effective real-time analytics data systems. Amar works closely with customers to create innovative cloud-based solutions that address their unique business challenges and accelerate their transformation journeys.

Diego Soares is a Principal Solutions Architect at AWS with over 20 years of experience in the IT industry. He has a background in infrastructure, security, and networking. Prior to joining AWS in 2021, Diego worked for Cisco, supporting financial services customers for over 15 years. He works with large financial institutions to help them achieve their business goals with AWS. Diego is passionate about how technology solves business challenges and provides beneficial outcomes by developing complex solution architectures.

AWS Glue mutual TLS authentication for Amazon MSK

2024-08-07 Edward Ondari

Post Syndicated from Edward Ondari original https://aws.amazon.com/blogs/big-data/aws-glue-mutual-tls-authentication-for-amazon-msk/

In today’s landscape, data streams continuously from countless sources such as social media interactions to Internet of Things (IoT) device readings. This torrent of real-time information presents both a challenge and an opportunity for businesses. To harness the power of this data effectively, organizations need robust systems for ingesting, processing, and analyzing streaming data at scale. Enter Apache Kafka: a distributed streaming platform that has revolutionized how companies handle real-time data pipelines and build responsive, event-driven applications. AWS Glue is used to process and analyze large volumes of real-time data and perform complex transformations on the streaming data from Apache Kafka.

Amazon Managed Streaming for Apache Kafka (Amazon MSK) is a fully managed Apache Kafka service. You can activate a combination of authentication modes on new or existing MSK clusters. The supported authentication modes are AWS Identity and Access Management (IAM) access control, mutual Transport Layer Security (TLS), and Simple Authentication and Security Layer/Salted Challenge Response Mechanism (SASL/SCRAM). For more information about using IAM authentication, refer to Securely process near-real-time data from Amazon MSK Serverless using an AWS Glue streaming ETL job with IAM authentication.

Mutual TLS authentication requires both the server and the client to present certificates to prove their identity. It’s ideal for hybrid applications that need a common authentication model. It’s also a commonly used authentication mechanism for business-to-business applications and is used in standards such as open banking, which enables secure open API integrations for financial institutions. For Amazon MSK, AWS Private Certificate Authority (AWS Private CA) is used to issue the X.509 certificates and for authenticating clients.

This post describes how to set up AWS Glue jobs to produce, consume, and process messages on an MSK cluster using mutual TLS authentication. AWS Glue will automatically infer the schema from the streaming data and store the metadata in the AWS Glue Data Catalog for analysis using analytics tools such as Amazon Athena.

Example use case

In our example use case, a hospital facility regularly monitors the body temperatures for patients admitted in the emergency ward using smart thermometers. Each device automatically records the patients’ temperature readings and posts the records to a central monitoring application API. Each posted record is a JSON formatted message that contains the deviceId that uniquely identifies the thermometer, a patientId to identify the patient, the patient’s temperature reading, and the eventTime when the temperature was recorded.

The central monitoring application checks the hourly average temperature readings for each patient and notifies the hospital’s healthcare workers when a patient’s average temperature exceeds accepted thresholds (36.1–37.2°C). In our case, we use the Athena console to analyze the readings.

Overview of the solution

In this post, we use an AWS Glue Python shell job to simulate incoming data from the hospital thermometers. This job produces messages that are securely written to an MSK cluster using mutual TLS authentication.

To process the streaming data from the MSK cluster, we deploy an AWS Glue Streaming extract, transform, and load (ETL) job. This job automatically infers the schema from the incoming data, stores the schema metadata in the Data Catalog, and then stores the processed data as efficient Parquet files in Amazon Simple Storage Service (Amazon S3). We use Athena to query the output table in the Data Catalog and uncover insights.

The following diagram illustrates the architecture of the solution.

Solution architecture

The solution workflow consists of the following steps:

Create a private certificate authority (CA) using AWS Certificate Manager (ACM).
Set up an MSK cluster with mutual TLS authentication.
Create a Java keystore (JKS) file and generate a client certificate and private key.
Create a Kafka connection in AWS Glue.
Create a Python shell job in AWS Glue to create a topic and push messages to Kafka.
Create an AWS Glue Streaming job to consume and process the messages.
Analyze the processed data in Athena.

Prerequisites

You should have the following prerequisites:

Access to AWS CloudShell or the AWS Command Line Interface (AWS CLI).
A VPC with a minimum of two subnets in two Availability Zones and a NAT gateway with a route to a public subnet. You can use the following AWS CloudFormation stack to set up the VPC:

This template creates two NAT gateways as shown in the following diagram. However, it’s possible to route the traffic to a single NAT gateway in one Availability Zone for test and development workloads. For redundancy in production workloads, it’s recommended that there is one NAT gateway available in each Availability Zone.

VPC setup

The stack also creates a security group with a self-referencing rule to allow communication between AWS Glue components.

Create a private CA using ACM

Complete the following steps to create a root CA. For more details, refer to Creating a private CA.

On the AWS Private CA console, choose Create a private CA.
For Mode options, select either General-purpose or Short-lived certificate for lower pricing.
For CA type options, select Root.
Provide certificate details by providing at least one distinguished name.

Create private CA

Leave the remaining default options and select the acknowledge checkbox.
Choose Create CA.
On the Actions menu, choose Install CA certificate and choose Confirm and install.

Install certificate

Set up an MSK cluster with mutual TLS authentication

Before setting up the MSK cluster, make sure you have a VPC with at least two private subnets in different Availability Zones and a NAT gateway with a route to the internet. A CloudFormation template is provided in the prerequisites section.

Complete the following steps to set up your cluster:

On the Amazon MSK console, choose Create cluster.
For Creation method, Custom create.
For Cluster type, select Provisioned.
For Broker size, you can choose kafka.t3.small for the purpose of this post.
For Number of zones, choose 2.
Choose Next.
In the Networking section, select the VPC, private subnets, and security group you created in the prerequisites section.
In the Security settings section, under Access control methods, select TLS client authentication through AWS Certificate Manager (ACM).
For AWS Private CAs, choose the AWS private CA you created earlier.

The MSK cluster creation can take up to 30 minutes to complete.

Create a JKS file and generate a client certificate and private key

Using the root CA, you generate client certificates to use for authentication. The following instructions are for CloudShell, but can also be adapted for a client machine with Java and the AWS CLI installed.

Open a new CloudShell session and run the following commands to create the certs directory and install Java:

mkdir certs
cd certs
sudo yum -y install java-11-amazon-corretto-headless

Run the following command to create a keystore file with a private key in JKS format. Replace Distinguished-Name, Example-Alias, Your-Store-Pass, and Your-Key-Pass with strings of your choice:

keytool -genkey -keystore kafka.client.keystore.jks -validity 300 -storepass Your-Store-Pass -keypass Your-Key-Pass -dname "CN=Distinguished-Name" -alias Example-Alias -storetype pkcs12

Generate a certificate signing request (CSR) with the private key created in the preceding step:

keytool -keystore kafka.client.keystore.jks -certreq -file csr.pem -alias Example-Alias -storepass Your-Store-Pass -keypass Your-Key-Pass

Run the following command to remove the word NEW (and the single space that follows it) from the beginning and end of the file:

sed -i -E '1,$ s/NEW //' csr.pem

The file should start with -----BEGIN CERTIFICATE REQUEST----- and end with -----END CERTIFICATE REQUEST-----

Using the CSR file, create a client certificate using the following command. Replace Private-CA-ARN with the ARN of the private CA you created.

aws acm-pca issue-certificate --certificate-authority-arn Private-CA-ARN --csr fileb://csr.pem --signing-algorithm "SHA256WITHRSA" --validity Value=300,Type="DAYS"

The command should print out the ARN of the issued certificate. Save the CertificateArn value for use in the next step.

{
"CertificateArn": "arn:aws:acm-pca:region:account:certificate-authority/CA_ID/certificate/certificate_ID"
}

Use the Private-CA-ARN together with the CertificateArn (arn:aws:acp-pca:<region>:...) generated in the preceding step to retrieve the signed client certificate. This will create a client-cert.pem file.

aws acm-pca get-certificate --certificate-authority-arn Private-CA-ARN --certificate-arn Certificate-ARN | jq -r '.Certificate + "\n" + .CertificateChain' >> client-cert.pem

Add the certificate into the Java keystore so you can present it when you talk to the MSK brokers:

keytool -keystore kafka.client.keystore.jks -import -file client-cert.pem -alias Example-Alias -storepass Your-Store-Pass -keypass Your-Key-Pass -noprompt

Extract the private key from the JKS file. Provide the same destkeypass and deststorepass and enter the keystore password when prompted.

keytool -importkeystore -srckeystore kafka.client.keystore.jks -destkeystore keystore.p12 -srcalias Example-Alias -deststorepass Your-Store-Pass -destkeypass Your-Key-Pass -deststoretype PKCS12

Convert the private key to PEM format. Enter the keystore password you provided in the previous step when prompted.

openssl pkcs12 -in keystore.p12 -nodes -nocerts -out private-key.pem

Remove the lines that begin with Bag Attributes.. from the top of the file:

sed -i -ne '/-BEGIN PRIVATE KEY-/,/-END PRIVATE KEY-/p' private-key.pem

Upload the client-cert.pem, client.keystore.jks, and private-key.pem files to Amazon S3. You can either create a new S3 bucket or use an existing bucket to store the following objects. Replace <s3://aws-glue-assets-11111111222222-us-east-1/certs/> with your S3 location.

aws s3 sync ~/certs s3://aws-glue-assets-11111111222222-us-east-1/certs/ --exclude '*' --include 'client-cert.pem' --include 'private-key.pem' --include 'kafka.client.keystore.jks'

Create a Kafka connection in AWS Glue

Complete the following steps to create a Kafka connection:

On the AWS Glue console, choose Data connections in the navigation pane.
Choose Create connection.
Select Apache Kafka and choose Next.
For Amazon Managed Streaming for Apache Kafka Cluster, choose the MSK cluster you created earlier.

Create Glue Kafka connection

Choose TLS client authentication for Authentication method.
Enter the S3 path to the keystore you created earlier and provide the keystore and client key passwords you used for the -storepass and -keypass

Add authentication method to connection

Under Networking options, choose your VPC, a private subnet, and a security group. The security group should contain a self-referencing rule.
On the next page, provide a name for the connection (for example, Kafka-connection) and choose Create connection.

Create a Python shell job in AWS Glue to create a topic and push messages to Kafka

In this section, you create a Python shell job to create a new Kafka topic and push JSON messages to the topic. Complete the following steps:

On the AWS Glue console, choose ETL jobs.
In the Script section, for Engine, choose Python shell.
Choose Create script.

Create Python shell job

Enter the following script in the editor:

import sys
from awsglue.utils import getResolvedOptions
from kafka.admin import KafkaAdminClient, NewTopic
from kafka import KafkaProducer
from kafka.errors import TopicAlreadyExistsError
from urllib.parse import urlparse

import json
import uuid
import datetime
import boto3
import time
import random

# Fetch job parameters
args = getResolvedOptions(sys.argv, ['connection-names', 'client-cert', 'private-key'])

# Download client certificate and private key files from S3
TOPIC = 'example_topic'
client_cert = urlparse(args['client_cert'])
private_key = urlparse(args['private_key'])

s3 = boto3.client('s3')
s3.download_file(client_cert.netloc, client_cert.path.lstrip('/'),  client_cert.path.split('/')[-1])
s3.download_file(private_key.netloc, private_key.path.lstrip('/'),  private_key.path.split('/')[-1])

# Fetch bootstrap servers from connection
args = getResolvedOptions(sys.argv, ['connection-names'])
if ',' in args['connection_names']:
    raise ValueError("Choose only one connection name in the job details tab!")
glue_client = boto3.client('glue')
response = glue_client.get_connection(Name=args['connection_names'], HidePassword=True)
bootstrapServers = response['Connection']['ConnectionProperties']['KAFKA_BOOTSTRAP_SERVERS']

# Create topic and push messages 
admin_client = KafkaAdminClient(bootstrap_servers= bootstrapServers, security_protocol= 'SSL', ssl_certfile= client_cert.path.split('/')[-1], ssl_keyfile= private_key.path.split('/')[-1])
try:
    admin_client.create_topics(new_topics=[NewTopic(name=TOPIC, num_partitions=1, replication_factor=1)], validate_only=False)
except TopicAlreadyExistsError:
    # Topic already exists
    pass
admin_client.close()

# Generate JSON messages for the new topic
producer = KafkaProducer(value_serializer=lambda m: json.dumps(m).encode('ascii'), bootstrap_servers=bootstrapServers, security_protocol='SSL', 
                         ssl_check_hostname=True, ssl_certfile= client_cert.path.split('/')[-1], ssl_keyfile= private_key.path.split('/')[-1])
                         
for i in range(1200):
    _event = {
        "deviceId": str(uuid.uuid4()),
        "patientId": "PI" + str(random.randint(1,15)).rjust(5, '0'),
        "temperature": round(random.uniform(32.1, 40.9), 1),
        "eventTime": str(datetime.datetime.now())
    }
    producer.send(TOPIC, _event)
    time.sleep(3)
    
producer.close()

On the Job details tab, provide a name for your job, such as Kafka-msk-producer.
Choose an IAM role. If you don’t have one, create one following the instructions in Configuring IAM permissions for AWS Glue.
Under Advanced properties, for Connections, choose the Kafka-connection connection you created.
Under Job parameters, add the following parameters and values:
1. Key: --additional-python-modules, value: kafka-python.
2. Key: --client-cert, value: s3://aws-glue-assets-11111111222222-us-east-1/certs/client-cert.pem. Replace with your client-cert.pem Amazon S3 location from earlier.
3. Key: --private-key, value: s3://aws-glue-assets-11111111222222-us-east-1/certs/private-key.pem. Replace with your private-key.pem Amazon S3 location from earlier.
Save and run the job.

You can confirm that the job run status is Running on the Runs tab.

At this point, we have successfully created a Python shell job to simulate the thermometers sending temperature readings to the monitoring application. The job will run for approximately 1 hour and push 1,200 records to Amazon MSK.

Alternatively, you can replace the Python shell job with a Scala ETL job to act as a producer to send messages to the MSK cluster. In this case, use the JKS file for authentication using ssl.keystore.type=JKS. If you’re using PEM format keys, the current version of Kafka clients libraries (2.4.1) installed in AWS Glue version 4 don’t yet support authentication through certificates in PEM format (as of this writing).

Create an AWS Glue Streaming job to consume and process the messages

You can now create an AWS Glue ETL job to consume and process the messages in the Kafka topic. AWS Glue will automatically infer the schema from the files. Complete the following steps:

On the AWS Glue console, choose Visual ETL in the navigation pane.
Choose Visual ETL to author a new job.
For Sources, choose Apache Kafka.
For Connection name, choose the node and connection name you created earlier.
For Topic name, enter the topic name (example_topic) you created earlier.
Leave the rest of the options as default.

Kafka data source

Add a new target node called Amazon S3 to store the output Parquet files generated from the streaming data.
Choose Parquet as the data format and provide an S3 output location for the generated files.
Select the option to allow AWS Glue to create a table in the Data Catalog and provide the database and table names.

S3 Output node

On the job details tab, provide the following options:
1. For the requested number of workers, enter 2.
2. For IAM Role, choose an IAM role with permissions to read and write to the S3 output location.
3. For Job timeout, enter 60 (for the job to stop after 60 minutes).
4. Under Advanced properties, for Connections, choose the connection you created.
Save and run the job.

You can confirm the S3 output location for new Parquet files created under the prefixes s3://<output-location>/ingest_year=XXXX/ingest_month=XX/ingest_day=XX/ingest_hour=XX/.

At this point, you have created a streaming job to process events from Amazon MSK and store the JSON formatted records as Parquet files in Amazon S3. AWS Glue streaming jobs are meant to be running continuously to process streaming data. We have set the timeout to stop the job after 60 minutes. You can also stop the job manually after the records have been processed to Amazon S3.

Analyze the data in Athena

Going back to our example use case, you can run the following query in Athena to monitor and track the hourly average temperature readings for patients that exceed the normal thresholds (36.1–37.2°C):

SELECT
date_format(parse_datetime(eventTime, 'yyyy-MM-dd HH:mm:ss.SSSSSS'), '%h %p') hour,
patientId,
round(avg(temperature), 1) average_temperature,
count(temperature) readings
FROM "default"."devices_data"
GROUP BY 1, 2
HAVING avg(temperature) > 37.2 or avg(temperature) < 36.1
ORDER BY 2, 1 DESC

Amazon Athena Console

Run the query multiple times and observe how the average_temperature and the number of readings changes with new incoming data from the AWS Glue Streaming job. In our example scenario, healthcare workers can use this information to identify patients who are experiencing consistent high or low body temperatures and give the required attention.

At this point, we have successfully created and ingested streaming data to our MSK cluster using mutual TLS authentication. We only needed the certificates generated by AWS Private CA to authenticate our AWS Glue clients to the MSK cluster and process the streaming data with an AWS Glue Streaming job. Finally, we used Athena to visualize the data and observed how the data changes in near real time.

Clean up

To clean up the resources created in this post, complete the following steps:

Delete the private CA you created.
Delete the MSK cluster you created.
Delete the AWS Glue connection you created.
Stop the jobs if they are still running and delete the jobs you created.
If you used the CloudFormation stack provided in the prerequisites, delete the CloudFormation stack to delete the VPC and other networking components.

Conclusion

This post demonstrated how you can use AWS Glue to consume, process, and store streaming data for Amazon MSK using mutual TLS authentication. AWS Glue Streaming automatically infers the schema and creates a table in the Data Catalog. You can then query the table using other data analysis tools like Athena, Amazon Redshift, and Amazon QuickSight to provide insights into the streaming data.

Try out the solution for yourself, and let us know your questions and feedback in the comments section.

About the Authors

Edward Okemwa is a Big Data Cloud Support Engineer (ETL) at AWS Nairobi specializing in AWS Glue and Amazon Athena. He is dedicated to providing customers with technical guidance and resolving issues related to processing and analyzing large volumes of data. In his free time, he enjoys singing choral music and playing football.

Emmanuel Mashandudze is a Senior Big Data Cloud Engineer specializing in AWS Glue. He collaborates with product teams to help customers efficiently transform data in the cloud. He helps customers design and implements robust data pipelines. Outside of work, Emmanuel is an avid marathon runner, sports enthusiast and enjoys creating memories with his family.

Improve Apache Kafka scalability and resiliency using Amazon MSK tiered storage

2024-08-02 Sai Maddali

Post Syndicated from Sai Maddali original https://aws.amazon.com/blogs/big-data/improve-apache-kafka-scalability-and-resiliency-using-amazon-msk-tiered-storage/

Since the launch of tiered storage for Amazon Managed Streaming for Apache Kafka (Amazon MSK), customers have embraced this feature for its ability to optimize storage costs and improve performance. In previous posts, we explored the inner workings of Kafka, maximized the potential of Amazon MSK, and delved into the intricacies of Amazon MSK tiered storage. In this post, we deep dive into how tiered storage helps with faster broker recovery and quicker partition migrations, facilitating faster load balancing and broker scaling.

Apache Kafka availability

Apache Kafka is a distributed log service designed to provide high availability and fault tolerance. At its core, Kafka employs several mechanisms to provide reliable data delivery and resilience against failures:

Kafka replication – Kafka organizes data into topics, which are further divided into partitions. Each partition is replicated across multiple brokers, with one broker acting as the leader and the others as followers. If the leader broker fails, one of the follower brokers is automatically elected as the new leader, providing continuous data availability. The replication factor determines the number of replicas for each partition. Kafka maintains a list of in-sync replicas (ISRs) for each partition, which are the replicas that are up to date with the leader.
Producer acknowledgments – Kafka producers can specify the required acknowledgment level for write operations. This makes sure the data is durably persisted on the configured number of replicas before the producer receives an acknowledgment, reducing the risk of data loss.
Consumer group rebalancing – Kafka consumers are organized into consumer groups, where each consumer in the group is responsible for consuming a subset of the partitions. If a consumer fails, the partitions it was consuming are automatically reassigned to the remaining consumers in the group, providing continuous data consumption.
Zookeeper or KRaft for cluster coordination – Kafka relies on Apache ZooKeeper or KRaft for cluster coordination and metadata management. It maintains information about brokers, topics, partitions, and consumer offsets, enabling Kafka to recover from failures and maintain a consistent state across the cluster.

Kafka’s storage architecture and its impact on availability and resiliency

Although Kafka provides robust fault-tolerance mechanisms, in the traditional Kafka architecture, brokers store data locally on their attached storage volumes. This tight coupling of storage and compute resources can lead to several issues, impacting availability and resiliency of the cluster:

Slow broker recovery – When a broker fails, the recovery process involves transferring data from the remaining replicas to the new broker. This data transfer can be slow, especially for large data volumes, leading to prolonged periods of reduced availability and increased recovery times.
Inefficient load balancing – Load balancing in Kafka involves moving partitions between brokers to distribute the load evenly. However, this process can be resource-intensive and time-consuming, because it requires transferring large amounts of data between brokers.
Scaling limitations – Scaling a Kafka cluster traditionally involves adding new brokers and rebalancing partitions across the expanded set of brokers. This process can be disruptive and time-consuming, especially for large clusters with high data volumes.

How Amazon MSK tiered storage improves availability and resiliency

Amazon MSK offers tiered storage, a feature that allows configuring local and remote tiers. This greatly decouples compute and storage resources and thereby addresses the aforementioned challenges, improving availability and resiliency of Kafka clusters. You can benefit from the following:

Faster broker recovery – With tiered storage, data automatically moves from the faster Amazon Elastic Block Store (Amazon EBS) volumes to the more cost-effective storage tier over time. New messages are initially written to Amazon EBS for fast performance. Based on your local data retention policy, Amazon MSK transparently transitions that data to tiered storage. This frees up space on the EBS volumes for new messages. When broker fails and recovers either due to node or volume failure, the catch-up is faster because it only needs to catch up data stored on the local tier from the leader.
Efficient load balancing – Load balancing in Amazon MSK with tiered storage is more efficient because there is less data to move while reassigning partition. This process is faster and less resource-intensive, enabling more frequent and seamless load balancing operations.
Faster scaling – Scaling an MSK cluster with tiered storage is a seamless process. New brokers can be added to the cluster without the need for a large amount of data transfer and longer time for partition rebalancing. The new brokers can start serving traffic much faster, because the catch-up process takes less time, improving the overall cluster throughput and reducing downtime during scaling operations.

As shown in the following figure, MSK brokers and EBS volumes are tightly coupled. On a three-AZ deployed cluster, when you create a topic with replication factor three, Amazon MSK spreads those three replicas across all three Availability Zones and the EBS volumes attached with that broker store all the topic data spread across three Availability Zones. If you need to move a partition from one broker to another, Amazon MSK needs to move all the segments (both active and closed) from the existing broker to the new brokers, as illustrated in the following figure.

However, when you enable tiered storage for that topic, Amazon MSK transparently moves all closed segments for a topic from EBS volumes to tiered storage. That storage provides the built-in capability for durability and high availability with virtually unlimited storage capacity. With closed segments moved to tiered storage and only active segments on the local volume, your local storage footprint remains minimal regardless of topic size. If you need to move the partition to a new broker, the data movement is very minimal across the brokers. The following figure illustrates this updated configuration.

Amazon MSK tiered storage addresses the challenges posed by Kafka’s traditional storage architecture, enabling faster broker recovery, efficient load balancing, and seamless scaling, thereby improving availability and resiliency of your cluster. To learn more about the core components of Amazon MSK tiered storage, refer to Deep dive on Amazon MSK tiered storage.

A real-world test

We hope that you now understand how Amazon MSK tiered storage can improve your Kafka resiliency and availability. To test it, we created a three-node cluster with the new m7g instance type. We created a topic with a replication factor of three and without using tiered storage. Using the Kafka performance tool, we ingested 300 GB of data into the topic. Next, we added three new brokers to the cluster. Because Amazon MSK doesn’t automatically move partitions to these three new brokers, they will remain idle until we rebalance the partitions across all six brokers.

Let’s consider a scenario where we need to move all the partitions from the existing three brokers to the three new brokers. We used the kafka-reassign-partitions tool to move the partitions from the existing three brokers to the newly added three brokers. During this partition movement operation, we observed that the CPU usage was high, even though we weren’t performing any other operations on the cluster. This indicates that the high CPU usage was due to the data replication to the new brokers. As shown in the following metrics, the partition movement operation from broker 1 to broker 2 took approximately 75 minutes to complete.

Additionally, during this period, CPU utilization was elevated.

After completing the test, we enabled tiered storage on the topic with local.retention.ms=3600000 (1 hour) and retention.ms=31536000000. We continuously monitored the RemoteCopyBytesPerSec metrics to determine when the data migration to tiered storage was complete. After 6 hours, we observed zero activity on the RemoteCopyBytesPerSec metrics, indicating that all closed segments had been successfully moved to tiered storage. For instructions to enable tiered storage on an existing topic, refer to Enabling and disabling tiered storage on an existing topic.

We then performed the same test again, moving partitions to three empty brokers. This time, the partition movement operation was completed in just under 15 minutes, with no noticeable CPU usage, as shown in the following metrics. This is because, with tiered storage enabled, all the data has already been moved to the tiered storage, and we only have the active segment in the EBS volume. The partition movement operation is only moving the small active segment, which is why it takes less time and minimal CPU to complete the operation.

Conclusion

In this post, we explored how Amazon MSK tiered storage can significantly improve the scalability and resilience of Kafka. By automatically moving older data to the cost-effective tiered storage, Amazon MSK reduces the amount of data that needs to be managed on the local EBS volumes. This dramatically improves the speed and efficiency of critical Kafka operations like broker recovery, leader election, and partition reassignment. As demonstrated in the test scenario, enabling tiered storage reduced the time taken to move partitions between brokers from 75 minutes to just under 15 minutes, with minimal CPU impact. This enhanced the responsiveness and self-healing ability of the Kafka cluster, which is crucial for maintaining reliable, high-performance operations, even as data volumes continue to grow.

If you’re running Kafka and facing challenges with scalability or resilience, we highly recommend using Amazon MSK with the tiered storage feature. By taking advantage of this powerful capability, you can unlock the true scalability of Kafka and make sure your mission-critical applications can keep pace with ever-increasing data demands.

To get started, refer to Enabling and disabling tiered storage on an existing topic. Additionally, check out Automated deployment template of Cruise Control for Amazon MSK for effortlessly rebalancing your workload.

About the Authors

Nagarjuna Koduru is a Principal Engineer in AWS, currently working for AWS Managed Streaming For Kafka (MSK). He led the teams that built MSK Serverless and MSK Tiered storage products. He previously led the team in Amazon JustWalkOut (JWO) that is responsible for real time tracking of shopper locations in the store. He played pivotal role in scaling the stateful stream processing infrastructure to support larger store formats and reducing the overall cost of the system. He has keen interest in stream processing, messaging and distributed storage infrastructure.

Masudur Rahaman Sayem is a Streaming Data Architect at AWS. He works with AWS customers globally to design and build data streaming architectures to solve real-world business problems. He specializes in optimizing solutions that use streaming data services and NoSQL. Sayem is very passionate about distributed computing.

Synchronize data lakes with CDC-based UPSERT using open table format, AWS Glue, and Amazon MSK

2024-07-31 Shubham Purwar

Post Syndicated from Shubham Purwar original https://aws.amazon.com/blogs/big-data/synchronize-data-lakes-with-cdc-based-upsert-using-open-table-format-aws-glue-and-amazon-msk/

In the current industry landscape, data lakes have become a cornerstone of modern data architecture, serving as repositories for vast amounts of structured and unstructured data. Change data capture (CDC) refers to the process of identifying and capturing changes made to data in a database and then delivering those changes in a downstream system. Capturing every change from transactions in a source database and moving them to the target keeps the systems synchronized, and helps with analytics use cases and zero-downtime database migrations.

However, efficiently managing and synchronizing data within these lakes presents a significant challenge. Maintaining data consistency and integrity across distributed data lakes is crucial for decision-making and analytics. Inaccurate or outdated data can lead to flawed insights and business decisions. Businesses require synchronized data to gain actionable insights and respond swiftly to changing market conditions. Scalability is a critical concern for data lakes, because they need to accommodate growing volumes of data without compromising performance or incurring exorbitant costs.

To address these issues effectively, we propose using Amazon Managed Streaming for Apache Kafka (Amazon MSK), a fully managed Apache Kafka service that offers a seamless way to ingest and process streaming data. We use MSK connect—an AWS managed service to deploy and run Kafka Connect to build an end-to-end CDC application that uses Debezium MySQL connector to process, insert, update, and delete records from MySQL and a confluent Amazon Simple Storage Service (Amazon S3) sink connector to write to Amazon S3 as raw data that can be consumed by other downstream application for further use cases. To process batch data effectively, we use AWS Glue, a serverless data integration service that uses the Spark framework to process the data from S3 and copies the data to the open table format layer. Open table format manages large collections of files as tables and supports modern analytical data lake operations such as record-level insert, update, delete, and time travel queries. We chose Delta Lake as an example open table format, but you can achieve the same results using Apache Iceberg or Apache Hudi.

The post illustrates the construction of a comprehensive CDC system, enabling the processing of CDC data sourced from Amazon Relational Database Service (Amazon RDS) for MySQL. Initially, we’re creating a raw data lake of all modified records in the database in near real time using Amazon MSK and writing to Amazon S3 as raw data. This raw data can then be used to build a data warehouse or even a special type of data storage that’s optimized for analytics, such as a Delta Lake on S3. Later, we use an AWS Glue exchange, transform, and load (ETL) job for batch processing of CDC data from the S3 raw data lake. A key advantage of this setup is that you have complete control over the entire process, from capturing the changes in your database to transforming the data for your specific needs. This flexibility allows you to adapt the system to different use cases.

This is achieved through integration with MSK Connect using the Debezium MySQL connector, followed by writing data to Amazon S3 facilitated by the Confluent S3 Sink Connector. Subsequently, the data is processed from S3 using an AWS Glue ETL job, and then stored in the data lake layer. Finally, the Delta Lake table is queried using Amazon Athena.

Note: If you require real-time data processing of the CDC data, you can bypass the batch approach and use an AWS Glue streaming job instead. This job would directly connect to the Kafka topic in MSK, grabbing the data as soon as changes occur. It can then process and transform the data as needed, creating a Delta Lake on Amazon S3 that reflects the latest updates according to your business needs. This approach ensures you have the most up-to-date data available for real-time analytics.

Solution overview

The following diagram illustrates the architecture that you implement through this blog post. Each number represents a major component of the solution.

The workflow consists of the following:

Near real-time data capture from MySQL and streaming to Amazon S3
1. The process starts with data originating from Amazon RDS for
2. A Debezium connector is used to capture changes to the data in the RDS instance in near real time. Debezium is a distributed platform that converts information from your existing databases into event streams, enabling applications to detect and immediately respond to row-level changes in the databases. Debezium is built on top of Apache Kafka and provides a set of Kafka Connect compatible connectors.
3. The captured data changes are then streamed to an Amazon MSK topic. MSK is a managed service that simplifies running Apache Kafka on AWS.
4. The processed data stream (topic) is streamed from MSK to Amazon S3 in JSON format. The Confluent S3 Sink Connector allows near real-time data transfer from an MSK cluster to an S3 bucket.
Batch processing the CDC raw data and writing it into the data lake
1. Set up an AWS Glue ETL job to process the raw CDC
2. This job reads bookmarked data from an S3 raw bucket and writes into the data lake in open file format (Delta). The job also creates the Delta Lake table in AWS Glue Data Catalog.
3. Delta Lake is an open-source storage layer built on top of existing data lakes. It adds functionalities like ACID transactions and versioning to improve data reliability and manageability.
Analyze the data using serverless interactive query service
1. Athena, a serverless interactive query service, can be used to query the Delta Lake table created in Glue Data Catalog. This allows for interactive data analysis without managing infrastructure.

For this post, we create the solution resources in the us-east-1 AWS Region using AWS CloudFormation templates. In the following sections, we show you how to configure your resources and implement the solution.

Configure resources with AWS CloudFormation

In this post, you use the following two CloudFormation templates. The advantage of using two different templates is that you can decouple the resource creation of the CDC pipeline and AWS Glue processing according to your use case, and if you have requirements to create specific process resources only.

vpc-msk-mskconnect-rds-client.yaml – This template sets up the CDC pipeline resources such as a virtual private cloud (VPC), subnet, security group, AWS Identity and Access Management (IAM) roles, NAT, internet gateway, Amazon Elastic Compute Cloud (Amazon EC2) client, Amazon MSK, MSKConnect, RDS, and S3
gluejob-setup.yaml – This template sets up the data processing resources such as the AWS Glue table, database and ETL

Configure MSK and MSK connect

To start, you’ll configure MKS and MSK connect using Debezium connector to capture incremental changes in table and write into Amazon S3 using an S3 sink connector. The vpc-msk-mskconnect-rds-client.yaml stack creates a VPC, private and public subnets, security groups, S3 buckets, Amazon MSK cluster, EC2 instance with Kafka client, RDS database, and MSK connectors, and its worker configurations.

Launch the stack vpc-msk-mskconnect-rds-client using the CloudFormation template:
Provide the parameter values as listed in the following

.	A	B	C
1	Parameters	Description	Sample value
2	`EnvironmentName`	An environment name that is prefixed to resource names.	msk-delta-cdc-pipeline
3	`DatabasePassword`	Database admin account password.	S3cretPwd99
4	`InstanceType`	MSK client EC2 instance type.	t2.micro
5	`LatestAmiId`	Latest AMI ID of Amazon Linux 2023 for EC2 instance. You can use the default value.	/aws/service/ami-amazon-linux- latest/al2023-ami-kernel-6.1-x86_64
6	`VpcCIDR`	IP range (CIDR notation) for this VPC.	10.192.0.0/16
7	`PublicSubnet1CIDR`	IP range (CIDR notation) for the public subnet in the first Availability Zone.	10.192.10.0/24
8	`PublicSubnet2CIDR`	IP range (CIDR notation) for the public subnet in the second Availability Zone.	10.192.11.0/24
9	`PrivateSubnet1CIDR`	IP range (CIDR notation) for the private subnet in the first Availability Zone.	10.192.20.0/24
10	`PrivateSubnet2CIDR`	IP range (CIDR notation) for the private subnet in the second Availability Zone.	10.192.21.0/24
11	`PrivateSubnet3CIDR`	IP range (CIDR notation) for the private subnet in the third Availability Zone.	10.192.22.0/24

The stack creation process can take approximately one hour to complete. Check the Outputs tab for the stack after the stack is created.

Next, you set up the AWS Glue data processing resources such as the AWS Glue database, table, and ETL job.

Implement UPSERT on an S3 data lake with Delta Lake using AWS Glue

The gluejob-setup.yaml CloudFormation template creates a database, IAM role, and AWS Glue ETL job. Retrieve the values for S3BucketNameForOutput, and S3BucketNameForScript from the vpc-msk-mskconnect-rds-client stack’s Outputs tab to use in this template. Complete the following steps:

Launch the stack gluejob-setup.
Provide parameter values as listed in the following

.	A	B	C
1	Parameters	Description	Sample value
2	`EnvironmentName`	Environment name that is prefixed to resource names.	gluejob-setup
3	`GlueDataBaseName`	Name of the Data Catalog database.	glue_cdc_blog_db
4	GlueTableName	Name of the Data Catalog table.	blog_cdc_tbl
5	`S3BucketForGlueScript`	Bucket name for the AWS Glue ETL script.	Use the S3 bucket name from the previous stack. For example, aws- gluescript-${AWS::AccountId}-${AWS::Region}-${EnvironmentNam e
6	`GlueWorkerType`	Worker type for AWS Glue job. For example, G.1X	G.1X
7	`NumberOfWorkers`	Number of workers in the AWS Glue job.	3
8	`S3BucketForOutput`	Bucket name for writing data from the AWS Glue job.	aws-glueoutput-${AWS::AccountId}-${AWS::Region}-${EnvironmentName}
9	`S3ConnectorTargetBucketname`	Bucket name where the Amazon MSK S3 sink connector writes the data from the Kafka topic.	msk-lab-${AWS::AccountId}- target-bucket

The stack creation process can take approximately 2 minutes to complete. Check the Outputs tab for the stack after the stack is created.

In the gluejob-setup stack, we created an AWS Glue database and AWS Glue job. For further clarity, you can examine the AWS Glue database and job generated using the CloudFormation template.

After successfully creating the CloudFormation stack, you can proceed with processing data using the AWS Glue ETL job.

Run the AWS Glue ETL job

To process the data created in the S3 bucket from Amazon MSK using the AWS Glue ETL job that you set up in the previous section, complete the following steps:

On the CloudFormation console, choose the stack gluejob-setup.
On the Outputs tab, retrieve the name of the AWS Glue ETL job from the GlueJobName In the following screenshot, the name is GlueCDCJob-glue-delta-cdc.

On the AWS Glue console, choose ETL jobs in the navigation pane.
Search for the AWS Glue ETL job named GlueCDCJob-glue-delta-cdc.
Choose the job name to open its details page.
Choose Run to start the On the Runs tab, confirm if the job ran without failure.

Retrieve the OutputBucketName from the gluejob-setup template output.
On the Amazon S3 console, navigate to the S3 bucket to verify the data.

Note: We have enabled AWS Glue job bookmark, which will make sure job will process the new data in each job run.

Query the Delta Lake table using Athena

After the AWS Glue ETL job has successfully created the Delta Lake table for the processed data in the Data Catalog, follow these steps to validate the data using Athena:

On the Athena console, navigate to the query editor.
Choose the Data Catalog as the data source.
Choose the database glue_cdc_blog_db created using gluejob-setup stack.
To validate the data, run the following query to preview the data and find the total count.

SELECT * FROM "glue_cdc_blog_db"."blog_cdc_tbl" ORDER BY cust_id DESC LIMIT 40;

SELECT COUNT(*) FROM "glue_cdc_blog_db"."blog_cdc_tbl";

The following screenshot shows the output of our example query.

Upload incremental (CDC) data for further processing

After we process the initial full load, let’s perform insert, update, and delete records in MySQL, which will be processed by the Debezium mysql connector and written to Amazon S3 using a confluent S3 sink connector.

On the Amazon EC2 console, go to the EC2 instance named KafkaClientInstance that you created using the CloudFormation template.

Run the following commands to insert the data into the RDS table. Use the database password from the CloudFormation stack parameter tab.

sudo su - ec2-user
RDS_AURORA_ENDPOINT=`aws rds describe-db-instances --region us-east-1 | jq -r '.DBInstances[] | select(.DBName == "salesdb") | .Endpoint.Address'`
mysql -f -u master -h $RDS_AURORA_ENDPOINT  --password

Now perform the insert into the CUSTOMER table.

use salesdb;
INSERT into CUSTOMER values(8887,'Customer Name 8887','Market segment 8887');
INSERT into CUSTOMER values(8888,'Customer Name 8888','Market segment 8888');
INSERT into CUSTOMER values(8889,'Customer Name 8889','Market segment 8889');

Run the AWS Glue job again to update the Delta Lake table with new records.
Use the Athena console to validate the data.

Perform the insert, update, and delete in the CUSTOMER table.

UPDATE CUSTOMER SET NAME='Customer Name update 8888',MKTSEGMENT='Market segment update 8888' where CUST_ID = 8888;
UPDATE CUSTOMER SET NAME='Customer Name update 8889',MKTSEGMENT='Market segment update 8889' where CUST_ID = 8889;
DELETE FROM CUSTOMER where CUST_ID = 8887;
INSERT into CUSTOMER values(9000,'Customer Name 9000','Market segment 9000');

Run the AWS Glue job again to update the Delta Lake table with the insert, update, and delete records.
Use the Athena console to validate the data to verify the update and delete records in the Delta Lake table.

Clean up

To clean up your resources, complete the following steps:

Delete the CloudFormation stack gluejob-setup.
Delete the CloudFormation stack vpc-msk-mskconnect-rds-client.

Conclusion

Organizations continually seek high-performance, cost-effective, and scalable analytical solutions to extract value from their operational data sources in near real time. The analytical platform must be capable of receiving updates to operational data as they happen. Traditional data lake solutions often struggle with managing changes in source data, but the Delta Lake framework addresses this challenge. This post illustrates the process of constructing an end-to-end change data capture (CDC) application using Amazon MSK, MSK Connect, AWS Glue, and native Delta Lake tables, alongside guidance on querying Delta Lake tables from Amazon Athena. This architectural pattern can be adapted to other data sources employing various Kafka connectors, enabling the creation of data lakes supporting UPSERT operations using AWS Glue and native Delta Lake tables. For further insights, see the MSK Connect examples.

About the authors

Shubham Purwar is a Cloud Engineer (ETL) at AWS Bengaluru specializing in AWS Glue and Athena. He is passionate about helping customers solve issues related to their ETL workload and implement scalable data processing and analytics pipelines on AWS. In his free time, Shubham loves to spend time with his family and travel around the world.

Nitin Kumar is a Cloud Engineer (ETL) at AWS, specializing in AWS Glue. With a decade of experience, he excels in aiding customers with their big data workloads, focusing on data processing and analytics. He is committed to helping customers overcome ETL challenges and develop scalable data processing and analytics pipelines on AWS. In his free time, he likes to watch movies and spend time with his family.

StarTree overview

Improving security with SOC 2, SSO, and RBAC

Using automated data ingestion at scale

Tiered storage for real-time query processing

Improving scalability with off-heap upserts

Customer success stories using Pinot with StarTree for real-time analytics

Flexible deployment options for StarTree Cloud

Self-managed Pinot or StarTree

Conclusion

About the Authors

Key features of MSK Express brokers

Comparison with traditional Kafka deployment

Scaling use case example

Best practices

Conclusion

About the Author

High-level approach for governing streaming data in Amazon DataZone

Amazon DataZone catalog

Data source for Amazon MSK clusters with AWS Glue Schema registry

Custom authorization flow

Subscription target

Subscription grant process

Implement streaming governance in Amazon DataZone with DSF

Deployment steps

Verify the example is working

Subscribe

Clean up

Conclusion

About the Authors

Discover, prepare, and integrate all your data at any scale

Review the Gartner Magic Quadrant

About the authors

Planning your migration

Assessing the source cluster’s infrastructure and needs

Assessing the target cluster’s infrastructure and needs

Configuring Express Brokers

Client connectivity to the target cluster

Migration strategy: All at once vs. wave

Cutover plan

Schema registry

Solution overview

Deployment Steps

Provision the MSK cluster

Configure the MSK client

Create a topic

Produce and consume messages to and from the topic

Create an MSK replicator

Monitor replication

Migrate clients from source to target cluster

Migrate stateful applications

Migrate Kafka Streams and KSQL applications

Migrate Spark applications

Migrate Flink applications

Conclusion

About the Author

Serverless at re:Invent 2024

AWS Lambda and Amazon Elastic Container Service (Amazon ECS) 10-year anniversary.

AWS Lambda

Amazon ECS and AWS Fargate

Amazon EventBridge

AWS Step Functions

Amazon Kinesis

Amazon MQ

Amazon Finch

Amazon Simple Queue Service (SQS)

Amazon Managed Streaming for Apache Kafka(Amazon MSK)

AWS Amplify

AWS AppSync

Amazon API Gateway

Serverless blog posts

October

November

Serverless Office Hours

October

November

Still looking for more?

Solution overview

Prerequisites

Install Terraform on your client machine

Provision an MSK topic using Terraform