Tag Archives: Amazon Managed Streaming for Apache Kafka (Amazon MSK)

Power your Kafka Streams application with Amazon MSK and AWS Fargate

Post Syndicated from Karen Grygoryan original https://aws.amazon.com/blogs/big-data/power-your-kafka-streams-application-with-amazon-msk-and-aws-fargate/

Today, companies of all sizes across all verticals design and build event-driven architectures centered around real-time streaming and stream processing. Amazon Managed Streaming for Apache Kafka (Amazon MSK) is a fully managed service that makes it easy for you to build and run applications that use Apache Kafka to process streaming and event data. Apache Kafka is an open-source platform for building real-time streaming data pipelines and applications. With Amazon MSK, you can continue to use native Apache Kafka APIs to build event-driven architectures, stream changes to and from databases, and power machine learning and analytics applications.

You can apply streaming in a wide range of industries and organizations, such as to capture and analyze data from IoT devices, track and monitor vehicles or shipments, monitor patients in medical facilities, or monitor financial transactions.

In this post, we walk through how to build a real-time stream processing application using Amazon MSK, AWS Fargate, and the Apache Kafka Streams API. The Kafka Streams API is a client library that simplifies development of stream applications. Behind the scenes, the Kafka Streams library is an abstraction over the standard Kafka producer and consumer APIs. When you build applications with the Kafka Streams library, your data streams are automatically made fault tolerant, and are transparently and elastically distributed across the instances of the application. Kafka Streams applications are supported by Amazon MSK. Fargate is a serverless compute engine for containers that works with AWS container orchestration services such as Amazon Elastic Container Service (Amazon ECS), and allows you to easily run, scale, and secure containerized applications.

We have chosen to run our Kafka Streams application on Fargate because Fargate lets you focus on building your applications. Fargate removes the need to provision and manage servers, lets you specify and pay for resources per application, and improves security through application isolation by design. Fargate allocates the right amount of compute, eliminating the need to choose instance types and scale cluster capacity. You pay only for the resources required to run your containers, so there is no over-provisioning or paying for additional servers. Fargate runs each task or pod in its own kernel, giving tasks and pods their own isolated compute environment. This provides your application with workload isolation and improved security by design.

Architecture overview

Our streaming application architecture consists of a stream producer, which connects to the Twitter Stream API, reads tweets, and publishes them to Amazon MSK. A Kafka Streams processor consumes these messages, performs window aggregation, publishes the results to an output topic, and prints them to the logs. Both apps are hosted on Fargate.

The stream producer application connects to the Twitter API (a stream of sample tweets), reads the stream of tweets, extracts only hashtags, and publishes them to the MSK topic. The following is a code snippet from the application:

var configs = new AppConfig();
var kafkaService = new KafkaService(configs.kafkaProducer());
var twitterService = new TwitterService(kafkaService, configs.httpClient());
if (null != BEARER_TOKEN) {
  twitterService.connectStream(BEARER_TOKEN);
} else {
  LOG.error(
      "There was a problem getting your bearer token. Please make sure you set the BEARER_TOKEN environment variable");
}

The MSK cluster is spread across three Availability Zones, with one broker per Availability Zone. We use the AWS-recommended (as of this writing) Apache Kafka version 2.6.1. The Apache Kafka topics have a replication factor of three and three partitions each, to take advantage of parallelism and resiliency.

The logic of our consumer streaming app is as follows: it counts the number of Twitter hashtags, longer than one character, that have been mentioned more than four times in a 20-second window:

private static final TimeWindows WINDOW_20_SEC = of(ofSeconds(20)).grace(ofMillis(0));
private static final int MIN_MENTIONED_IN_WINDOW = 4;
private static final int MIN_CHAR_LENGTH = 1;
…
var tweetStream =
    paragraphStream
        .filter(
            (k, v) -> v.length() > MIN_CHAR_LENGTH) // keep hashtags longer than 1 character
        .mapValues((ValueMapper<String, String>) String::toLowerCase) // lowercase hashtags
        .mapValues(String::trim) // remove leading and trailing spaces
        .selectKey((k, v) -> v) // select the hashtag as the key
        .groupByKey()
        .windowedBy(WINDOW_20_SEC) // apply a 20-second window aggregation
        .count(with(String(), Long())) // count hashtags, materialized in the state store as String & Long
        .suppress(untilWindowCloses(unbounded())) // emit only the "final results"; buffer unconstrained by size (not recommended for prod)
        .toStream()
        .map((k, v) -> new KeyValue<>(k.key(), v))
        .filter(
            (k, v) -> v > MIN_MENTIONED_IN_WINDOW); // keep hashtags mentioned more than 4 times
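To make the topology's behavior concrete, the following is a minimal pure-Python sketch of the same counting logic (tumbling 20-second windows; the grace period and suppression semantics of Kafka Streams are not modeled). The function name and the (timestamp, hashtag) record format are illustrative, not part of the application:

```python
from collections import defaultdict

MIN_MENTIONED_IN_WINDOW = 4
MIN_CHAR_LENGTH = 1
WINDOW_SIZE_MS = 20_000  # 20-second tumbling windows

def count_hashtags(records):
    """records: iterable of (timestamp_ms, hashtag) pairs.
    Returns {(window_start_ms, hashtag): count} for hashtags that pass
    the same filters as the Kafka Streams topology above."""
    counts = defaultdict(int)
    for ts, tag in records:
        tag = tag.lower().strip()          # lowercase and trim, as in the DSL
        if len(tag) <= MIN_CHAR_LENGTH:    # drop one-character hashtags
            continue
        window_start = (ts // WINDOW_SIZE_MS) * WINDOW_SIZE_MS
        counts[(window_start, tag)] += 1
    # emit only hashtags mentioned more than MIN_MENTIONED_IN_WINDOW times
    return {k: v for k, v in counts.items() if v > MIN_MENTIONED_IN_WINDOW}
```

For example, six mentions of `#AWS` within the first 20 seconds survive the final filter, while a hashtag seen only once does not.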

Prerequisites

Make sure to complete the following steps as prerequisites:

  1. Create an AWS account. For this post, you configure the required AWS resources in the us-east-1 or us-west-2 Region. If you haven’t signed up, complete the following tasks:
    1. Create an account. For instructions, see Sign Up for AWS.
    2. Create an AWS Identity and Access Management (IAM) user. For instructions, see Create an IAM User.
  2. Have a Bearer Token associated with your Twitter app. To create a developer account, see Get started with the Twitter developer platform.
  3. Install Docker on your local machine.

Solution overview

To implement this solution, we complete the following steps:

  1. Set up an MSK cluster and Amazon Elastic Container Registry (Amazon ECR).
  2. Build and upload application JAR files to Amazon ECR.
  3. Create an ECS cluster with a Fargate task and service definitions.
  4. Run our streaming application.

Set up an MSK cluster and Amazon ECR

Use the provided AWS CloudFormation template to create the VPC (with other required network components), security groups, MSK cluster with required Kafka topics (twitter_input and twitter_output), and two Amazon ECR repositories, one per each application.

Build and upload application JAR files to Amazon ECR

To build and upload the JAR files to Amazon ECR, complete the following steps:

  1. Download the application code from the GitHub repo.
  2. Build the applications by running the following command in the root of the project:
./gradlew clean build
  3. Create your Docker images (kafka-streams-msk and twitter-stream-producer):
docker-compose build
  4. Retrieve an authentication token and authenticate your Docker client to your registry. Use the following AWS Command Line Interface (AWS CLI) code:
aws ecr get-login-password --region <<region>> | docker login --username AWS --password-stdin <<account_id>>.dkr.ecr.<<region>>.amazonaws.com
  5. Tag and push your images to the Amazon ECR repository:
docker tag kafka-streams-msk:latest  <<account_id>>.dkr.ecr.<<region>>.amazonaws.com/kafka-streams-msk:latest 
docker tag twitter-stream-producer:latest  <<account_id>>.dkr.ecr.<<region>>.amazonaws.com/twitter-stream-producer:latest
  6. Run the following command to push the images to your Amazon ECR repositories:
docker push <<account_id>>.dkr.ecr.<<region>>.amazonaws.com/kafka-streams-msk:latest 
docker push <<account_id>>.dkr.ecr.<<region>>.amazonaws.com/twitter-stream-producer:latest
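The registry URI appears in every tag and push command, so typos are easy to make. As a convenience, a small helper can generate the command pairs; the account ID and Region below are placeholders, not values from this post:

```python
def ecr_commands(account_id, region, images):
    """Build the docker tag/push command pairs for a list of local images."""
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    cmds = []
    for image in images:
        uri = f"{registry}/{image}:latest"
        cmds.append(f"docker tag {image}:latest {uri}")
        cmds.append(f"docker push {uri}")
    return cmds

# Hypothetical account ID, for illustration only
for cmd in ecr_commands("123456789012", "us-east-1",
                        ["kafka-streams-msk", "twitter-stream-producer"]):
    print(cmd)
```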

Now you should see images in your Amazon ECR repository (see the following screenshot).

Create an ECS cluster with a Fargate task and service definitions

Use the provided CloudFormation template to create your ECS cluster, Fargate task, and service definitions. Make sure to have your Twitter API Bearer Token ready.

Run the streaming application

When the CloudFormation stack is complete, it automatically deploys your applications. After approximately 10 minutes, all your apps should be up and running, aggregating, and producing results. You can see the result in Amazon CloudWatch logs or by navigating to the Logs tab of the Fargate task.

Improvements, considerations, and best practices

Consider the following when implementing this solution:

  • Fargate enables you to run and maintain a specified number of instances of a task definition simultaneously in a cluster. If any of your tasks fail or stop for any reason, the Fargate scheduler launches another instance of your task definition to replace it, in order to maintain the desired number of tasks in the service. Fargate is not recommended for workloads that require privileged Docker permissions or more than 4 vCPUs or 30 GB of memory (consider whether you can break up your workload into more, smaller containers that each use fewer resources).
  • Kafka Streams resiliency and availability is provided by state stores. These state stores can be an in-memory hash map (as used in this post) or another convenient data structure (for example, RocksDB, which is recommended for production). A Kafka Streams application may embed more than one local state store, which can be accessed via APIs to store and query data required for processing. In addition, Kafka Streams makes sure that the local state stores are robust to failures: for each state store, it maintains a replicated changelog Kafka topic in which it tracks any state updates. If your app restarts after a crash, it replays the changelog topic and recreates the in-memory state store.
  • The AWS Glue Schema Registry is out of scope for this post, but should be considered in order to centrally discover, validate, and control the evolution of streaming data using registered Apache Avro schemas. Some of the benefits that come with it are data policy enforcement, data discovery, controlled schema evolution, and fault-tolerant streaming (data) pipelines.
  • To improve availability, enable three (the maximum as of this writing) Availability Zone replications within a Region. Amazon MSK continuously monitors cluster health, and if a component fails, Amazon MSK automatically replaces it.
  • When you enable three Availability Zones in your MSK cluster, you not only improve availability, but also improve cluster performance: you spread the load between a larger number of brokers, and can add more partitions per topic.
  • We highly encourage you to enable encryption at rest, TLS encryption in transit (client-to-broker and broker-to-broker), TLS-based certificate authentication, and SASL/SCRAM authentication, which can be secured by AWS Secrets Manager.
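The changelog-replay recovery described in the state-store bullet above can be sketched in a few lines of Python. This illustrates the idea only; it is not Kafka Streams' actual implementation:

```python
def replay_changelog(changelog):
    """Rebuild an in-memory state store from an ordered changelog.
    changelog: list of (key, value) updates; a None value acts as a
    tombstone that deletes the key, mirroring log-compaction semantics."""
    store = {}
    for key, value in changelog:
        if value is None:
            store.pop(key, None)
        else:
            store[key] = value  # later updates overwrite earlier ones
    return store
```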

Clean up

To clean up your resources, delete the CloudFormation stacks that you launched as part of this post. You can delete these resources via the AWS CloudFormation console or via the AWS Command Line Interface (AWS CLI).

Conclusion

In this post, we demonstrated how to build a scalable and resilient real-time stream processing application. We built the solution using the Kafka Streams API, Amazon MSK, and Fargate. We also discussed improvements, considerations, and best practices. You can use this architecture as a reference in your migrations or new workloads. Try it out and share your experience in the comments!


About the Author

Karen Grygoryan, Data Architect, AWS ProServe

Secure connectivity patterns to access Amazon MSK across AWS Regions

Post Syndicated from Sam Mokhtari original https://aws.amazon.com/blogs/big-data/secure-connectivity-patterns-to-access-amazon-msk-across-aws-regions/

AWS customers often segment their workloads across accounts and Amazon Virtual Private Cloud (Amazon VPC) to streamline access management while being able to expand their footprint. As a result, in some scenarios you, as an AWS customer, need to make an Amazon Managed Streaming for Apache Kafka (Amazon MSK) cluster accessible to Apache Kafka clients not only in the same Amazon VPC as the cluster but also in a remote Amazon VPC. A guest post by Goldman Sachs presented cross-account connectivity patterns to an MSK cluster using AWS PrivateLink. Inspired by the work of Goldman Sachs, this post demonstrates additional connectivity patterns that can support both cross-account and cross-Region connectivity to an MSK cluster. We also developed sample code that supports the automation of the creation of resources for the connectivity pattern based on AWS PrivateLink.

Overview

Amazon MSK makes it easy to run Apache Kafka clusters on AWS. It's a fully managed streaming service that automatically configures and maintains Apache Kafka clusters and Apache ZooKeeper nodes for you. Amazon MSK lets you focus on building your streaming solutions, supports familiar Apache Kafka ecosystem tools (such as MirrorMaker, Kafka Connect, and Kafka Streams), and helps you avoid the challenges of managing Apache Kafka infrastructure and operations.

If you have workloads segmented across several VPCs and AWS accounts, there may be scenarios in which you need to make an Amazon MSK cluster accessible to Apache Kafka clients across VPCs. To provide secure connections between resources across multiple VPCs, AWS provides several networking constructs. Let's get familiar with these before discussing the different connectivity patterns:

  • Amazon VPC peering is the simplest networking construct that enables bidirectional connectivity between two VPCs. You can use this connection type to enable VPCs across accounts and AWS Regions to communicate with each other using private IP addresses.
  • AWS Transit Gateway provides a highly available and scalable design for connecting VPCs. Unlike VPC peering, which can span Regions, AWS Transit Gateway is a Regional service, but you can use inter-Region peering between transit gateways to route traffic across Regions.

AWS PrivateLink is an AWS networking service that provides private access to a specific service instead of all resources within a VPC and without traversing the public internet. You can use this service to expose your own application in a VPC to other users or applications in another VPC via an AWS PrivateLink-powered service (referred to as an endpoint service). Other AWS principals can then create a connection from their VPC to your endpoint service using an interface VPC endpoint.

Amazon MSK networking

When you create an MSK cluster, either via the AWS Management Console or AWS Command Line Interface (AWS CLI), it’s deployed into a managed VPC with brokers in private subnets (one per Availability Zone) as shown in the following diagram. Amazon MSK also creates the Apache ZooKeeper nodes in the same private subnets.

The brokers in the cluster are made accessible to clients in the customer VPC through elastic network interfaces (ENIs) that appear in the customer account. The security groups on the ENIs dictate the source and type of ingress and egress traffic allowed on the brokers.

IP addresses from the customer VPC are attached to the ENIs, and all network traffic stays within the AWS network and is not accessible to the internet.

Connections between clients and an MSK cluster are always private.

This blog demonstrates four connectivity patterns to securely access an MSK cluster from a remote VPC. The following table lists these patterns and their key characteristics. Each pattern aligns with the networking constructs discussed earlier.

| | VPC peering | AWS Transit Gateway | AWS PrivateLink with a single NLB | AWS PrivateLink with multiple NLBs |
| --- | --- | --- | --- | --- |
| Bandwidth | Limited by instance network performance and flow limits | Up to 50 Gbps | 10 Gbps per AZ | 10 Gbps per AZ |
| Pricing | Data transfer charge (free if data transfer is within AZs) | Data transfer charge + hourly charge per attachment | Data transfer charge + interface endpoint charge + Network Load Balancer charge | Data transfer charge + interface endpoint charge + Network Load Balancer charge |
| Scalability | Recommended for a smaller number of VPCs | No limit on number of VPCs | No limit on number of VPCs | No limit on number of VPCs |

Let’s explore these connectivity options in more detail.

VPC peering

To access an MSK cluster from a remote VPC, the first option is to create a peering connection between the two VPCs.

Let’s say you use Account A to provision an MSK cluster in us-east-1 Region, as shown in the following diagram. Now, you have an Apache Kafka client in the customer VPC in Account B that needs to access this MSK cluster. To enable this connectivity, you just need to create a peering connection between the VPC in Account A and the VPC in Account B. You should also consider implementing fine-grained network access controls with security groups to make sure that only specific resources are accessible between the peered VPCs.

Because VPC peering works across Regions, you can extend this architecture to provide access to Apache Kafka clients in another Region. As shown in the following diagram, to provide access to Kafka clients in the VPC of Account C, you just need to create another peering connection between the VPC in Account C with the VPC in Account A. The same networking principles apply to make sure only specific resources are reachable. In the following diagram, a solid line indicates a direct connection from the Kafka client to MSK cluster, whereas a dotted line indicates a connection flowing via VPC peering.

VPC peering has the following benefits:

  • Simplest connectivity option.
  • Low latency.
  • No dedicated bandwidth limit (throughput is bounded only by instance network performance and flow limits).
  • Lower overall cost compared to other VPC-to-VPC connectivity options.

However, it has some drawbacks:

  • VPC peering doesn’t support transitive peering, which means that only directly peered VPCs can communicate with each other.
  • You can’t use this connectivity pattern when there are overlapping IPv4 or IPv6 CIDR blocks in the VPCs.
  • Managing access can become challenging as the number of peered VPCs grows.

You can use VPC peering when the number of VPCs to be peered is small (for example, fewer than 10).

AWS Transit Gateway

AWS Transit Gateway can provide scalable connectivity to MSK clusters. The following diagram demonstrates how to use this service to provide connectivity to an MSK cluster. Let's again consider a VPC in Account A running an MSK cluster, with an Apache Kafka client in a remote VPC in Account B looking to connect to this MSK cluster. You set up AWS Transit Gateway to connect these VPCs and use route tables on the transit gateway to control the routing.

To extend this architecture to support access from a VPC in another Region, you need to use another transit gateway because this service can’t span Regions. In other words, for the Apache Kafka client in Account C in us-west-2 to connect to the MSK cluster, you need to peer another transit gateway in us-west-2 with the transit gateway in us-east-1 and work with the route tables to manage access to the MSK cluster. If you need to connect another account in us-west-2, you don’t need an additional transit gateway. The Apache Kafka clients in the new account (Account D) simply require a connection to the existing transit gateway in us-west-2 and the appropriate route tables.

The hub and spoke model for AWS Transit Gateway simplifies management at scale because VPCs only need to connect to one transit gateway per Region to gain access to the MSK cluster in the attached VPCs. However, this setup has some drawbacks:

  • Unlike VPC peering, in which you only pay for data transfer, AWS Transit Gateway has an hourly charge per attachment in addition to the data transfer fee.
  • This connectivity pattern doesn't support transitive routing.
  • Latency is higher than with VPC peering, because the transit gateway adds an extra network hop between VPCs.
  • The maximum bandwidth (burst) per Availability Zone per VPC connection is 50 Gbps.

You can use AWS Transit Gateway when you need to provide scalable access to the MSK cluster.

AWS PrivateLink

To provide private, unidirectional access from an Apache Kafka client to an MSK cluster across VPCs, you can use AWS PrivateLink. This also eliminates the need to expose the entire VPC or subnet and prevents issues like having to deal with overlapping CIDR blocks between the VPC that hosts the MSK cluster ENIs and the remote Apache Kafka client VPC.

Let’s do a quick recap of the architecture as explained in blog post How Goldman Sachs builds cross-account connectivity to their Amazon MSK clusters with AWS PrivateLink.

Let’s assume Account A has a VPC with three private subnets and an MSK cluster with three broker nodes in a 3-AZ deployment. Each broker node is represented by an ENI in one of the subnets, and each ENI gets a private IPv4 address from its subnet’s CIDR block and an MSK broker DNS endpoint. To expose the MSK cluster in Account A to other accounts via AWS PrivateLink, you have to create a VPC endpoint service in Account A. The VPC endpoint service requires the exposed entity, in this case the MSK cluster, to be fronted by a Network Load Balancer (NLB).

You can choose from two patterns using AWS PrivateLink to provide cross-account access to Amazon MSK: with a single NLB or multiple NLBs.

AWS PrivateLink connectivity pattern with a single NLB

The following diagram illustrates access to an MSK cluster via an AWS PrivateLink connectivity pattern with a single NLB.

In this pattern, you have a single dedicated internal NLB in Account A. The NLB has a separate listener for each MSK broker. Because this pattern has a single NLB endpoint, each listener needs to listen on a unique port. In the preceding diagram, the ports are depicted as 8443, 8444, and 8445. Correspondingly, each listener has a unique target group, each of which has a single registered target: the IP address of an MSK broker ENI. Because these ports differ from the advertised listeners defined in the MSK cluster for each of the broker nodes, the advertised listeners configuration for each broker node in the cluster must be updated. Additionally, one target group has all the broker ENI IPs as targets and a corresponding listener (on port 9094), which means a request coming to the NLB on port 9094 can be routed to any of the MSK brokers.
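The listener and target-group layout described above can be expressed as a short Python sketch; the broker ENI IP addresses are placeholders:

```python
def single_nlb_plan(broker_ips, base_port=8443, shared_port=9094):
    """Listener/target-group layout for the single-NLB pattern: one unique
    port (and advertised listener) per broker, plus a shared listener on
    9094 whose target group contains every broker."""
    plan = {"listeners": [], "advertised_listeners": {}}
    for i, ip in enumerate(broker_ips):
        port = base_port + i
        plan["listeners"].append({"port": port, "targets": [ip]})
        plan["advertised_listeners"][ip] = port  # broker must advertise this port
    plan["listeners"].append({"port": shared_port, "targets": list(broker_ips)})
    return plan
```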

In Account B, you need to create a corresponding VPC endpoint for the VPC endpoint service in Account A. Apache Kafka clients in Account B can then connect to the MSK cluster in Account A by directing their requests to the VPC endpoint. For Transport Layer Security (TLS) to work, you also need an Amazon Route 53 private hosted zone with the domain name kafka.<region of the amazon msk cluster>.amazonaws.com, with alias resource record sets for each of the broker endpoints pointing to the VPC endpoint in Account B.

In this pattern, for Apache Kafka clients local to the VPC with the Amazon MSK broker ENIs in Account A to connect to the MSK cluster, you need to set up a Route 53 private hosted zone, similar to Account B's, with alias resource record sets for each of the broker endpoints pointing to the NLB endpoint. This is because the ports in the advertised.listeners configuration have been changed for the brokers, so the default Amazon MSK broker endpoints won't work.
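The alias record sets in both private hosted zones share the same shape: every broker DNS name resolves to a single target (the VPC endpoint in Account B, or the NLB in Account A). A hedged sketch, with placeholder DNS names:

```python
def alias_records(broker_endpoints, target_dns):
    """One alias record set per broker, all pointing at the same target."""
    return [{"Name": broker, "Type": "A", "AliasTarget": target_dns}
            for broker in broker_endpoints]
```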

To extend this connectivity pattern and provide access to Apache Kafka clients in a remote Region, you need to create a peering connection (which can be via VPC peering or AWS Transit Gateway) between the VPC in Account B and the VPC in the remote Region. The same networking principles apply to make sure only specific intended resources are reachable.

AWS PrivateLink connectivity pattern with multiple NLBs

In the second pattern, you don’t share one VPC endpoint service or NLB across multiple MSK brokers. Instead, you have an independent set for each broker. Each NLB has only one listener, listening on the same port (9094) for requests to its Amazon MSK broker. Correspondingly, you have a separate VPC endpoint service for each NLB and each broker. Just like in the first pattern, in Account B you need a Route 53 private hosted zone to alias the broker DNS endpoints to VPC endpoints; in this case, each is aliased to its own specific VPC endpoint.

This pattern has the advantage of not having to modify the advertised listeners configuration in the MSK cluster. However, there is an additional cost of deploying more NLBs, one for each broker. Furthermore, in this pattern, Apache Kafka clients that are local to the VPC with the MSK broker ENIs in Account A can connect to the cluster as usual with no additional setup needed. The following diagram illustrates this setup.
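For contrast with the single-NLB layout, here is a sketch of the per-broker layout in this pattern; the NLB and endpoint-service names are placeholders:

```python
def multi_nlb_plan(brokers, port=9094):
    """One NLB and one VPC endpoint service per broker; every listener
    uses the default port, so advertised.listeners stays untouched."""
    return [{"broker": b, "nlb": f"nlb-{i}", "endpoint_service": f"svc-{i}",
             "listener_port": port}
            for i, b in enumerate(brokers, start=1)]
```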

To extend this connectivity pattern and provide access to Apache Kafka clients in a remote Region, you need to create a peering connection between the VPC in Account B and the VPC in the remote Region.

You can use the sample code provided on GitHub to set up the AWS PrivateLink connectivity pattern with multiple NLBs for an MSK cluster. The intent of the code is to automate the creation of multiple resources instead of wiring it manually.

These patterns have the following benefits:

  • They are scalable solutions and do not limit the number of consumer VPCs.
  • AWS PrivateLink allows for VPC CIDR ranges to overlap.
  • You don’t need path definitions or a route table (access is only to the MSK cluster), so it’s easier to manage.

 The drawbacks are as follows:

  • The VPC endpoint and service must be in the same Region.
  • The VPC endpoints support IPv4 traffic only.
  • The endpoints can’t be transferred from one VPC to another.

You can use either connectivity pattern when you need your solution to scale to a large number of Amazon VPCs that can consume each service. You can also use either pattern when the cluster and client VPCs have overlapping IP addresses and when you want to restrict access to only the MSK cluster instead of the VPC itself. The single-NLB pattern adds complexity to the architecture, because you need to maintain an additional target group and listener that has all brokers registered, as well as keep the advertised.listeners property up to date. The multiple-NLB pattern avoids that complexity, but at an additional cost for the increased number of NLBs.

Conclusion

In this post, we explored different secure connectivity patterns to access an MSK cluster from a remote VPC. We also discussed the advantages, challenges, and limitations of each connectivity pattern. You can use this post as guidance to help you identify an appropriate connectivity pattern to address your requirements for accessing an MSK cluster. You can also use a combination of connectivity patterns to address your use case.

References

To read more about the solutions that inspired this post, see How Goldman Sachs builds cross-account connectivity to their Amazon MSK clusters with AWS PrivateLink and the webinar Cross-Account Connectivity Options for Amazon MSK.


About the Authors

Dr. Sam Mokhtari is a Senior Solutions Architect in AWS. His main area of depth is data and analytics, and he has published more than 30 influential articles in this field. He is also a respected data and analytics advisor who led several large-scale implementation projects across different industries including energy, health, telecom, and transport.


Pooja Chikkala is a Solutions Architect in AWS. Big data and analytics is her area of interest. She has 13 years of experience leading large-scale engineering projects with expertise in designing and managing both on-premises and cloud-based infrastructures.


Rajeev Chakrabarti is a Principal Developer Advocate with the Amazon MSK team. He has worked for many years in the big data and data streaming space. Before joining the Amazon MSK team, he was a Streaming Specialist SA helping customers build streaming pipelines.


Imtiaz (Taz) Sayed is the WW Tech Leader for Analytics at AWS. He enjoys engaging with the community on all things data and analytics.


Field Notes: Deliver Messages Using an IoT Rule Action to Amazon Managed Streaming for Apache Kafka

Post Syndicated from Siddhesh Keluskar original https://aws.amazon.com/blogs/architecture/field-notes-deliver-messages-using-an-iot-rule-action-to-amazon-managed-streaming-for-apache-kafka/

With IoT devices scaling up rapidly, real-time data integration and data processing has become a major challenge. This is why customers often choose Message Queuing Telemetry Transport (MQTT) for message ingestion, and Apache Kafka to build a real-time streaming data pipeline. AWS IoT Core now supports a new IoT rule action to deliver messages from your devices directly to your Amazon MSK or self-managed Apache Kafka clusters for data analysis and visualization, without you having to write a single line of code.

In this post, you learn how to set up a real-time streaming data pipeline for IoT data using an AWS IoT Core rule and Amazon Managed Streaming for Apache Kafka. The audience for this post is architects and developers creating solutions to ingest sensor data and high-volume, high-frequency streaming data, and to process it using a Kafka cluster. This post also describes the SASL_SSL (username and password) method of accessing your Kafka cluster.

Overview of solution

Figure 1 represents an IoT data ingestion pipeline where multiple IoT devices connect to AWS IoT Core. These devices can send messages to AWS IoT Core over MQTT or HTTPS protocol. AWS IoT Core rule for Kafka is configured to intercept messages from the desired topic and route them to the Apache Kafka cluster. These messages can then be received by multiple consumers connected to the Kafka cluster. In this post, we will use AWS Python SDK to represent IoT devices and publish messages.

Figure 1 – Architecture representing an IoT ingestion pipeline
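The rule that routes MQTT messages to the Kafka cluster is ultimately a topic-rule payload with a `kafka` action. The following Python sketch builds such a payload in the shape accepted by the AWS IoT `CreateTopicRule` API; the destination ARN, bootstrap address, and topic names are placeholders, and the SCRAM credential properties (normally injected from Secrets Manager) are omitted for brevity:

```python
def kafka_rule_payload(sql, destination_arn, bootstrap_servers, kafka_topic):
    """Topic-rule payload with a Kafka action (SASL_SSL / SCRAM-SHA-512)."""
    return {
        "sql": sql,
        "actions": [{
            "kafka": {
                "destinationArn": destination_arn,  # VPC rule destination
                "topic": kafka_topic,
                "clientProperties": {
                    "bootstrap.servers": bootstrap_servers,
                    "security.protocol": "SASL_SSL",
                    "sasl.mechanism": "SCRAM-SHA-512",
                },
            }
        }],
    }

# Placeholder values, for illustration only
payload = kafka_rule_payload(
    "SELECT * FROM 'sensors/+'",
    "arn:aws:iot:us-east-1:123456789012:ruledestination/vpc/example",
    "b-1.example.kafka.us-east-1.amazonaws.com:9096",
    "iot-events",
)
```

Note that 9096 is the port Amazon MSK uses for SASL/SCRAM connections.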

Prerequisites

Walkthrough

I will show you how to stream AWS IoT data to an Amazon MSK cluster using AWS IoT Core rules and the SASL_SSL SCRAM-SHA-512 authentication mechanism. Following are the steps for this walkthrough:

  1. Create an Apache Kafka cluster using Amazon MSK.
  2. Configure an Apache Kafka cluster for SASL_SSL authentication.
  3. Set up a Kafka producer and consumer on AWS Cloud9 to test the setup.
  4. Configure an IoT Rule action to send a message to Kafka.

1. Create an Apache Kafka cluster using Amazon MSK

  • The first step is to create an Apache Kafka cluster. Open the service page for Amazon MSK by signing in to your AWS account.
  • Choose Create Cluster, and select Custom Create. AWS IoT Core supports SSL and SASL_SSL based authentication for Amazon MSK. We are using custom settings to configure these authentication methods.
Figure 2 – Screenshot showing how to create an MSK cluster.

  • Assign a cluster name, and select an Apache Kafka version of your choice; for this walkthrough, we are using 2.6.1.
  • Keep the configuration as Amazon MSK default configuration. Choose your Networking components: VPC, number of Availability Zones (a minimum of two is required for high availability), and subnets.
  • Choose SASL/SCRAM authentication (default selection is None).

Use the encryption settings as shown in the following screenshot:

Figure 3 – Screenshot showing Encryption Settings

  • Keep the monitoring settings as Basic Monitoring, and Choose Create Cluster.
  • It takes approximately 15–20 minutes for the cluster to be created.

2. Configure an Apache Kafka cluster for SASL_SSL authentication

  • When the Apache Kafka cluster is available, we must then configure authentication for producers and consumers.
  • Open AWS Secrets Manager, choose Store a new secret, and then choose Other type of secrets.
  • Enter username and password as two keys, and assign the username and password values of your choice.

Figure 4 – Screenshot showing how to store a new secret

  • Next, select Add new key link.
  • Note: Do not select DefaultEncryptionKey! A secret created with the default key cannot be used with an Amazon MSK cluster. Only a Customer managed key can be used as an encryption key for an Amazon MSK–compatible secret.
  • To add a new key, select Create key, select Symmetric key, and choose Next.
  • Type an Alias, and choose Next.
  • Select appropriate users as Key administrators, and choose Next.
  • Review the configuration, and select Finish.

Figure 5 – Select the newly-created Customer Managed Key as the encryption key

 


Figure 6 – Specify the key value pair to be stored in this secret

  • Select the newly-created Customer Managed Key as the encryption key, and choose Next.
  • Provide a Secret name (the name must start with AmazonMSK_ for the Amazon MSK cluster to recognize it), for example, AmazonMSK_SECRET_NAME.
  • Choose Next twice, and then choose Store.

Figure 7 – Storing a new secret

  • Open the Amazon MSK service page, and select your Amazon MSK cluster. Choose Associate Secrets, and then select Choose secrets (this will only be available after the cluster is created and in Active Status).
  • Choose the secret we created in the previous step, and choose Associate secrets. Only secrets with names starting with AmazonMSK_ will be visible.
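The console steps in this section can also be scripted. The following is a hedged Boto3 sketch, not part of the original walkthrough: it creates a customer managed KMS key, stores the SCRAM secret, and associates it with the cluster through the batch_associate_scram_secret API. The cluster ARN, secret name, and credentials are placeholders you would supply.

```python
def is_msk_secret_name(name):
    """Amazon MSK only recognizes SCRAM secrets whose names start with AmazonMSK_."""
    return name.startswith("AmazonMSK_")

def create_and_associate_secret(cluster_arn, secret_name, username, password):
    """Create a customer managed KMS key, store the secret, and attach it to the
    MSK cluster. Requires AWS credentials and the appropriate permissions."""
    import json
    import boto3

    if not is_msk_secret_name(secret_name):
        raise ValueError("Secret name must start with AmazonMSK_")

    # A customer managed key is required; the default Secrets Manager key
    # cannot be used with Amazon MSK.
    key = boto3.client("kms").create_key(Description="MSK SCRAM secret key")
    secret = boto3.client("secretsmanager").create_secret(
        Name=secret_name,
        KmsKeyId=key["KeyMetadata"]["KeyId"],
        SecretString=json.dumps({"username": username, "password": password}),
    )
    boto3.client("kafka").batch_associate_scram_secret(
        ClusterArn=cluster_arn, SecretArnList=[secret["ARN"]]
    )
```

As in the console flow, the association only succeeds once the cluster is in Active status.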

3. Set up Kafka producer and consumer on AWS Cloud9 to test the setup

  • To test that the cluster and authentication are correctly set up, we use the Kafka command line tools on an AWS Cloud9 IDE.
  • Choose Create environment, and follow the console to create a new AWS Cloud9 environment. You can also use an existing AWS Cloud9 environment, including one with a Kafka consumer and producer already configured.
  • This walkthrough requires Java 8 or later.
  • Verify your version of Java with the command: java -version. Next, add your AWS Cloud9 instance Security Group to the inbound rules of your Kafka cluster.
  • Open the Amazon MSK page and select your cluster, then choose Security groups applied.

Figure 8 – Selecting Security Groups Applied

  • Next, choose Inbound rules, and then choose Edit inbound rules.
  • Choose Add rule, and add Custom TCP ports 2181 and 9096 with the Security Group of your AWS Cloud9 instance as the source.

Figure 9 – Screenshot showing rules applied

  • The Security Group for your AWS Cloud9 can be found in the Environment details section of your AWS Cloud9 instance.

Figure 10 – Screenshot showing Edit Inbound Rules added

  • Use Port range values as per the client information section of your Bootstrap server and Zookeeper connection.

Figure 11 – Screenshot showing where to access ‘View client information’

 


Figure 12 – Screenshot showing client integration information

Invoke the following commands on AWS Cloud9 console to download and extract Kafka CLI tools:

wget https://archive.apache.org/dist/kafka/2.2.1/kafka_2.12-2.2.1.tgz
tar -xzf kafka_2.12-2.2.1.tgz
cd kafka_2.12-2.2.1/
mkdir client && cd client 

Next, create a file named users_jaas.conf, and add the username and password that you added in Secrets Manager:

sudo nano users_jaas.conf

Paste the following configuration and save. Verify that the username and password are the same as those saved in Secrets Manager.

KafkaClient {
   org.apache.kafka.common.security.scram.ScramLoginModule required
   username="hello"
   password="world";
};

Invoke the following commands:

export KAFKA_OPTS=-Djava.security.auth.login.config=$PWD/users_jaas.conf

Create a new file with name client_sasl.properties.

sudo nano client_sasl.properties

Copy the following content to file:

security.protocol=SASL_SSL
sasl.mechanism=SCRAM-SHA-512
ssl.truststore.location=<path-to-keystore-file>/kafka.client.truststore.jks

<path-to-keystore-file> can be retrieved by running the following command:

cd ~/environment/kafka_2.12-2.2.1/client
echo $PWD

Next, copy the cacerts file from your Java lib folder to client folder. The path of Java lib folder might be different based on your version of Java.

cd ~/environment/kafka_2.12-2.2.1/client
cp /usr/lib/jvm/java-11-amazon-corretto.x86_64/lib/security/cacerts kafka.client.truststore.jks  

Figure 13 – Screenshot showing client integration information

Save the previous endpoints as BOOTSTRAP_SERVER and ZOOKEEPER_STRING.

export BOOTSTRAP_SERVER=b-2.iot-demo-cluster.slu5to.c13.kafka.us-east-1.amazonaws.com:9096,b-1.iot-demo-cluster.slu5to.c13.kafka.us-east-1.amazonaws.com:9096
export ZOOKEEPER_STRING=z-1.iot-demo-cluster.slu5to.c13.kafka.us-east-1.amazonaws.com:2181,z-3.iot-demo-cluster.slu5to.c13.kafka.us-east-1.amazonaws.com:2181,z-2.iot-demo-cluster.slu5to.c13.kafka.us-east-1.amazonaws.com:2181

Save the Topic name in an environment variable.

TOPIC="AWSKafkaTutorialTopic"

  • Next, create a new Topic using the Zookeeper String.
cd ~/environment/kafka_2.12-2.2.1
bin/kafka-topics.sh --create --zookeeper $ZOOKEEPER_STRING --replication-factor 2 --partitions 1 --topic $TOPIC 
  • Confirm that you receive the message: Created topic AWSKafkaTutorialTopic.
  • Start Kafka producer by running this command in your Kafka folder:
cd ~/environment/kafka_2.12-2.2.1

bin/kafka-console-producer.sh --broker-list $BOOTSTRAP_SERVER --topic $TOPIC --producer.config client/client_sasl.properties
  • Next, open a new Terminal by pressing the + button, and initiate the following commands to configure the environment variables:
export BOOTSTRAP_SERVER=b-2.iot-demo-cluster.slu5to.c13.kafka.us-east-1.amazonaws.com:9096,b-1.iot-demo-cluster.slu5to.c13.kafka.us-east-1.amazonaws.com:9096
TOPIC="AWSKafkaTutorialTopic"

cd ~/environment/kafka_2.12-2.2.1/client
export KAFKA_OPTS=-Djava.security.auth.login.config=$PWD/users_jaas.conf

cd ~/environment/kafka_2.12-2.2.1/
bin/kafka-console-consumer.sh --bootstrap-server $BOOTSTRAP_SERVER --topic $TOPIC --from-beginning --consumer.config client/client_sasl.properties
  • Now that you have a Kafka consumer and producer opened side-by-side, you can type in the producer terminal and verify the messages from the consumer terminal.

Figure 14 – Screenshot showing Kafka consumer and producer opened side-by-side
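If you prefer to verify connectivity from Python instead of the console tools, a kafka-python client can use the same SASL/SCRAM settings. This is a sketch under stated assumptions: kafka-python expects a PEM CA file for ssl_cafile rather than the JKS truststore used above by the Java console tools, so you would point it at a PEM bundle of the Amazon trust CAs.

```python
def scram_config(username, password, cafile):
    """SASL_SSL/SCRAM settings shared by kafka-python producers and consumers.

    Note: unlike the Java console tools, kafka-python needs a PEM CA file
    here, not the kafka.client.truststore.jks built earlier.
    """
    return {
        "security_protocol": "SASL_SSL",
        "sasl_mechanism": "SCRAM-SHA-512",
        "sasl_plain_username": username,
        "sasl_plain_password": password,
        "ssl_cafile": cafile,
    }

def send_test_message(bootstrap_servers, topic, username, password, cafile):
    """Produce one message, mirroring the console producer.

    Requires `pip install kafka-python` and network access to the cluster.
    """
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers=bootstrap_servers,
        **scram_config(username, password, cafile),
    )
    producer.send(topic, b"hello from python")
    producer.flush()
```

The username and password must match the secret associated with the cluster in step 2.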

4. Configure an IoT Rule action to send a message to Kafka

  • Create an AWS Identity and Access Management (IAM) role with Secrets Manager permissions to allow the IoT rule to access the Kafka KeyStore in AWS Secrets Manager.
  • Sign in to IAM, select Policies from the left-side panel, choose Create policy.
  • Select Choose a service, and search for AWS KMS.
  • In Actions, choose All AWS KMS actions. Select All resources in the Resources section, and choose Next.
  • Name the policy KMSfullAccess, and choose Create policy.
  • Select Roles from the left-side panel, choose Create Role, then select EC2 from Choose a use case, and choose Next:Permissions.
  • Assign the policy SecretsManagerReadWrite. Note: if you do not select EC2, SecretsManager Policy will be unavailable.
  • Search for and select SecretsManagerReadWrite and KMSfullAccess Policy.

Add tags, type Role name as kafkaSASLRole, and choose Create Role.

  • After the Role is created, search the newly-created Role name to view the Summary of the role.
  • Choose the Trust relationships tab, and choose Edit trust relationship.

Enter the following trust relationship:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "iot.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
  • Choose Update Trust Policy.
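As a sketch, the policy and role steps above can also be automated with Boto3. The role name kafkaSASLRole, the trust relationship, and the managed SecretsManagerReadWrite policy come from this section; the helper names are ours, and running the second function requires AWS credentials.

```python
import json

def iot_trust_policy():
    """Trust relationship allowing AWS IoT to assume the role, as entered above."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "iot.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }

def create_kafka_rule_role(role_name="kafkaSASLRole"):
    """Create the role with the IoT trust policy and attach the
    AWS managed SecretsManagerReadWrite policy."""
    import boto3

    iam = boto3.client("iam")
    role = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(iot_trust_policy()),
    )
    iam.attach_role_policy(
        RoleName=role_name,
        PolicyArn="arn:aws:iam::aws:policy/SecretsManagerReadWrite",
    )
    return role["Role"]["Arn"]
```

The KMS policy created above would be attached the same way using its ARN.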
  • Next, create a new AWS IoT Core rule by signing in to the AWS IoT Core service. Choose Act from the left side-menu, and select Rules.
  • Choose Create. Insert details for Name, Description, and Rule query statement, and then choose Add action. The following query is used for this post:
  • SELECT * FROM 'iot/topic'
  • Select Send a message to an Apache Kafka cluster. Next, choose Configure action.

Figure 15 – Screenshot to create a rule

 

Create a VPC destination (if you do not already have one).

Figure 16 – How to Create a VPC destination

  • Select the VPC ID of your Kafka cluster. Select a Security Group with access to the Kafka cluster's Security Group.
  • Choose the security group settings of the EC2 instance we created, or the security group of the Kafka cluster.
  • Choose Create Role, and then select Create Destination. It takes approximately 5–10 minutes for the Destination to be Enabled. After the status is Enabled, navigate back to the Rule creation page and select the VPC Destination.
  • Enter AWSKafkaTutorialTopic as Kafka topic (confirm there is no extra space after the topic name, or you will get an error). Do not update Key and Partition boxes.


    Figure 17 – Screenshot showing how to enter the AWSKafkaTutorialTopic

  • Verify the Security Group of your VPC destination is added to the inbound list for your Kafka cluster.

Figure 18 – Showing Security Group for Kafka Cluster

 


Figure 19 – Screenshot showing Inbound rules

The first two Custom TCP entries are for AWS Cloud9 security group. The last two entries are for VPC endpoint.

Set the Client properties as follows:

bootstrap.servers = The TLS bootstrap string for the Kafka cluster

security.protocol = SASL_SSL

ssl.truststore = EMPTY for Amazon MSK; enter the SecretBinary template for self-managed Kafka

ssl.truststore.password = EMPTY for Amazon MSK; enter the truststore password for self-managed Kafka

sasl.mechanism = SCRAM-SHA-512

  • Replace the secret name with your stored secret name starting with AmazonMSK_, replace the IAM role ARN with your IAM role ARN.
  • The secret and IAM role were created in previous steps of this post. Enter the following template in the sasl.scram.username field to retrieve the username from Secrets Manager.
${get_secret('AmazonMSK_cluster_secret','SecretString','username','arn:aws:iam::318219976534:role/kafkaSASLRole')}

Perform a similar step for sasl.scram.password field:

${get_secret('AmazonMSK_cluster_secret','SecretString','password','arn:aws:iam::318219976534:role/kafkaSASLRole')}
  • Choose Add action.
  • Choose Create rule.

Testing the data pipeline

  • Open MQTT test client from AWS IoT Core page.
  • Publish the message to the MQTT topic that you configured while creating the rule.
  • Keep the consumer session active (created in an earlier step). You will see the data published on the MQTT topic being streamed to the Kafka consumer.

Figure 20 – Screenshot showing testing the data pipeline

Common troubleshooting checks

Confirm that your:

  1. AWS Cloud9 Security Group is added to Amazon MSK Security Group Inbound rule
  2. VPC endpoint Security Group is added to Amazon MSK Security Group Inbound rule
  3. Topic is created in the Kafka cluster
  4. IAM role has Secrets Manager and KMS permissions
  5. Environment variables are correctly configured in terminal
  6. Folder paths have been correctly followed

Cleaning up

To avoid incurring future charges, delete the following resources:

  • Amazon MSK cluster
  • AWS IoT Core rule
  • IAM role
  • Secrets Manager Secret
  • AWS Cloud9 instance

Conclusion

In this post, I showed you how to configure an IoT Rule Action to deliver messages to Apache Kafka cluster using AWS IoT Core and Amazon MSK. You can now build a real-time streaming data pipeline by securely delivering MQTT messages to a highly-scalable, durable, and reliable system using Apache Kafka.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

 

Increase Apache Kafka’s resiliency with a multi-Region deployment and MirrorMaker 2

Post Syndicated from Anusha Dharmalingam original https://aws.amazon.com/blogs/big-data/increase-apache-kafkas-resiliency-with-a-multi-region-deployment-and-mirrormaker-2/

Customers create business continuity plans and disaster recovery (DR) strategies to maximize resiliency for their applications, because downtime or data loss can result in losing revenue or halting operations. Ultimately, DR planning is all about enabling the business to continue running despite a Regional outage. This post explains how to make Apache Kafka resilient to issues that span more than a single Availability Zone using a multi-Region Apache Kafka architecture. We use Apache Kafka deployed as Amazon Managed Streaming for Apache Kafka (Amazon MSK) clusters in this example, but the same architecture also applies to self-managed Apache Kafka.

Amazon MSK is a fully managed service that makes it easy for you to build and run Apache Kafka to process streaming data. Amazon MSK provides high availability by offering Multi-AZ configurations to distribute brokers across multiple Availability Zones within an AWS Region. A single MSK cluster deployment provides message durability through intra-cluster data replication. Data replication with a replication factor of 3 and a min-ISR value of 2, along with the producer setting acks=all, provides the strongest availability guarantees, because it ensures that other brokers in the cluster acknowledge receiving the data before the leader broker responds to the producer. This design provides robust protection against single broker failure as well as Single-AZ failure. However, if an unlikely issue were impacting your applications or infrastructure across more than one Availability Zone, the architecture outlined in this post can help you prepare, respond, and recover from it.
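The durability settings described above (replication factor 3, min.insync.replicas of 2, and acks=all on the producer) can be sketched with a kafka-python admin client. This is an illustrative sketch: the topic name and partition count are placeholders, and creating the topic requires network access to a cluster.

```python
def durable_topic_settings():
    """Replication factor 3 with min.insync.replicas=2, per the guarantees above."""
    return {
        "replication_factor": 3,
        "topic_configs": {"min.insync.replicas": "2"},
    }

def create_durable_topic(bootstrap_servers, name, partitions=3):
    """Create a topic with the settings above (requires kafka-python and a cluster)."""
    from kafka.admin import KafkaAdminClient, NewTopic

    settings = durable_topic_settings()
    admin = KafkaAdminClient(bootstrap_servers=bootstrap_servers)
    admin.create_topics([
        NewTopic(
            name=name,
            num_partitions=partitions,
            replication_factor=settings["replication_factor"],
            topic_configs=settings["topic_configs"],
        )
    ])
    # Producers should also set acks="all" so that all in-sync replicas
    # acknowledge a write before the leader responds.
```

With min.insync.replicas=2 and acks=all, a write succeeds only after at least two replicas have it, so losing any single broker or Availability Zone does not lose acknowledged data.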

For companies that can withstand a longer time to recover (Recovery Time Objective, RTO) but are sensitive to data loss on Amazon MSK (Recovery Point Objective, RPO), backing up data to Amazon Simple Storage Service (Amazon S3) and recovering the data from Amazon S3 is sufficient as a DR plan. However, most streaming use cases rely on the availability of the MSK cluster itself for your business continuity plan, and you may want a lower RTO as well. In these cases, setting up MSK clusters in multiple Regions and configuring them to replicate data from one cluster to another provides the required business resilience and continuity.

MirrorMaker

MirrorMaker is a utility bundled as part of Apache Kafka, which helps replicate the data between two Kafka clusters. MirrorMaker is essentially a Kafka high-level consumer and producer pair, efficiently moving data from the source cluster to the destination cluster. Use cases for MirrorMaker include aggregating data to a central cluster for analytics, isolating data based on use case, geo-proximity, migrating data from one Kafka cluster to another, and for highly resilient deployments.

In this post, we use MirrorMaker v2 (MM2), which is available as part of Apache Kafka version 2.4 onwards, because it enables us to sync topic properties and also sync offset mappings across clusters. This feature helps us migrate consumers from one cluster to another because the offsets are synced across clusters.

Solution overview

In this post, we dive into the details of how to configure Amazon MSK with cross-Region replication for the DR process. The following diagram illustrates our architecture.

We create two MSK clusters across the primary and secondary Regions (mapping to your chosen Regions), with the primary being active and the secondary being passive. We can also extend this solution to an active-active setup. Our Kafka clients interact with the primary Region’s MSK cluster. The Kafka Connect cluster is deployed in the secondary Region and hosts the MirrorMaker connectors responsible for replication.

We go through the following steps to show the end-to-end process of setting up the deployment, failing over the clients if a Regional outage occurs, and failing back after the outage:

  1. Set up an MSK cluster in the primary Region.
  2. Set up an MSK cluster in the secondary Region.
  3. Set up connectivity between the two MSK clusters.
  4. Deploy Kafka Connect as containers using AWS Fargate.
  5. Deploy MirrorMaker connectors on the Kafka Connect cluster.
  6. Confirm data is replicated from one Region to another.
  7. Fail over clients to the secondary Region.
  8. Fail back clients to the primary Region.

Step 1: Set up an MSK cluster in the primary Region

To set up an MSK cluster in your primary Region, complete the following steps:

  1. Create an Amazon Virtual Private Cloud (Amazon VPC) in the Region where you want to have your primary MSK cluster.
  2. Create three (or at least two) subnets in the VPC.
  3. Create an MSK cluster using the AWS Command Line Interface (AWS CLI) or the AWS Management Console.

For this post, we use the console. For instructions, see Creating an Amazon MSK Cluster.

  1. Choose the Kafka version as 2.7 or higher.
  2. Pick the broker instance type based on your use case and configuration needs.
  3. Choose the VPC and subnets created to make sure the brokers in your MSK clusters are spread across multiple Availability Zones.
  4. For Encrypt Data in transit, choose TLS encryption between brokers and between client and brokers.
  5. For Authentication, you can choose IAM access control, TLS-based authentication, or username/password authentication.

We use SASL/SCRAM (Simple Authentication and Security Layer/Salted Challenge Response Authentication Mechanism) authentication to authenticate Apache Kafka clients using usernames and passwords for clusters secured by AWS Secrets Manager. AWS has since launched IAM Access Control, which could be used as the authentication mechanism for this solution. For more information about IAM Access Control, see Securing Apache Kafka is easy and familiar with IAM Access Control for Amazon MSK.

  1. Create the secret in Secrets Manager and associate it to the MSK cluster. For instructions, see Username and password authentication with AWS Secrets Manager.

Make sure the secrets are encrypted with a customer managed key via AWS Key Management Service (AWS KMS).

Step 2: Set up an MSK cluster in the secondary Region

To set up an MSK cluster in our secondary Region, complete the following steps:

  1. Create an MSK cluster in another Region with similar configuration to the first.
  2. Make sure the number of brokers and instance type match what was configured in the primary.

This makes sure the secondary cluster has the same capacity and performance metrics as the primary cluster.

  1. For Encrypt Data in transit, choose TLS encryption between brokers and between client and brokers.
  2. For Authentication, choose the same authentication mechanism as with the cluster in the primary Region.
  3. Create a secret in Secrets Manager and secure with a customer managed KMS key in the Region of the MSK cluster.

Step 3: Set up connectivity between the two MSK clusters

For data to replicate between the two MSK clusters, you need to allow the clusters in different VPCs to communicate with each other, where the VPCs are within the same or different AWS accounts, or the same or different Regions. You have options such as VPC peering for resources in either VPC to communicate with each other as if they’re within the same network.

For more information about access options, see Accessing an Amazon MSK Cluster.

VPC peering is more suited for environments that have a high degree of trust between the parties that are peering their VPCs. This is because, after a VPC peering connection is established, the resources in either VPC can initiate a connection. You’re responsible for implementing fine-grained network access controls with security groups to make sure that only specific resources intended to be reachable are accessible between the peered VPCs. For our data replication use case, we assume that the two VPCs are trusted and therefore we can use VPC peering connectivity to replicate data between the primary and secondary MSK clusters. For instructions on setting up VPC peering connections between two VPCs across two Regions, see Creating and accepting a VPC peering connection.

When you set up VPC peering, enable DNS resolution support. This allows you to resolve public IPv4 DNS hostnames to private IPv4 addresses when queried from instances in the peer VPC. To enable DNS resolution on VPC peering, you must have the two peering VPCs enabled for DNS hostnames and DNS resolution. This step is important for you to be able to access the MSK cluster using DNS names across the VPCs.

Step 4: Deploy Kafka Connect as containers using AWS Fargate

Kafka Connect is a scalable and reliable framework to stream data between a Kafka cluster and external systems. Connectors in Kafka Connect define where data should be copied to and from. Each connector instance coordinates a set of tasks that copy the data. Connectors and tasks are logical units of work and must be scheduled to run in a process. Kafka Connect calls these processes workers and has two types of workers: standalone and distributed.

Deploying Kafka Connect in a distributed mode provides scalability and automatic fault tolerance for the tasks that are deployed in the worker. In distributed mode, you start many worker processes using the same group ID, and they automatically coordinate to schedule running connectors and tasks across all available workers. If you add a worker, shut down a worker, or a worker fails unexpectedly, the rest of the workers detect this and automatically coordinate to redistribute connectors and tasks across the updated set of available workers.

Kafka Connect in distributed mode lends itself to be deployed as containers (workers) and scales based on the number of tasks and connectors that are being deployed on Kafka Connect.

Fargate is a serverless compute engine for containers that works with both Amazon Elastic Container Service (Amazon ECS) and Amazon Elastic Kubernetes Service (Amazon EKS). Fargate makes it easy for you to focus on building your applications. Fargate removes the need to provision and manage servers, lets you specify and pay for resources per application, and improves security through application isolation by design.

For replicating data using MirrorMaker, the pattern of remote-consume and local-produce is recommended, so in the simplest source-destination replication pair, you want to deploy your MirrorMaker connectors on the Kafka Connect cluster in your destination Region. Because data is consumed remotely and produced locally, an outage in the source Region doesn’t interrupt writes in the destination, which avoids loss of replicated data. In this step, we build and run a distributed Kafka Connect in a Fargate cluster.

The Docker container for Kafka Connect is available on GitHub. For more details on the Docker container and its content, refer to the README.md file.

  1. Clone the code from GitHub and build the code.
  2. Push the image into a repository in Amazon Elastic Container Registry (Amazon ECR).
  3. Create a Fargate cluster in your secondary Region, in the same VPC as your MSK cluster.
  4. Deploy the Fargate cluster.
  5. Deploy the Kafka Connect containers.

The task definition JSON to deploy Kafka Connect containers is available on GitHub. The JSON file refers to a Docker container that was pushed into Amazon ECR earlier.

  1. Replace the IMAGE_URL string in the JSON file with the actual image from Amazon ECR.
  2. Replace the IAM_ROLE string with the ARN of your AWS Identity and Access Management (IAM) role.

The IAM role for the Amazon ECS task should have permission to interact with MSK clusters, read secrets from Secret Manager, decrypt the KMS key used to encrypt the secret, read images from Amazon ECR, and write logs to Amazon CloudWatch.

  1. Make sure to update the following environment variables with the appropriate values in the task definition:
    1. BROKERS – The bootstrap servers connection string of the MSK cluster in the secondary Region.
    2. USERNAME – The username that was created as a secret in Secrets Manager and associated with the MSK cluster in the secondary Region.
    3. PASSWORD – The password that was created as a secret in Secrets Manager and associated with the MSK cluster in the secondary Region.
    4. GROUP – The Kafka Connect group ID to register all containers to the same group.
  2. Create a service based on the task definition and deploy at least two tasks on the Fargate cluster.
  3. Wait until the tasks are provisioned and in running status.
  4. Log in from a bastion host or any Amazon Elastic Compute Cloud (Amazon EC2) instance in the VPC that you can log in to (using SSH or AWS Systems Manager Session Manager).

You use this EC2 instance for your administration of the Kafka Connect cluster. Because you use this host to interact with MSK clusters, you have to download the Kafka binaries (version 2.7 or later).

Each running Fargate task gets its own elastic network interface (ENI) and IPV4 address, which you can use to connect to the application running on the task. You can view the ENI attachment information for tasks on the Amazon ECS console or with the DescribeTasks API operation.
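As a convenience, the ENI lookup can be scripted with Boto3 instead of the console. This is a sketch: the cluster name is a placeholder, the network calls require AWS credentials, and the parsing helper assumes the standard DescribeTasks response shape.

```python
def task_private_ips(tasks):
    """Pull private IPv4 addresses out of a DescribeTasks response body."""
    ips = []
    for task in tasks:
        for attachment in task.get("attachments", []):
            if attachment.get("type") != "ElasticNetworkInterface":
                continue
            for detail in attachment.get("details", []):
                if detail["name"] == "privateIPv4Address":
                    ips.append(detail["value"])
    return ips

def connect_worker_ips(cluster_name):
    """List the private IPs of the Kafka Connect tasks in the Fargate cluster."""
    import boto3

    ecs = boto3.client("ecs")
    arns = ecs.list_tasks(cluster=cluster_name)["taskArns"]
    tasks = ecs.describe_tasks(cluster=cluster_name, tasks=arns)["tasks"]
    return task_private_ips(tasks)
```

Any of the returned IPs can be used for the port 8083 health check shown below.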

  1. Connect to one of the Amazon ECS task IPs and check if the Kafka Connect cluster is up and running (Kafka Connect runs on port 8083):
    curl <ip-address>:8083 | jq .
    
    {
      "version": "2.7.0",
      "commit": "448719dc99a19793",
      "kafka_cluster_id": "J1xVaRK9QW-1eJq3jJvbsQ"
    }

Step 5: Deploy MirrorMaker connectors on the Kafka Connect cluster

MirrorMaker 2 is based on the Kafka Connect framework and runs based on Kafka source connectors. In the Kafka Connect configuration, a source connector reads data from any data repository and writes data into a Kafka cluster, and a sink connector reads data from a Kafka cluster and writes to any data repository.

MM2 creates remote topics, which are replicated topics that refer back to the source cluster topics using an alias. This is handled by a class called the replication policy class; a default class is provided by Apache Kafka.

For example, the following diagram shows TopicA in a source cluster with alias Primary, which gets replicated to a destination cluster with the topic name Primary.TopicA.

In a failover scenario, when you move your Kafka clients from one cluster to another, you have to modify your clients to pick from a different topic as it fails over, or you have to configure them to pick from both these topics to handle the failover scenario. For example, a consumer reading from TopicA in the primary cluster upon failover has to be modified to start reading from Primary.TopicA. Alternatively, the consumers can always be configured to read from both topics.

If you want to have the same topic name across your clusters after replication, because you want to minimize changes to your clients, you can use a custom replication policy that overrides MM2’s default behavior of creating remote topics. You can find sample code on GitHub.

For an active-active setup, you have to use Kafka’s default replication policy for creating remote topics with a prefix. Having the same topic names across clusters using a custom replication policy causes an infinite loop of replication.

In this post, we use a custom replication policy with active-passive setup, in which your Kafka clients fail over in a Regional outage scenario and fail back when the outage is over.

To run a successful MirrorMaker 2 deployment, you need several connectors:

  • MirrorSourceConnector – Responsible for replicating data from topics as well as metadata about topics and partitions. This connector reads from a cluster and writes it to the cluster on which Kafka Connect is deployed.
  • HeartBeatConnector – Emits a heartbeat that gets replicated to demonstrate connectivity across clusters. We can use the internal topic heartbeats to verify that the connector is running and the cluster where the connector is running is available.
  • CheckpointConnector – Responsible for emitting checkpoints in the secondary cluster containing offsets for each consumer group in the primary cluster. To do that, it creates an internal topic called <primary-alias>.checkpoints.internal in the secondary cluster. In addition, this connector also creates the topic mm2-offset-syncs.<primary-alias>.internal in the secondary cluster, where it stores consumer offsets translated into the ones that make sense in the other cluster. This is required so that clients failing over from the primary cluster to the secondary can read messages from the secondary cluster at the correct offset. Prior to Apache Kafka 2.7, MM2 didn’t have a mechanism to sync the offsets for individual consumer groups with the __consumer_offsets internal topic in the secondary cluster. Syncing with __consumer_offsets allows consumers to simply fail over and continue processing messages from the last offset retrieved from __consumer_offsets in the secondary cluster. Consequently, this had to be done outside of MM2 with an asynchronous process utilizing custom code. The following sample project contains code to do this translation. However, in Apache Kafka 2.7, a new feature was released that takes care of synchronizing the translated offsets directly to the __consumer_offsets topic in the secondary cluster, so that when you switch over, you can start from the last known offset. To enable this feature, you need to include the property sync.group.offsets.enabled = true in the connector configuration.

Sample connector configurations for each of these connectors are available on GitHub. The configurations contain SASL/SCRAM-related information to connect to the cluster. Make sure the number of tasks matches the number of partitions in your Kafka topics. This enables parallel processing to read multiple partitions in parallel. The configuration also uses CustomMM2ReplicationPolicy to make sure the topics are replicated with the same name across clusters. You can remove this line as long as you update the Kafka client to read from topic names with a prefix when using the MSK cluster in the secondary Region.
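A minimal sketch of deploying the MirrorSourceConnector through the Connect REST API follows. The aliases pri and sec match the internal topic names used in this post; the config shown is deliberately stripped down, and the full versions (including the SASL/SCRAM client settings and CustomMM2ReplicationPolicy) are in the GitHub samples referenced above.

```python
import json
from urllib.request import Request, urlopen

def mirror_source_config(source_bootstrap, target_bootstrap, tasks=1):
    """Minimal MirrorSourceConnector config; the GitHub samples add the
    SASL/SCRAM client settings and the custom replication policy."""
    return {
        "connector.class": "org.apache.kafka.connect.mirror.MirrorSourceConnector",
        "source.cluster.alias": "pri",
        "target.cluster.alias": "sec",
        "source.cluster.bootstrap.servers": source_bootstrap,
        "target.cluster.bootstrap.servers": target_bootstrap,
        "topics": ".*",
        "tasks.max": str(tasks),  # match the partition count for parallel reads
    }

def deploy_connector(worker_ip, name, config):
    """PUT the config to the Connect REST API on port 8083 (create or update)."""
    req = Request(
        f"http://{worker_ip}:8083/connectors/{name}/config",
        data=json.dumps(config).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    return urlopen(req).read()
```

Using PUT against /connectors/{name}/config is idempotent: it creates the connector if it doesn’t exist and updates its configuration if it does.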

To deploy these connectors on the Kafka Connect cluster, log back in to the bastion host machine that acts as your administrative console. Make sure the bastion host has an IAM role that has access to the KMS key encrypting your Secrets Manager secrets corresponding to your MSK cluster. Find the IP address of one of the containers running your Kafka Connect cluster.

For instructions on reviewing your connector configuration and deploying it on Kafka Connect, see Configure and start MirrorMaker 2 connectors.

Check the topics that are created in your primary and secondary cluster. The primary MSK cluster should have a new mm2-offset-syncs.sec.internal topic and the secondary MSK cluster should have the heartbeats and pri.checkpoints.internal topics.

Step 6: Confirm the data is replicated from one Region to another

With the connectors up and running on the Kafka Connect cluster, you should now create a topic in your primary cluster and see it replicate to the secondary cluster (with a prefix, if you used the default replication policy).

After the topic replication is configured, you can start producing data into the new topic. You can use the following sample producer code for testing, or use a Kafka console producer or your own producer. Make sure the producer supports SASL/SCRAM-based connectivity.

If you use the sample producer code, make sure to create a producer.properties file and provide the bootstrap server information of your primary cluster to the BOOTSTRAP_SERVERS_CONFIG property. Then start your producer with the following code:

java -jar KafkaClickstreamClient-1.0-SNAPSHOT.jar -t <topic-name> -pfp <properties_file_path> -nt 8 -rf 300 -sse -ssu <user name> -gsr -grn <glue schema registry name> -gar > /tmp/producer.log 2>&1 &

The GitHub project has more details on the command line parameters and how to change the rate at which messages are produced.

Confirm the messages produced in the topic in the primary cluster are all flowing to the topic in the destination cluster by checking the message count. For this post, we use kafkacat, which supports SASL/SCRAM to count the messages:

docker run -it --network=host edenhill/kafkacat:1.6.0 -b <bootstrap-servers> -X security.protocol=SASL_SSL -X sasl.mechanism=SCRAM-SHA-512 -X sasl.username=<username>  -X sasl.password=<pwd> -t <topicname> -C -e -q| wc -l

In production environments, if the message counts are large, use your traditional monitoring tool to check the message count, because tools like kafkacat take a long time to consume messages and report on a message count.

Now that you have confirmed the producer is actively producing messages to the topic and that the data is replicated, we can spin up a consumer to consume the messages from the topic. We spin up the consumer in the primary Region, because in an active-passive setup, all activities happen in the primary Region until an outage occurs and you fail over.

You can use the following sample consumer code for testing. You can also use a Kafka console consumer or your own consumer. Make sure that the consumer code can support connectivity with SASL/SCRAM.

For the sample code, make sure to create a consumer.properties file that contains the Amazon MSK bootstrap broker information of the primary cluster for the BOOTSTRAP_SERVERS_CONFIG property. Run the following code to start the consumer:

java -jar KafkaClickstreamConsumer-1.0-SNAPSHOT.jar -t <topic> -pfp <properties file path> -nt 3 -rf 10800 -sse -ssu <username> -src <primary cluster alias> -gsr -grn <glue schema registry name> > /tmp/consumer_dest.log 2>&1 &

As explained before, the consumer offsets of this consumer group get replicated to the secondary cluster and synced into its __consumer_offsets topic. It takes a few minutes for the consumer group offsets to sync from the primary to the secondary, depending on the value of sync.group.offsets.interval.seconds in the checkpoint connector configuration. You can verify with the following command:

./bin/kafka-consumer-groups.sh --bootstrap-server <bootstrap-url>  --command-config /opt/ssl-user-config.properties --describe --group mm2TestConsumer1

Make sure ssl-user-config.properties contains the connectivity information:

sasl.mechanism=SCRAM-SHA-512
# Configure SASL_SSL if SSL encryption is enabled, otherwise configure SASL_PLAINTEXT
security.protocol=SASL_SSL
sasl.jaas.config=<jaas-config>

The consumer group offsets are now synced up from the primary to the secondary cluster. This helps us fail over clients to the secondary cluster because the consumer started in the primary cluster can start consuming from the secondary cluster and read from where it left off after failover.

Step 7: Fail over clients to the secondary Region

In a DR scenario, if you need to fail clients from your cluster in the primary Region to a secondary Region, follow the steps in this section.

Start by shutting down your consumer in the primary Region. Then start the consumer in the secondary Region by updating the bootstrap server information to point to the secondary MSK cluster. Because the topic name is the same across both Regions, you don’t need to change the consumer client code. This consumer starts consuming messages from the topic even if new messages are still being produced by producers in the primary Region.
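The failover above boils down to restarting the consumer with different bootstrap brokers. A minimal sketch of the idea (the endpoint strings are placeholders, and the property names follow a typical Python Kafka client rather than the post's Java consumer):

```python
# Minimal sketch: pick the bootstrap servers for the active Region at startup.
# The endpoint values below are placeholders for your MSK bootstrap strings.
CLUSTERS = {
    "primary": "<primary-bootstrap-servers>",
    "secondary": "<secondary-bootstrap-servers>",
}

def consumer_config(active_region: str) -> dict:
    """Build SASL/SCRAM consumer properties pointing at the active cluster."""
    return {
        "bootstrap_servers": CLUSTERS[active_region],
        "security_protocol": "SASL_SSL",
        "sasl_mechanism": "SCRAM-SHA-512",
        "group_id": "mm2TestConsumer1",
        # Because the topics keep the same name across both clusters,
        # the subscribed topic does not change on failover.
    }

# Failing over means restarting the consumer with the other Region's config.
print(consumer_config("secondary")["bootstrap_servers"])
```

The same switch applies to the producer in the next step; only the endpoint changes, not the client code.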

Now you can stop the producer in the primary Region (if it isn’t already stopped due to a Regional failure) and start it in the secondary Region by updating the bootstrap server information. The consumer in the secondary Region keeps consuming messages on the topic, including the ones now being produced in the secondary Region. After the consumer and producer have failed over, you can delete the MM2 connectors on the Kafka Connect cluster using the connectors’ HTTP endpoints. See the following code:

curl -X DELETE http://<ip-address>:8083/connectors/mm2-msc

This stops all replication activities from the primary cluster to the secondary cluster. Now the MSK cluster in the primary Region is available for upgrade or any other activities.

If a Regional outage impacts Amazon MSK or Apache Kafka on Amazon EC2, it’s highly probable that the producer and consumer clients running on Amazon EC2, Amazon ECS, Amazon EKS, and AWS Lambda are also impacted. In this case, you have to stop the MM2 connectors in the DR cluster because the source cluster isn’t available to replicate from. To recover clients, start the consumers on the secondary cluster; they read messages from the topic from where they left off. Then start the producers on the secondary cluster and push new messages to the topic in the secondary cluster.

Now that you have successfully failed over from the MSK cluster in the primary Region to the secondary Region, we can see how to fail back after your MSK cluster in the primary Region is ready to be operational.

Step 8: Fail back clients to the primary Region

Check the topics in your MSK cluster in the primary Region. Depending on the activity on the secondary (now primary) cluster during the DR period, you might want to start with fresh data from your secondary cluster. Follow these steps to get your primary cluster synced up with all data required:

  1. Delete all topics (if any) except the __consumer_offsets topic in the primary cluster.
  2. Create a Kafka Connect cluster deploying Fargate containers (as we walked through earlier), with brokers pointing to the MSK cluster in the primary Region.

MirrorSourceConnector can write only to the cluster where Kafka Connect is deployed. Because we want to replicate from the secondary to the primary, we need a Kafka Connect cluster associated with the primary Region.

  3. Deploy the MirrorMaker connectors (similar to what we did earlier) using the configuration samples. This time, make sure the source and target bootstrap broker information is flipped in all configuration files.

These connectors are responsible for replicating data from the secondary back to the primary. Make sure to list the topic names containing your data in topics of the MirrorSourceConnector file. You don’t want to replicate topics created by Kafka Connect and MirrorMaker in the secondary Region, because that creates confusion.

This process starts replication activities from your MSK cluster in the secondary Region to the MSK cluster in the primary Region. This replication runs while the producers and consumers are still actively writing to and reading from the MSK cluster in the secondary Region.

  4. Wait until all the data topics and their messages are replicated from the MSK cluster in the secondary Region to the MSK cluster in the primary Region.

The time it takes depends on the number of messages in the topic.

  5. Check the number of messages in the topics on the MSK cluster in the primary Region.

When the number of messages is close to the number of messages in the MSK cluster in the secondary Region, it’s time to fail back your Kafka clients.

  6. Stop the consumers from the secondary Region one by one and move them to point to the primary cluster.

When the consumer is up and running, it should be able to continue to read the messages produced by producers still pointing to the MSK cluster in the secondary Region. When all the consumers are healthy in the primary Region, it’s time to fail back the producers as well.

  7. Stop the producers in the secondary Region’s MSK cluster and start them by pointing to the primary Region’s MSK cluster.

To enable MirrorMaker replication back from the primary to the secondary, you have to stop the MirrorMaker connectors replicating from the secondary to the primary. Because we’re using CustomMM2ReplicationPolicy, which keeps the same topic names, it’s important to have data replication flowing in only one direction, otherwise it creates a recursive loop. You have to repeat similar cleanup steps to get the replication flowing back from the primary to the secondary.
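Before failing clients back, the "message counts are close" check described above can be captured as a simple guard; the lag threshold below is purely illustrative and should be tuned to your throughput:

```python
def ready_to_fail_back(primary_count: int, secondary_count: int,
                       max_lag: int = 1000) -> bool:
    """Return True when the primary cluster has caught up to within max_lag
    messages of the secondary (illustrative threshold, not a hard rule)."""
    return secondary_count - primary_count <= max_lag

# Example: primary has replicated nearly all of the secondary's messages.
print(ready_to_fail_back(999_500, 1_000_000))  # → True
print(ready_to_fail_back(500_000, 1_000_000))  # → False
```

In practice you would feed this from your monitoring tool rather than from ad hoc message counting, for the reasons noted earlier about kafkacat on large topics.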

Using the default replication policy in MirrorMaker 2

When you use MirrorMaker 2’s default replication policy, it creates topics with a prefix, as explained earlier. This enables you to run two-way replication, because MirrorMaker 2 ignores topics with a prefix when replicating. This is convenient because you don’t have to delete and recreate the MirrorMaker connector configuration when moving from one side to the other, which makes failover and failback easier.

Make sure to update your clients to read from topics both with and without prefixes, because they can read from either cluster as part of failover and failback. In addition, if you have a use case for an active-active setup, you must choose MirrorMaker 2’s default replication policy.
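As a sketch, a client can derive both possible topic names from the remote cluster alias; the alias and topic name below are illustrative, and the Kafka client wiring is omitted:

```python
def candidate_topics(topic: str, remote_alias: str) -> list:
    """Return the local topic name plus the name it gets after replication
    with MirrorMaker 2's default policy (<remote-alias>.<topic>)."""
    return [topic, f"{remote_alias}.{topic}"]

# A consumer in the secondary Region reads both the locally produced topic
# and the copy replicated from the primary cluster (alias 'pri' assumed).
print(candidate_topics("clickstream", "pri"))  # → ['clickstream', 'pri.clickstream']
```

Subscribing to both names (or to a pattern matching them) lets the same consumer code run on either side of the replication pair.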

Conclusion

In this post, I reviewed how to set up a highly resilient deployment across Regions for an MSK cluster using MirrorMaker 2 deployed on a distributed Kafka Connect cluster in Fargate. You can use this solution to build a data redundancy capability to meet regulatory compliance, business continuity, and DR requirements. With MirrorMaker 2, you can also set up an active-active MSK cluster, enabling clients to consume from an MSK cluster that has geographical proximity.


About the Author

Anusha Dharmalingam is a Solutions Architect at Amazon Web Services, with a passion for Application Development and Big Data solutions. Anusha works with enterprise customers to help them architect, build, and scale applications to achieve their business goals.


Introducing Amazon Kinesis Data Analytics Studio – Quickly Interact with Streaming Data Using SQL, Python, or Scala

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/introducing-amazon-kinesis-data-analytics-studio-quickly-interact-with-streaming-data-using-sql-python-or-scala/

The best way to get timely insights and react quickly to new information you receive from your business and your applications is to analyze streaming data. This is data that must usually be processed sequentially and incrementally on a record-by-record basis or over sliding time windows, and can be used for a variety of analytics including correlations, aggregations, filtering, and sampling.

To make it easier to analyze streaming data, today we are pleased to introduce Amazon Kinesis Data Analytics Studio.

Now, from the Amazon Kinesis console you can select a Kinesis data stream and with a single click start a Kinesis Data Analytics Studio notebook powered by Apache Zeppelin and Apache Flink to interactively analyze data in the stream. Similarly, you can select a cluster in the Amazon Managed Streaming for Apache Kafka console to start a notebook to analyze data in Apache Kafka streams. You can also start a notebook from the Kinesis Data Analytics Studio console and connect to custom sources.

Architectural diagram.

In the notebook, you can interact with streaming data and get results in seconds using SQL queries and Python or Scala programs. When you are satisfied with your results, with a few clicks you can promote your code to a production stream processing application that runs reliably at scale with no additional development effort.

For new projects, we recommend that you use the new Kinesis Data Analytics Studio over Kinesis Data Analytics for SQL Applications. Kinesis Data Analytics Studio combines ease of use with advanced analytical capabilities, which makes it possible to build sophisticated stream processing applications in minutes. Let’s see how that works in practice.

Using Kinesis Data Analytics Studio to Analyze Streaming Data
I want to get a better understanding of the data sent by some sensors to a Kinesis data stream.

To simulate the workload, I use this random_data_generator.py Python script. You don’t need to know Python to use Kinesis Data Analytics Studio. In fact, I am going to use SQL in the following steps. Also, you can avoid any coding and use the Amazon Kinesis Data Generator user interface (UI) to send test data to Kinesis Data Streams or Kinesis Data Firehose. I am using a Python script to have finer control over the data that is being sent.

import datetime
import json
import random
import boto3

STREAM_NAME = "my-input-stream"


def get_random_data():
    current_temperature = round(10 + random.random() * 170, 2)
    if current_temperature > 160:
        status = "ERROR"
    elif current_temperature > 140 or random.randrange(1, 100) > 80:
        status = random.choice(["WARNING","ERROR"])
    else:
        status = "OK"
    return {
        'sensor_id': random.randrange(1, 100),
        'current_temperature': current_temperature,
        'status': status,
        'event_time': datetime.datetime.now().isoformat()
    }


def send_data(stream_name, kinesis_client):
    while True:
        data = get_random_data()
        partition_key = str(data["sensor_id"])
        print(data)
        kinesis_client.put_record(
            StreamName=stream_name,
            Data=json.dumps(data),
            PartitionKey=partition_key)


if __name__ == '__main__':
    kinesis_client = boto3.client('kinesis')
    send_data(STREAM_NAME, kinesis_client)

This script sends random records to my Kinesis data stream using JSON syntax. For example:

{'sensor_id': 77, 'current_temperature': 93.11, 'status': 'OK', 'event_time': '2021-05-19T11:20:00.978328'}
{'sensor_id': 47, 'current_temperature': 168.32, 'status': 'ERROR', 'event_time': '2021-05-19T11:20:01.110236'}
{'sensor_id': 9, 'current_temperature': 140.93, 'status': 'WARNING', 'event_time': '2021-05-19T11:20:01.243881'}
{'sensor_id': 27, 'current_temperature': 130.41, 'status': 'OK', 'event_time': '2021-05-19T11:20:01.371191'}

From the Kinesis console, I select a Kinesis data stream (my-input-stream) and choose Process data in real time from the Process drop-down. In this way, the stream is configured as a source for the notebook.

Console screenshot.

Then, in the following dialog box, I create an Apache Flink – Studio notebook.

I enter a name (my-notebook) and a description for the notebook. The AWS Identity and Access Management (IAM) permissions to read from the Kinesis data stream I selected earlier (my-input-stream) are automatically attached to the IAM role assumed by the notebook.

Console screenshot.

I choose Create to open the AWS Glue console and create an empty database. Back in the Kinesis Data Analytics Studio console, I refresh the list and select the new database. It will define the metadata for my sources and destinations. From here, I can also review the default Studio notebook settings. Then, I choose Create Studio notebook.

Console screenshot.

Now that the notebook has been created, I choose Run.

Console screenshot.

When the notebook is running, I choose Open in Apache Zeppelin to get access to the notebook and write code in SQL, Python, or Scala to interact with my streaming data and get insights in real time.

In the notebook, I create a new note and call it Sensors. Then, I create a sensor_data table describing the format of the data in the stream:

%flink.ssql

CREATE TABLE sensor_data (
    sensor_id INTEGER,
    current_temperature DOUBLE,
    status VARCHAR(6),
    event_time TIMESTAMP(3),
    WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
)
PARTITIONED BY (sensor_id)
WITH (
    'connector' = 'kinesis',
    'stream' = 'my-input-stream',
    'aws.region' = 'us-east-1',
    'scan.stream.initpos' = 'LATEST',
    'format' = 'json',
    'json.timestamp-format.standard' = 'ISO-8601'
)

The first line in the previous command tells Apache Zeppelin to provide a stream SQL environment (%flink.ssql) for the Apache Flink interpreter. I can also interact with the streaming data using a batch SQL environment (%flink.bsql), or Python (%flink.pyflink) or Scala (%flink) code.

The first part of the CREATE TABLE statement is familiar to anyone who has used SQL with a database. A table is created to store the sensor data in the stream. The WATERMARK option is used to measure progress in the event time, as described in the Event Time and Watermarks section of the Apache Flink documentation.

The second part of the CREATE TABLE statement describes the connector used to receive data in the table (for example, kinesis or kafka), the name of the stream, the AWS Region, the overall data format of the stream (such as json or csv), and the syntax used for timestamps (in this case, ISO 8601). I can also choose the starting position for processing the stream; I am using LATEST to read the most recent data first.

When the table is ready, I find it in the AWS Glue Data Catalog database I selected when I created the notebook:

Console screenshot.

Now I can run SQL queries on the sensor_data table and use sliding or tumbling windows to get a better understanding of what is happening with my sensors.

For an overview of the data in the stream, I start with a simple SELECT to get all the content of the sensor_data table:

%flink.ssql(type=update)

SELECT * FROM sensor_data;

This time the first line of the command has a parameter (type=update) so that the output of the SELECT, which is more than one row, is continuously updated when new data arrives.

On the terminal of my laptop, I start the random_data_generator.py script:

$ python3 random_data_generator.py

At first I see a table that contains the data as it comes. To get a better understanding, I select a bar graph view. Then, I group the results by status to see their average current_temperature, as shown here:

Notebook screenshot.

As expected by the way I am generating these results, I have different average temperatures depending on the status (OK, WARNING, or ERROR). The higher the temperature, the greater the probability that something is not working correctly with my sensors.

I can run the aggregated query explicitly using a SQL syntax. This time, I want the result computed on a sliding window of 1 minute with results updated every 10 seconds. To do so, I am using the HOP function in the GROUP BY section of the SELECT statement. To add the time to the output of the select, I use the HOP_ROWTIME function. For more information, see how group window aggregations work in the Apache Flink documentation.

%flink.ssql(type=update)

SELECT sensor_data.status,
       COUNT(*) AS num,
       AVG(sensor_data.current_temperature) AS avg_current_temperature,
       HOP_ROWTIME(event_time, INTERVAL '10' second, INTERVAL '1' minute) as hop_time
  FROM sensor_data
 GROUP BY HOP(event_time, INTERVAL '10' second, INTERVAL '1' minute), sensor_data.status;

This time, I look at the results in table format:

Notebook screenshot.

To send the result of the query to a destination stream, I create a table and connect the table to the stream. First, I need to give permissions to the notebook to write into the stream.

In the Kinesis Data Analytics Studio console, I select my-notebook. Then, in the Studio notebooks details section, I choose Edit IAM permissions. Here, I can configure the sources and destinations used by the notebook and the IAM role permissions are updated automatically.

Console screenshot.

In the Included destinations in IAM policy section, I choose the destination and select my-output-stream. I save changes and wait for the notebook to be updated. I am now ready to use the destination stream.

In the notebook, I create a sensor_state table connected to my-output-stream.

%flink.ssql

CREATE TABLE sensor_state (
    status VARCHAR(6),
    num INTEGER,
    avg_current_temperature DOUBLE,
    hop_time TIMESTAMP(3)
)
WITH (
'connector' = 'kinesis',
'stream' = 'my-output-stream',
'aws.region' = 'us-east-1',
'scan.stream.initpos' = 'LATEST',
'format' = 'json',
'json.timestamp-format.standard' = 'ISO-8601');

I now use this INSERT INTO statement to continuously insert the result of the select into the sensor_state table.

%flink.ssql(type=update)

INSERT INTO sensor_state
SELECT sensor_data.status,
    COUNT(*) AS num,
    AVG(sensor_data.current_temperature) AS avg_current_temperature,
    HOP_ROWTIME(event_time, INTERVAL '10' second, INTERVAL '1' minute) as hop_time
FROM sensor_data
GROUP BY HOP(event_time, INTERVAL '10' second, INTERVAL '1' minute), sensor_data.status;

The data is also sent to the destination Kinesis data stream (my-output-stream) so that it can be used by other applications. For example, the data in the destination stream can be used to update a real-time dashboard, or to monitor the behavior of my sensors after a software update.

I am satisfied with the result. I want to deploy this query and its output as a Kinesis Analytics application. To do so, I need to provide an S3 location to store the application executable.

In the configuration section of the console, I edit the Deploy as application configuration settings. There, I choose a destination bucket in the same Region and save changes.

Console screenshot.

I wait for the notebook to be ready after the update. Then, I create a SensorsApp note in my notebook and copy the statements that I want to execute as part of the application. The tables have already been created, so I just copy the INSERT INTO statement above.

From the menu at the top right of my notebook, I choose Build SensorsApp and export to Amazon S3 and confirm the application name.

Notebook screenshot.

When the export is ready, I choose Deploy SensorsApp as Kinesis Analytics application in the same menu. After that, I fine-tune the configuration of the application. I set parallelism to 1 because I have only one shard in my input Kinesis data stream and not a lot of traffic. Then, I run the application, without having to write any code.

From the Kinesis Data Analytics applications console, I choose Open Apache Flink dashboard to get more information about the execution of my application.

Apache Flink console screenshot.

Availability and Pricing
You can use Amazon Kinesis Data Analytics Studio today in all AWS Regions where Kinesis Data Analytics is generally available. For more information, see the AWS Regional Services List.

In Kinesis Data Analytics Studio, we run the open-source versions of Apache Zeppelin and Apache Flink, and we contribute changes upstream. For example, we have contributed bug fixes for Apache Zeppelin, and we have contributed to AWS connectors for Apache Flink, such as those for Kinesis Data Streams and Kinesis Data Firehose. Also, we are working with the Apache Flink community to contribute availability improvements, including automatic classification of errors at runtime to understand whether errors are in user code or in application infrastructure.

With Kinesis Data Analytics Studio, you pay based on the average number of Kinesis Processing Units (KPU) per hour, including those used by your running notebooks. One KPU comprises 1 vCPU of compute, 4 GB of memory, and associated networking. You also pay for running application storage and durable application storage. For more information, see the Kinesis Data Analytics pricing page.

Start using Kinesis Data Analytics Studio today to get better insights from your streaming data.

Danilo

Amazon MSK backup for Archival, Replay, or Analytics

Post Syndicated from Rohit Yadav original https://aws.amazon.com/blogs/architecture/amazon-msk-backup-for-archival-replay-or-analytics/

Amazon MSK is a fully managed service that helps you build and run applications that use Apache Kafka to process streaming data. Apache Kafka is an open-source platform for building real-time streaming data pipelines and applications. With Amazon MSK, you can use native Apache Kafka APIs to populate data lakes. You can also stream changes to and from databases, and power machine learning and analytics applications.

Amazon MSK simplifies the setup, scaling, and management of clusters running Apache Kafka. MSK manages the provisioning, configuration, and maintenance of resources for highly available Kafka clusters. It is fully compatible with Apache Kafka and supports familiar community-built tools such as MirrorMaker 2.0, Kafka Connect, and Kafka Streams.

Introduction

In the past few years, the volume of data that companies must ingest has increased significantly. Information comes from various sources, like transactional databases, system logs, SaaS platforms, mobile, and IoT devices. Businesses want to act as soon as the data arrives. This has resulted in increased adoption of scalable real-time streaming solutions. These solutions scale horizontally to provide the needed throughput to process data in real time, with milliseconds of latency. Customers have adopted Amazon MSK as a top choice of streaming platforms. Amazon MSK gives you the flexibility to retain topic data for longer term (default 7 days). This supports replay, analytics, and machine learning based use cases. When IT and business systems are producing and processing terabytes of data per hour, it can become expensive to store, manage, and retrieve data. This has led to legacy data archival processes moving towards cheaper, reliable, and long-term storage solutions like Amazon Simple Storage Service (S3).

Following are some of the benefits of archiving Amazon MSK topic data to Amazon S3:

  1. Reduced Cost – You only need to retain the data in the cluster based on your Recovery Point Objective (RPO). Any historical data can be archived in Amazon S3 and replayed if necessary.
  2. Integration with Enterprise Data Lake – Since your data is available in S3, you can integrate with other data analytics services like Amazon EMR, AWS Glue, and Amazon Athena to run data aggregation and analytics. For example, you can build reports to visualize month-over-month changes.
  3. Optimize Machine Learning Workloads – Machine learning applications will be able to train new models and improve predictions using historical streams of data available in Amazon S3. This also enables better integration with Amazon Machine Learning services.
  4. Compliance – Long-term data archival for regulatory and security compliance.
  5. Backloading data to other systems – Ability to rebuild data into other application environments such as pre-prod, testing, and more.

There are many benefits to using Amazon S3 as long-term storage for Amazon MSK topics. Let’s dive deeper into the recommended architecture for this pattern. We will present an architecture to back up Amazon MSK topics to Amazon S3 in real time. In addition, we’ll demonstrate some of the use cases previously mentioned.

Architecture

The diagram following illustrates the architecture for building a real-time archival pipeline to archive Amazon MSK topics to S3. This architecture uses an AWS Lambda function to process records from your Amazon MSK cluster when the cluster is configured as an event source. As a consumer, you don’t need to worry about infrastructure management or scaling with Lambda. You only pay for what you consume, so you don’t pay for over-provisioned infrastructure.

To create an event source mapping, you can add your Amazon MSK cluster as a Lambda function trigger. The Lambda service internally polls for new records or messages from the event source, and then synchronously invokes the target Lambda function. Lambda reads the messages in batches from one or more partitions and provides these to your function as an event payload. The function then processes records, and sends the payload to an Amazon Kinesis Data Firehose delivery stream. We use a Kinesis Data Firehose delivery stream because it can natively batch, compress, transform, and encrypt your events before loading them to S3.
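A minimal sketch of such a Lambda handler is shown below. It decodes the base64-encoded record values from the MSK event payload and forwards them to a delivery stream; the stream name is a placeholder, and batching limits (such as the 500-record cap per put_record_batch call), retries, and error handling are omitted:

```python
import base64

def decode_records(event: dict) -> list:
    """Flatten the MSK event payload ({"records": {topic-partition: [...]}})
    and base64-decode each record's value."""
    decoded = []
    for records in event.get("records", {}).values():
        for record in records:
            decoded.append(base64.b64decode(record["value"]))
    return decoded

def handler(event, context):
    """Forward decoded Kafka records to a Firehose delivery stream (sketch)."""
    import boto3  # deferred so the decoding logic is testable without AWS

    payload = decode_records(event)
    if payload:
        boto3.client("firehose").put_record_batch(
            DeliveryStreamName="<delivery-stream-name>",
            # Firehose concatenates records, so delimit each with a newline.
            Records=[{"Data": data + b"\n"} for data in payload],
        )
    return {"records_forwarded": len(payload)}
```

The newline delimiter matters downstream: it keeps each Kafka record on its own line in the archived S3 objects, which is what query engines like Athena expect for JSON data.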

In this architecture, Kinesis Data Firehose delivers the records received from Lambda to Amazon S3 as Gzip-compressed files. These files are partitioned in hive-style format by Kinesis Data Firehose:

data/year=yyyy/month=MM/day=dd/hour=HH
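A quick sketch of how a delivery timestamp maps to that prefix:

```python
import datetime

def s3_prefix(ts: datetime.datetime) -> str:
    """Build the hive-style partition prefix used for the archived objects."""
    return ts.strftime("data/year=%Y/month=%m/day=%d/hour=%H")

print(s3_prefix(datetime.datetime(2021, 5, 19, 11)))
# → data/year=2021/month=05/day=19/hour=11
```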

Figure 1. Archival Architecture


Let’s review some of the possible solutions that can be built on this archived data.

Integration with Enterprise Data Lake

The architecture diagram following shows how you can integrate the archived data in Amazon S3 with your Enterprise Data Lake. Since the data files are prefixed in hive-style format, you can partition the data and store the metadata in the AWS Glue Data Catalog. With partitioning in place, you can perform optimizations like partition pruning, which enables predicate pushdown for improved performance of your analytics queries. You can also use AWS data analytics services like Amazon EMR and AWS Glue for batch analytics. Amazon Athena can be used to run serverless, interactive SQL queries on the data for analysis and visualization.

Data currently gets stored in JSON files. Following are some of the services/tools that can be integrated with your archive for reporting, analytics, visualization, and machine learning requirements.

Figure 2. Analytics Architecture


Cloning data into other application environments

There are use cases where you would want to use this archive to clone data into other application environments.

These clusters could be used for testing or debugging purposes. You could decide to use only a subset of your data from the archive. Let’s say you want to debug an issue beyond the configured retention period, but not replicate all the data to your testing environment. With archived data in S3, you can build downstream jobs to filter data that can be loaded into a new Amazon MSK cluster. The following diagram highlights this pattern:

Figure 3. Replay Architecture

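The filtering step in this replay pattern can be as simple as selecting archived JSON records by time range before reloading them; the event_time field name and ISO-8601 format below are assumptions for illustration:

```python
import json

def filter_archive(lines, start_iso: str, end_iso: str, ts_field: str = "event_time"):
    """Yield archived JSON records whose (assumed) ISO-8601 timestamp field
    falls in [start_iso, end_iso); ISO-8601 strings compare lexicographically."""
    for line in lines:
        record = json.loads(line)
        if start_iso <= record[ts_field] < end_iso:
            yield record

# Two archived records, one inside the window of interest.
archived = [
    '{"id": 1, "event_time": "2021-05-19T10:59:59"}',
    '{"id": 2, "event_time": "2021-05-19T11:20:01"}',
]
subset = list(filter_archive(archived, "2021-05-19T11:00:00", "2021-05-19T12:00:00"))
print([r["id"] for r in subset])  # → [2]
```

The resulting subset can then be re-produced into the new MSK cluster with any Kafka producer.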

Ready for a Test Drive

To help you get started, we would like to introduce an AWS Solution: AWS Streaming Data Solution for Amazon MSK (scroll down and see Option 3 tab). There is a single-click AWS CloudFormation template, which can assist you in quickly provisioning resources. This will get your real-time archival pipeline for Amazon MSK up and running quickly. This solution shortens your development time by removing or reducing the need for you to:

  • Model and provision resources using AWS CloudFormation
  • Set up Amazon CloudWatch alarms, dashboards, and logging
  • Manually implement streaming data best practices in AWS

This solution is data and logic agnostic, enabling you to start with boilerplate code and start customizing quickly. After deployment, use this solution’s monitoring capabilities to transition easily to production.

Conclusion

In this post, we explained the architecture to build a scalable, highly available real-time archival of Amazon MSK topics to long term storage in Amazon S3. The architecture was built using Amazon MSK, AWS Lambda, Amazon Kinesis Data Firehose, and Amazon S3. The architecture also illustrates how you can integrate your Amazon MSK streaming data in S3 with your Enterprise Data Lake.

Validate, evolve, and control schemas in Amazon MSK and Amazon Kinesis Data Streams with AWS Glue Schema Registry

Post Syndicated from Brian Likosar original https://aws.amazon.com/blogs/big-data/validate-evolve-and-control-schemas-in-amazon-msk-and-amazon-kinesis-data-streams-with-aws-glue-schema-registry/

Data streaming technologies like Apache Kafka and Amazon Kinesis Data Streams capture and distribute data generated by thousands or millions of applications, websites, or machines. These technologies serve as a highly available transport layer that decouples the data-producing applications from data processors. However, the sheer number of applications producing, processing, routing, and consuming data can make it hard to coordinate and evolve data schemas, like adding or removing a data field, without introducing data quality issues and downstream application failures. Developers often build complex tools, write custom code, or rely on documentation, change management, and Wikis to protect against schema changes. This is quite error prone because it relies too heavily on human oversight. A common solution with data streaming technologies is a schema registry that provides for validation of schema changes to allow for safe evolution as business needs adjust over time.

AWS Glue Schema Registry, a serverless feature of AWS Glue, enables you to validate and reliably evolve streaming data against Apache Avro schemas at no additional charge. Through Apache-licensed serializers and deserializers, the Glue Schema Registry integrates with Java applications developed for Apache Kafka, Amazon Managed Streaming for Apache Kafka (Amazon MSK), Kinesis Data Streams, Apache Flink, Amazon Kinesis Data Analytics for Apache Flink, and AWS Lambda.

This post explains the benefits of using the Glue Schema Registry and provides examples of how to use it with both Apache Kafka and Kinesis Data Streams.

With the Glue Schema Registry, you can eliminate defensive coding and cross-team coordination, improve data quality, reduce downstream application failures, and use a registry that is integrated across multiple AWS services. Each schema can be versioned within the guardrails of a compatibility mode, providing developers the flexibility to reliably evolve schemas. Additionally, the Glue Schema Registry can serialize data into a compressed format, helping you save on data transfer and storage costs.

Although there are many ways to leverage the Glue Schema Registry (including using the API to build your own integrations), in this post, we show two use cases. The Schema Registry is a free feature that can significantly improve data quality and developer productivity. If you use Avro schemas, you should be using the Schema Registry to supplement your solutions built on Apache Kafka (including Amazon MSK) or Kinesis Data Streams. The following diagram illustrates this architecture.

AWS Glue Schema Registry features

Glue Schema Registry has the following features:

  • Schema discovery – When a producer registers a schema change, metadata can be applied as a key-value pair to provide searchable information for administrators or developers. This metadata can indicate the original source of the data (source=MSK_west), the team name to contact (owner=DataEngineering), or AWS tags (environment=Production). You could potentially encrypt a field in your data on the producing client and use metadata to specify to potential consumer clients which public key fingerprint to use for decryption.
  • Schema compatibility – The versioning of each schema is governed by a compatibility mode. If a new version of a schema is requested to be registered that breaks the specified compatibility mode, the request fails and an exception is thrown. Compatibility checks enable developers building downstream applications to have a bounded set of scenarios to build applications against, which helps to prepare for the changes without issue. Commonly used modes are FORWARD, BACKWARD, and FULL. For more information about mode definitions, see Schema Versioning and Compatibility.
  • Schema validation – Glue Schema Registry serializers work to validate that the schema used during data production is compatible. If it isn’t, the data producer receives an exception from the serializer. This ensures that potentially breaking changes are found earlier in development cycles, and can also help prevent unintentional schema changes due to human error.
  • Auto-registration of schemas – If configured to do so, the producer of data can auto-register schema changes as they flow in the data stream. This is especially useful for use cases where the source of the data is change data capture from a database.
  • IAM support – Thanks to integrated AWS Identity and Access Management (IAM) support, only authorized producers can change certain schemas. Furthermore, only those consumers authorized to read the schema can do so. Schema changes are typically performed deliberately and with care, so it’s important to use IAM to control who performs these changes. Additionally, access control to schemas is important in situations where you might have sensitive information included in the schema definition itself. In the examples that follow, IAM roles are inferred via the AWS SDK for Java, so they are inherited from the Amazon Elastic Compute Cloud (Amazon EC2) instance’s role that the application runs in. IAM roles can also be applied to any other AWS service that could contain this code, such as containers or Lambda functions.
  • Integrations and other support – The provided serializers and deserializers are currently for Java clients using Apache Avro for data serialization. The GitHub repo also contains support for Apache Kafka Streams, Apache Kafka Connect, and Apache Flink—all licensed using the Apache License 2.0. We’re already working on additional language and data serialization support, but we need your feedback on what you’d like to see next.
  • Secondary deserializer – If you have already registered schemas in another schema registry, there’s an option for specifying a secondary deserializer when performing schema lookups. This allows for migrations from other schema registries without having to start anew. If the schema ID being used isn’t known to the Glue Schema Registry, it’s looked for in the secondary deserializer.
  • Compression – Using the Avro format already reduces message size due to its compact, binary format. Using a schema registry can further reduce data payload by no longer needing to send and receive schemas with each message. Glue Schema Registry libraries also provide an option for zlib compression, which can reduce data requirements even further by compressing the payload of the message. This varies by use case, but compression can reduce the size of the message significantly.

Example schema

For this post, we use the following schema to begin each of our use cases:

{
  "namespace": "Customer.avro",
  "type": "record",
  "name": "Customer",
  "fields": [
    {"name": "first_name", "type": "string"},
    {"name": "last_name", "type": "string"}
  ]
}

Using AWS Glue Schema Registry with Amazon MSK and Apache Kafka

You can use the following Apache Kafka producer code to produce Apache Avro formatted messages to a topic with the preceding schema:

package com.amazon.gsrkafka;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.SerializationException;
import com.amazonaws.services.schemaregistry.serializers.avro.AWSKafkaAvroSerializer;
import com.amazonaws.services.schemaregistry.utils.AWSSchemaRegistryConstants;
import software.amazon.awssdk.services.glue.model.Compatibility;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.Schema;
import org.apache.avro.Schema.Parser;
import java.util.Properties;
import java.io.IOException;
import java.io.File;

public class gsrkafka {
    private static final Properties properties = new Properties();
    private static final String topic = "test";

    public static void main(final String[] args) throws IOException {
        // Set the default synchronous HTTP client to UrlConnectionHttpClient
        System.setProperty("software.amazon.awssdk.http.service.impl", "software.amazon.awssdk.http.urlconnection.UrlConnectionSdkHttpService");
        properties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, AWSKafkaAvroSerializer.class.getName());
        properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, AWSKafkaAvroSerializer.class.getName());
        properties.put(AWSSchemaRegistryConstants.AWS_REGION, "us-east-2");
        properties.put(AWSSchemaRegistryConstants.REGISTRY_NAME, "liko-schema-registry");
        properties.put(AWSSchemaRegistryConstants.SCHEMA_NAME, "customer");
        properties.put(AWSSchemaRegistryConstants.COMPATIBILITY_SETTING, Compatibility.FULL);
        properties.put(AWSSchemaRegistryConstants.SCHEMA_AUTO_REGISTRATION_SETTING, true);

        Schema schema_customer = new Parser().parse(new File("Customer.avsc"));
        GenericRecord customer = new GenericData.Record(schema_customer);

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<String, GenericRecord>(properties)) {
            // The record holds the GenericRecord by reference, so each send
            // serializes the most recently set field values.
            final ProducerRecord<String, GenericRecord> record = new ProducerRecord<String, GenericRecord>(topic, customer);

            customer.put("first_name", "Ada");
            customer.put("last_name", "Lovelace");
            customer.put("full_name", "Ada Lovelace");
            producer.send(record);
            System.out.println("Sent message");
            Thread.sleep(1000L);

            customer.put("first_name", "Sue");
            customer.put("last_name", "Black");
            customer.put("full_name", "Sue Black");
            producer.send(record);
            System.out.println("Sent message");
            Thread.sleep(1000L);

            customer.put("first_name", "Anita");
            customer.put("last_name", "Borg");
            customer.put("full_name", "Anita Borg");
            producer.send(record);
            System.out.println("Sent message");
            Thread.sleep(1000L);

            customer.put("first_name", "Grace");
            customer.put("last_name", "Hopper");
            customer.put("full_name", "Grace Hopper");
            producer.send(record);
            System.out.println("Sent message");
            Thread.sleep(1000L);

            customer.put("first_name", "Neha");
            customer.put("last_name", "Narkhede");
            customer.put("full_name", "Neha Narkhede");
            producer.send(record);
            System.out.println("Sent message");
            Thread.sleep(1000L);

            producer.flush();
            System.out.println("Successfully produced 5 messages to a topic called " + topic);
        } catch (final InterruptedException | SerializationException e) {
            e.printStackTrace();
        }
    }
}

Use the following Apache Kafka consumer code to look up the schema information while consuming from a topic to learn the schema details:

package com.amazon.gsrkafka;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.errors.SerializationException;
import com.amazonaws.services.schemaregistry.deserializers.avro.AWSKafkaAvroDeserializer;
import com.amazonaws.services.schemaregistry.utils.AvroRecordType;
import com.amazonaws.services.schemaregistry.utils.AWSSchemaRegistryConstants;
import org.apache.avro.generic.GenericRecord;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class gsrkafka {
    private static final Properties properties = new Properties();
    private static final String topic = "test";

    public static void main(final String[] args) {
        // Set the default synchronous HTTP client to UrlConnectionHttpClient
        System.setProperty("software.amazon.awssdk.http.service.impl", "software.amazon.awssdk.http.urlconnection.UrlConnectionSdkHttpService");
        properties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        properties.put(ConsumerConfig.GROUP_ID_CONFIG, "gsr-client");
        properties.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        properties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, AWSKafkaAvroDeserializer.class.getName());
        properties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, AWSKafkaAvroDeserializer.class.getName());
        properties.put(AWSSchemaRegistryConstants.AWS_REGION, "us-east-2");
        properties.put(AWSSchemaRegistryConstants.REGISTRY_NAME, "liko-schema-registry");
        properties.put(AWSSchemaRegistryConstants.AVRO_RECORD_TYPE, AvroRecordType.GENERIC_RECORD.getName());

        try (final KafkaConsumer<String, GenericRecord> consumer = new KafkaConsumer<String, GenericRecord>(properties)) {
            consumer.subscribe(Collections.singletonList(topic));
            while (true) {
                final ConsumerRecords<String, GenericRecord> records = consumer.poll(Duration.ofMillis(1000));
                for (final ConsumerRecord<String, GenericRecord> record : records) {
                    final GenericRecord value = record.value();
                    System.out.println("Received message: value = " + value);
                }
            }
        } catch (final SerializationException e) {
            e.printStackTrace();
        }
    }
}

Using AWS Glue Schema Registry with Kinesis Data Streams

You can use the following Kinesis Producer Library (KPL) code to publish messages in Apache Avro format to a Kinesis data stream with the preceding schema:

private static final String SCHEMA_DEFINITION = "{\"namespace\": \"Customer.avro\",\n"
    + " \"type\": \"record\",\n"
    + " \"name\": \"Customer\",\n"
    + " \"fields\": [\n"
    + " {\"name\": \"first_name\", \"type\": \"string\"},\n"
    + " {\"name\": \"last_name\", \"type\": \"string\"}\n"
    + " ]\n"
    + "}";

KinesisProducerConfiguration config = new KinesisProducerConfiguration();
config.setRegion("us-west-1");

// [Optional] configuration for Schema Registry.
GlueSchemaRegistryConfiguration schemaRegistryConfig =
    new GlueSchemaRegistryConfiguration("us-west-1");
schemaRegistryConfig.setCompression(true);
config.setGlueSchemaRegistryConfiguration(schemaRegistryConfig);
// Optional configuration ends.

final KinesisProducer producer = new KinesisProducer(config);

final ByteBuffer data = getDataToSend();

com.amazonaws.services.schemaregistry.common.Schema gsrSchema =
    new Schema(SCHEMA_DEFINITION, DataFormat.AVRO.toString(), "demoSchema");

ListenableFuture<UserRecordResult> f = producer.addUserRecord(
    config.getStreamName(), TIMESTAMP, Utils.randomExplicitHashKey(), data, gsrSchema);

private static ByteBuffer getDataToSend() throws IOException {
    org.apache.avro.Schema avroSchema =
        new org.apache.avro.Schema.Parser().parse(SCHEMA_DEFINITION);

    // Populate the fields defined by the Customer schema
    GenericRecord customer = new GenericData.Record(avroSchema);
    customer.put("first_name", "Emily");
    customer.put("last_name", "Watson");

    ByteArrayOutputStream outBytes = new ByteArrayOutputStream();
    Encoder encoder = EncoderFactory.get().directBinaryEncoder(outBytes, null);
    new GenericDatumWriter<>(avroSchema).write(customer, encoder);
    encoder.flush();
    return ByteBuffer.wrap(outBytes.toByteArray());
}

On the consumer side, you can use the Kinesis Client Library (KCL) (v2.3 or later) to look up schema information while retrieving messages from a Kinesis data stream:

GlueSchemaRegistryConfiguration schemaRegistryConfig =
    new GlueSchemaRegistryConfiguration(this.region.toString());

GlueSchemaRegistryDeserializer glueSchemaRegistryDeserializer =
    new GlueSchemaRegistryDeserializerImpl(DefaultCredentialsProvider.builder().build(), schemaRegistryConfig);

RetrievalConfig retrievalConfig = configsBuilder.retrievalConfig().retrievalSpecificConfig(new PollingConfig(streamName, kinesisClient));
retrievalConfig.glueSchemaRegistryDeserializer(glueSchemaRegistryDeserializer);

Scheduler scheduler = new Scheduler(
    configsBuilder.checkpointConfig(),
    configsBuilder.coordinatorConfig(),
    configsBuilder.leaseManagementConfig(),
    configsBuilder.lifecycleConfig(),
    configsBuilder.metricsConfig(),
    configsBuilder.processorConfig(),
    retrievalConfig
);

public void processRecords(ProcessRecordsInput processRecordsInput) {
    MDC.put(SHARD_ID_MDC_KEY, shardId);
    try {
        log.info("Processing {} record(s)", processRecordsInput.records().size());
        processRecordsInput.records()
            .forEach(r ->
                log.info("Processed record pk: {} -- Seq: {} : data {} with schema: {}",
                    r.partitionKey(), r.sequenceNumber(), recordToAvroObj(r).toString(), r.schema()));
    } catch (Throwable t) {
        log.error("Caught throwable while processing records. Aborting.");
        Runtime.getRuntime().halt(1);
    } finally {
        MDC.remove(SHARD_ID_MDC_KEY);
    }
}

private GenericRecord recordToAvroObj(KinesisClientRecord r) {
    byte[] data = new byte[r.data().remaining()];
    r.data().get(data, 0, data.length);
    org.apache.avro.Schema schema = new org.apache.avro.Schema.Parser().parse(r.schema().getSchemaDefinition());
    DatumReader<GenericRecord> datumReader = new GenericDatumReader<>(schema);

    BinaryDecoder binaryDecoder = DecoderFactory.get().binaryDecoder(data, 0, data.length, null);
    try {
        return datumReader.read(null, binaryDecoder);
    } catch (IOException e) {
        throw new RuntimeException("Unable to decode Avro record", e);
    }
}

Example of schema evolution

As a producer, let’s say you want to add a field to our schema:

{
  "namespace": "Customer.avro",
  "type": "record",
  "name": "Customer",
  "fields": [
    {"name": "first_name", "type": "string"},
    {"name": "last_name", "type": "string"},
    {"name": "full_name", "type": ["string", "null"], "default": null}
  ]
}

Regardless of whether you’re following the Apache Kafka or Kinesis Data Streams example, you can use the previously provided producer code to publish new messages using this new schema version with the full_name field. This is simply a concatenation of first_name and last_name.

This schema change added an optional field (full_name), which is indicated by the type field having an option of null in addition to string with a default of null. In adding this optional field, we’ve created a schema evolution. This qualifies as a FORWARD compatible change because the producer has modified the schema and the consumer can read without updating its version of the schema. It’s a good practice to provide a default for a given field. This allows for its eventual removal if necessary. If it’s removed by the producer, the consumer uses the default that it knew for that field from before the removal.

This change is also a BACKWARD compatible change, because if the consumer changes the schema it expects to receive, it can use that default to fill in the value for the field it isn’t receiving. By being both FORWARD and BACKWARD compatible, it is therefore a FULL compatible change. The Glue Schema Registry serializers default to BACKWARD compatible, so we have to add a line declaring it as FULL.

In looking at the full option set, you may find FORWARD_ALL, BACKWARD_ALL, and FULL_ALL. These typically only come into play when you want to change data types for a field whose name you don’t change. The most common observed compatibility mode is BACKWARD, which is why it’s the default.

As a consumer application, however, you don’t want to have to recompile your application to handle the addition of a new field. If you want to reference the customer by full name, that’s your choice in your app instead of being forced to consume the new field and use it. When you consume the new messages you’ve just produced, your application doesn’t crash or have problems, because it’s still using the prior version of the schema, and that schema change is compatible with your application. To experience this in action, run the consumer code in one window and don’t interrupt it. As you run the producer application again, this time with messages following the new schema, you can still see output without issue, thanks to the Glue Schema Registry.
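The reasoning above can be captured in a small check: a change that only adds fields stays both FORWARD and BACKWARD (therefore FULL) compatible when every added field is optional with a default. The following is a simplified sketch of that rule of thumb, not the registry's actual validation logic.

```javascript
// Simplified rule of thumb: a new version that only *adds* fields is
// FULL compatible when every added field is nullable with a default.
function addedFieldsFullCompatible(oldSchema, newSchema) {
  const oldNames = new Set(oldSchema.fields.map((f) => f.name));
  return newSchema.fields
    .filter((f) => !oldNames.has(f.name))
    .every((f) => Array.isArray(f.type) && f.type.includes('null') && 'default' in f);
}

const v1 = { fields: [{ name: 'first_name', type: 'string' }, { name: 'last_name', type: 'string' }] };
const v2 = {
  fields: [
    ...v1.fields,
    { name: 'full_name', type: ['string', 'null'], default: null }, // optional with default
  ],
};

console.log(addedFieldsFullCompatible(v1, v2)); // true
```

The real compatibility checks cover many more cases (removals, renames, type changes); this only mirrors the specific scenario walked through above.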

Conclusion

In this post, we discussed the benefits of using the Glue Schema Registry to register, validate, and evolve schemas for data streams as business needs change. We also provided examples of how to use Glue Schema Registry with Apache Kafka and Kinesis Data Streams.

For more information and to get started, see AWS Glue Schema Registry.


About the Authors

Brian Likosar is a Senior Streaming Specialist Solutions Architect at Amazon Web Services. Brian loves helping customers capture value from real-time streaming architectures, because he knows life doesn’t happen in batch. He’s a big fan of open-source collaboration, theme parks, and live music.


Larry Heathcote is a Senior Product Marketing Manager at Amazon Web Services for data streaming and analytics. Larry is passionate about seeing the results of data-driven insights on business outcomes. He enjoys walking his Samoyed Sasha in the mornings so she can look for squirrels to bark at.


Using self-hosted Apache Kafka as an event source for AWS Lambda

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/using-self-hosted-apache-kafka-as-an-event-source-for-aws-lambda/

Apache Kafka is an open-source, distributed event streaming platform used to support workloads such as data pipelines and streaming analytics. It is conceptually similar to Amazon Kinesis.

With the launch of Kafka as an event source for Lambda, you can now consume messages from a topic in a Lambda function. This makes it easier to integrate your self-hosted Kafka clusters with downstream serverless workflows.

In this blog post, I explain how to set up an Apache Kafka cluster on Amazon EC2 and configure key elements in the networking configuration. I also show how to create a Lambda function to consume messages from a Kafka topic. Although the process is similar to using Amazon Managed Streaming for Apache Kafka (Amazon MSK) as an event source, there are also some important differences.

Overview

Using Kafka as an event source operates in a similar way to using Amazon SQS or Amazon Kinesis. In all cases, the Lambda service internally polls for new records or messages from the event source, and then synchronously invokes the target Lambda function. Lambda reads the messages in batches and provides the message batches to your function in the event payload.

Lambda is a consumer application for your Kafka topic. It processes records from one or more partitions and sends the payload to the target function. Lambda continues to process batches until there are no more messages in the topic.

Configuring networking for self-hosted Kafka

It’s a best practice to deploy the Amazon EC2 instances running Kafka in private subnets. For the Lambda service to poll the Kafka instances, you must ensure that there is a NAT Gateway running in a public subnet of the VPC.

It’s possible to route the traffic to a single NAT Gateway in one AZ for test and development workloads. For redundancy in production workloads, it’s recommended that there is one NAT Gateway available in each Availability Zone. This walkthrough creates the following architecture:

Self-hosted Kafka architecture

  1. Deploy a VPC with public and private subnets and a NAT Gateway that enables internet access. To configure this infrastructure with AWS CloudFormation, deploy this template.
  2. From the VPC console, edit the default security group created by this template to provide inbound access to the following ports:
    • Custom TCP: ports 2888–3888 from all sources.
    • SSH (port 22), restricted to your own IP address.
    • Custom TCP: port 2181 from all sources.
    • Custom TCP: port 9092 from all sources.
    • All traffic from the same security group identifier.

Security Group configuration

Deploying the EC2 instances and installing Kafka

Next, you deploy the EC2 instances using this network configuration and install the Kafka application:

  1. From the EC2 console, deploy an instance running Ubuntu Server 18.04 LTS. Ensure that there is one instance in each private subnet, in different Availability Zones. Assign the default security group configured by the template.
  2. Next, deploy another EC2 instance in either of the public subnets. This is a bastion host used to access the private instances. Assign the default security group configured by the template.
  3. Connect to the bastion host, then SSH to the first private EC2 instance using the method for your preferred operating system. This post explains different methods. Repeat the process in another terminal for the second private instance.
  4. On each instance, install Java:
    sudo add-apt-repository ppa:webupd8team/java
    sudo apt update
    sudo apt install openjdk-8-jdk
    java -version
  5. On each instance, install Kafka:
    wget http://www-us.apache.org/dist/kafka/2.3.1/kafka_2.12-2.3.1.tgz
    tar xzf kafka_2.12-2.3.1.tgz
    ln -s kafka_2.12-2.3.1 kafka

Configure and start Zookeeper

Configure and start the Zookeeper service that manages the Kafka brokers:

  1. On the first instance, configure the Zookeeper ID:
    cd kafka
    mkdir /tmp/zookeeper
    touch /tmp/zookeeper/myid
    echo "1" >> /tmp/zookeeper/myid
  2. Repeat the process on the second instance, using a different ID value:
    cd kafka
    mkdir /tmp/zookeeper
    touch /tmp/zookeeper/myid
    echo "2" >> /tmp/zookeeper/myid
  3. On the first instance, edit the config/zookeeper.properties file, adding the private IP address of the second instance:
    initLimit=5
    syncLimit=2
    tickTime=2000
    # list of servers: <ip>:2888:3888
    server.1=0.0.0.0:2888:3888 
    server.2=<<IP address of second instance>>:2888:3888
    
  4. On the second instance, edit the config/zookeeper.properties file, adding the private IP address of the first instance:
    initLimit=5
    syncLimit=2
    tickTime=2000
    # list of servers: <ip>:2888:3888
    server.1=<<IP address of first instance>>:2888:3888 
    server.2=0.0.0.0:2888:3888
  5. On each instance, start Zookeeper:
    bin/zookeeper-server-start.sh config/zookeeper.properties

Configure and start Kafka

Configure and start the Kafka broker:

  1. On the first instance, edit the config/server.properties file:
    broker.id=1
    zookeeper.connect=0.0.0.0:2181,<<IP address of second instance>>:2181
  2. On the second instance, edit the config/server.properties file:
    broker.id=2
    zookeeper.connect=0.0.0.0:2181,<<IP address of first instance>>:2181
  3. Start Kafka on each instance:
    bin/kafka-server-start.sh config/server.properties

At the end of this process, Zookeeper and Kafka are running on both instances. If you use separate terminals, it looks like this:

Zookeeper and Kafka terminals

Configuring and publishing to a topic

Kafka organizes channels of messages around topics, which are virtual groups of one or many partitions across Kafka brokers in a cluster. Multiple producers can send messages to Kafka topics, which can then be routed to and processed by multiple consumers. Producers publish to the tail of a topic and consumers read the topic at their own pace.
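Which partition a keyed message lands on is deterministic: Kafka's default partitioner hashes the message key (using murmur2) and takes the result modulo the partition count. The sketch below uses a stand-in hash purely to show the shape of the idea, not Kafka's actual algorithm.

```javascript
// Stand-in hash (NOT Kafka's murmur2): the point is only that equal keys
// always map to the same partition, which preserves per-key ordering.
function toyHash(key) {
  let h = 0;
  for (const ch of key) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return h;
}

function partitionFor(key, numPartitions) {
  return toyHash(key) % numPartitions;
}

// Every message produced with the same key goes to the same partition.
console.log(partitionFor('customer-42', 2) === partitionFor('customer-42', 2)); // true
```

This is why choosing a good key matters: it determines both ordering guarantees and how evenly load spreads across partitions.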

From either of the two instances:

  1. Create a new topic called test:
    bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 2 --partitions 2 --topic test
  2. Start a producer:
    bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
  3. Enter test messages to check for successful publication.

At this point, you can successfully publish messages to your self-hosted Kafka cluster. Next, you configure a Lambda function as a consumer for the test topic on this cluster.

Configuring the Lambda function and event source mapping

You can create the Lambda event source mapping using the AWS CLI or AWS SDK, which provide the CreateEventSourceMapping API. In this walkthrough, you use the AWS Management Console to create the event source mapping.

Create a Lambda function that uses the self-hosted cluster and topic as an event source:

  1. From the Lambda console, select Create function.
  2. Enter a function name, and select Node.js 12.x as the runtime.
  3. Select the Permissions tab, and select the role name in the Execution role panel to open the IAM console.
  4. Choose Add inline policy and create a new policy called SelfHostedKafkaPolicy with the following permissions. Replace the resource example with the ARNs of your instances:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "ec2:CreateNetworkInterface",
                    "ec2:DescribeNetworkInterfaces",
                    "ec2:DescribeVpcs",
                    "ec2:DeleteNetworkInterface",
                    "ec2:DescribeSubnets",
                    "ec2:DescribeSecurityGroups",
                    "logs:CreateLogGroup",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents"
                ],
                "Resource": "arn:aws:ec2:<REGION>:<ACCOUNT_ID>:instance/<instance-id>"
            }
        ]
    }
    

    Create policy

  5. Choose Create policy and ensure that the policy appears in Permissions policies.
  6. Back in the Lambda function, select the Configuration tab. In the Designer panel, choose Add trigger.
  7. In the dropdown, select Apache Kafka:
    • For Bootstrap servers, add each of the two instances’ private IPv4 DNS addresses with port 9092 appended.
    • For Topic name, enter ‘test’.
    • Enter your preferred batch size and starting position values (see this documentation for more information).
    • For VPC, select the VPC created by the template.
    • For VPC subnets, select the two private subnets.
    • For VPC security groups, select the default security group.
    • Choose Add.

Add trigger configuration

The trigger’s status changes to Enabled in the Lambda console after a few seconds. It then takes several minutes for the trigger to receive messages from the Kafka cluster.

Testing the Lambda function

At this point, you have created a VPC with two private and public subnets and a NAT Gateway. You have created a Kafka cluster on two EC2 instances in private subnets. You set up a target Lambda function with the necessary IAM permissions. Next, you publish messages to the test topic in Kafka and see the resulting invocation in the logs for the Lambda function.

  1. In the Function code panel, replace the contents of index.js with the following code and choose Deploy:
    exports.handler = async (event) => {
        // Iterate through keys
        for (let key in event.records) {
          console.log('Key: ', key)
          // Iterate through records
          event.records[key].map((record) => {
            console.log('Record: ', record)
            // Decode base64
            const msg = Buffer.from(record.value, 'base64').toString()
            console.log('Message:', msg)
          }) 
        }
    }
  2. Back in the terminal with the producer script running, enter a test message.
  3. In the Lambda function console, select the Monitoring tab then choose View logs in CloudWatch. In the latest log stream, you see the original event and the decoded message.

Using Lambda as event source

The Lambda function target in the event source mapping does not need to be connected to a VPC to receive messages from the private instance hosting Kafka. However, you must provide details of the VPC, subnets, and security groups in the event source mapping for the Kafka cluster.

The Lambda function must have permission to describe VPCs and security groups, and to manage elastic network interfaces. The required execution role permissions are:

  • ec2:CreateNetworkInterface
  • ec2:DescribeNetworkInterfaces
  • ec2:DescribeVpcs
  • ec2:DeleteNetworkInterface
  • ec2:DescribeSubnets
  • ec2:DescribeSecurityGroups

The event payload for the Lambda function contains an array of records. Each array item contains details of the topic and Kafka partition identifier, together with a timestamp and base64 encoded message:

Event payload example
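A standalone sketch of that payload shape and the decode step follows; the field values are invented for illustration, and the real event also includes additional fields such as the offset.

```javascript
// Invented example event in the shape Lambda delivers for Kafka sources:
// records keyed by "topic-partition", message values base64 encoded.
const event = {
  records: {
    'test-0': [
      {
        topic: 'test',
        partition: 0,
        timestamp: 1617000000000,
        value: Buffer.from('hello from kafka').toString('base64'),
      },
    ],
  },
};

// The same decode step the Lambda handler performs, outside of Lambda.
const decoded = [];
for (const key in event.records) {
  for (const record of event.records[key]) {
    decoded.push(Buffer.from(record.value, 'base64').toString());
  }
}

console.log(decoded); // [ 'hello from kafka' ]
```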

There is an important difference in the way the Lambda service connects to the self-hosted Kafka cluster compared with Amazon MSK. MSK encrypts data in transit by default so the broker connection defaults to using TLS. With a self-hosted cluster, TLS authentication is not supported when using the Apache Kafka event source. Instead, if accessing brokers over the internet, the event source uses SASL/SCRAM authentication, which can be configured in the event source mapping:

SASL/SCRAM configuration

To learn how to configure SASL/SCRAM authentication for your self-hosted Kafka cluster, see this documentation.

Conclusion

Lambda now supports self-hosted Kafka as an event source so you can invoke Lambda functions from messages in Kafka topics to integrate into other downstream serverless workflows.

This post shows how to configure a self-hosted Kafka cluster on EC2 and set up the network configuration. I also cover how to set up the event source mapping in Lambda and test a function to decode the messages sent from Kafka.

To learn more about how to use this feature, read the documentation. For more serverless learning resources, visit Serverless Land.