Tag Archives: Amazon Managed Streaming for Apache Kafka (Amazon MSK)

Introducing AWS Lambda batching controls for message broker services

2022-01-20 Julian Wood

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/introducing-aws-lambda-batching-controls-for-message-broker-services/

This post is written by Mithun Mallick, Senior Specialist Solutions Architect.

AWS Lambda now supports configuring a maximum batch window for instance-based message broker services to fine-tune when Lambda invocations occur. This feature gives you additional control over batching behavior when processing data. It applies to Amazon Managed Streaming for Apache Kafka (Amazon MSK), self-hosted Apache Kafka, and Amazon MQ for Apache ActiveMQ and RabbitMQ.

Apache Kafka is an open source event streaming platform used to support workloads such as data pipelines and streaming analytics. It is conceptually similar to Amazon Kinesis. Amazon MSK is a fully managed, highly available service that simplifies the setup, scaling, and management of clusters running Kafka.

Amazon MQ is a managed, highly available message broker service for Apache ActiveMQ and RabbitMQ that makes it easier to set up and operate message brokers on AWS. Amazon MQ reduces your operational responsibilities by managing the provisioning, setup, and maintenance of message brokers for you.

Amazon MSK, self-hosted Apache Kafka and Amazon MQ for ActiveMQ and RabbitMQ are all available as event sources for AWS Lambda. You configure an event source mapping to use Lambda to process items from a stream or queue. This allows you to use these message broker services to store messages and asynchronously integrate them with downstream serverless workflows.

In this blog, I explain how message batching works. I show how to use the new maximum batching window control for the managed message broker services and self-managed Apache Kafka.

Understanding batching

For event source mappings, the Lambda service internally polls for new records or messages from the event source, and then synchronously invokes the target Lambda function. Lambda reads the messages in batches and provides these to your function as an event payload. Batching allows higher throughput message processing, up to 10,000 messages in a batch. The payload limit of a single invocation is 6 MB.

Previously, you could only use batch size to configure the maximum number of messages Lambda would poll for. Once a defined batch size is reached, the poller invokes the function with the entire set of messages. This feature is ideal when handling a low volume of messages or batches of data that take time to build up.

Batching window

The new Batch Window control allows you to set the maximum amount of time, in seconds, that Lambda spends gathering records before invoking the function. This brings similar batching functionality that AWS supports with Amazon SQS to Amazon MQ, Amazon MSK and self-managed Apache Kafka. The Lambda event source mapping batching functionality can be described as follows.

Batching controls with Lambda event source mapping

Using MaximumBatchingWindowInSeconds, you can set your function to wait up to 300 seconds for a batch to build before processing it. This allows you to create bigger batches if there are enough messages. You can manage the average number of records processed by the function with each invocation. This increases the efficiency of each invocation, and reduces the frequency.

Setting MaximumBatchingWindowInSeconds to 0 invokes the target Lambda function as soon as the Lambda event source receives a message from the broker.

Message broker batching behavior

For ActiveMQ, the Lambda event source mapping uses the Java Message Service (JMS) API to receive messages. For RabbitMQ, Lambda uses a RabbitMQ client library to get messages from the queue.

The Lambda event source mappings act as a consumer when polling the queue. The batching pattern for all instance-based message broker services is the same. As soon as a message is received, the batching window timer starts. If there are more messages, the consumer makes additional calls to the broker and adds them to a buffer. It keeps a count of the number of messages and the total size of the payload.

The batch is considered complete when the configured batch size is reached, when the addition of a new message makes the total payload size equal to or greater than 6 MB, or when the batch window timeout is reached. If the payload size is greater than 6 MB, the last message is returned to the broker.

Lambda then invokes the target Lambda function synchronously and passes on the batch of messages to the function. The Lambda event source continues to poll for more messages and as soon as it retrieves the next message, the batching window starts again. Polling and invocation of the target Lambda function occur in separate processes.
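The batching rules described above can be sketched as a small offline simulation. This is an illustrative model only, not Lambda's actual poller: the function name build_batches and the (arrival_time, size) message representation are assumptions made for this example, and the window check is applied lazily when the next message arrives rather than by a real timer.

```python
MAX_PAYLOAD_BYTES = 6 * 1024 * 1024  # 6 MB invocation payload limit from the post


def build_batches(messages, batch_size, window_seconds, max_payload=MAX_PAYLOAD_BYTES):
    """Group (arrival_time_seconds, size_bytes) tuples, sorted by arrival time, into batches."""
    batches = []
    current = []
    window_start = None
    total = 0
    for arrival, size in messages:
        # The window opened by the first message of the batch has expired
        if current and arrival - window_start >= window_seconds:
            batches.append(current)
            current, total = [], 0
        # Adding this message would exceed the payload limit, so it is
        # "returned to the broker" and becomes the first message of the next batch
        if current and total + size > max_payload:
            batches.append(current)
            current, total = [], 0
        if not current:
            window_start = arrival  # the timer starts with the first message
        current.append((arrival, size))
        total += size
        # The batch is complete once the batch size or payload limit is reached
        if len(current) >= batch_size or total >= max_payload:
            batches.append(current)
            current, total = [], 0
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    sample = [(0, 100), (1, 100), (2, 100), (10, 100)]
    for batch in build_batches(sample, batch_size=10, window_seconds=5):
        print(len(batch), "message(s):", batch)
```

With a window of 0, every message in this model ends up in its own batch, which matches the behavior described for MaximumBatchingWindowInSeconds set to 0.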

Kafka uses a distributed append log architecture to store messages. This works differently from ActiveMQ and RabbitMQ as messages are not removed from the broker once they have been consumed. Instead, consumers must maintain an offset to the last record or message that was consumed from the broker. Kafka provides several options in the consumer API to simplify the tracking of offsets.

Amazon MSK and Apache Kafka store data in multiple partitions to provide higher scalability. Lambda reads the messages sequentially for each partition and a batch may contain messages from different partitions. Lambda then commits the offsets once the target Lambda function is invoked successfully.

Configuring the maximum batching window

To reduce Lambda function invocations for existing or new functions, set the MaximumBatchingWindowInSeconds value close to 300 seconds. A longer batching window can introduce additional latency. For latency-sensitive workloads set the MaximumBatchingWindowInSeconds value to an appropriate setting.
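To get a feel for this trade-off, the following back-of-the-envelope sketch estimates invocations per hour and the added wait for a steady message rate. The helper estimate_batching and its formula are assumptions for illustration: it ignores the 6 MB payload limit, bursty traffic, and concurrency, so treat the numbers as rough guidance rather than a prediction.

```python
def estimate_batching(rate_per_second, batch_size, window_seconds):
    """Rough estimate of batching behavior for a steady message rate."""
    if rate_per_second <= 0:
        raise ValueError("rate_per_second must be positive")
    if window_seconds == 0:
        # A window of 0 means the function is invoked as soon as a message arrives
        messages_per_batch = 1
    else:
        # A batch closes when it is full or when the window expires, whichever is first
        messages_per_batch = max(1, min(batch_size, rate_per_second * window_seconds))
    invocations_per_hour = rate_per_second * 3600 / messages_per_batch
    # The first message of a batch waits until the batch fills or the window expires
    max_added_latency = min(window_seconds, batch_size / rate_per_second)
    return {
        "messages_per_batch": messages_per_batch,
        "invocations_per_hour": invocations_per_hour,
        "max_added_latency_seconds": max_added_latency,
    }


if __name__ == "__main__":
    for window in (0, 5, 300):
        print(window, estimate_batching(rate_per_second=10, batch_size=100, window_seconds=window))
```

In this model, a longer window only reduces invocations while the batch size has not already been reached, which is why both settings need to be tuned together.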

To configure Maximum Batching on a function in the AWS Management Console, navigate to the function in the Lambda console. Create a new Trigger, or edit an existing one. Along with the Batch size you can configure a Batch window. The Trigger Configuration page is similar across the broker services.

Max batching trigger window

You can also use the AWS CLI to configure the --maximum-batching-window-in-seconds parameter.

For example, with Amazon MQ:

aws lambda create-event-source-mapping --function-name my-function \
--maximum-batching-window-in-seconds 300 --batch-size 100 --starting-position AT_TIMESTAMP \
--event-source-arn arn:aws:mq:us-east-1:123456789012:broker:ExampleMQBroker:b-24cacbb4-b295-49b7-8543-7ce7ce9dfb98
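The same setting can also be applied from Python with the AWS SDK (boto3). The sketch below is a minimal example under stated assumptions: the helper names build_mq_mapping_kwargs and create_mapping are hypothetical, and the queue name and secret ARN are values you would supply. It builds the request separately from sending it, so the 0 to 300 second range can be checked before any call is made.

```python
def build_mq_mapping_kwargs(function_name, broker_arn, queue_name, secret_arn,
                            batch_size=100, window_seconds=300):
    """Build the arguments for the Lambda CreateEventSourceMapping API (Amazon MQ source)."""
    if not 0 <= window_seconds <= 300:
        raise ValueError("MaximumBatchingWindowInSeconds must be between 0 and 300")
    return {
        "FunctionName": function_name,
        "EventSourceArn": broker_arn,
        "BatchSize": batch_size,
        "MaximumBatchingWindowInSeconds": window_seconds,
        "Queues": [queue_name],
        # Broker credentials are read from an AWS Secrets Manager secret
        "SourceAccessConfigurations": [{"Type": "BASIC_AUTH", "URI": secret_arn}],
    }


def create_mapping(kwargs):
    """Send the request; requires boto3 and AWS credentials in the environment."""
    import boto3  # imported here so the builder above can be used without boto3 installed

    return boto3.client("lambda").create_event_source_mapping(**kwargs)
```

Keeping the request builder free of AWS calls makes it easy to unit test the configuration before deploying it.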

You can use AWS CloudFormation to configure the parameter. The following example configures the MaximumBatchingWindowInSeconds as part of the AWS::Lambda::EventSourceMapping resource for Amazon MQ:

  LambdaFunctionEventSourceMapping:
    Type: AWS::Lambda::EventSourceMapping
    Properties:
      BatchSize: 10
      MaximumBatchingWindowInSeconds: 300
      Enabled: true
      Queues:
        - "MyQueue"
      EventSourceArn: !GetAtt MyBroker.Arn
      FunctionName: !GetAtt LambdaFunction.Arn
      SourceAccessConfigurations:
        - Type: BASIC_AUTH
          URI: !Ref secretARNParameter

You can also use AWS Serverless Application Model (AWS SAM) to configure the parameter as part of the Lambda function event source.

  MQReceiverFunction:
    Type: AWS::Serverless::Function
    Properties:
      FunctionName: MQReceiverFunction
      CodeUri: src/
      Handler: app.lambda_handler
      Runtime: python3.9
      Events:
        MQEvent:
          Type: MQ
          Properties:
            Broker: !Ref brokerARNParameter
            BatchSize: 10
            MaximumBatchingWindowInSeconds: 300
            Queues:
              - "workshop.queueC"
            SourceAccessConfigurations:
              - Type: BASIC_AUTH
                URI: !Ref secretARNParameter

Error handling

If your function times out or returns an error for any of the messages in a batch, Lambda retries the whole batch until processing succeeds or the messages expire.

When a function encounters an unrecoverable error, the event source mapping is paused and the consumer stops processing records. Any other consumers can continue processing, provided that they do not encounter the same error. If your Lambda event records exceed the allowed size limit of 6 MB, they can go unprocessed.

For Amazon MQ, you can redeliver messages when there’s a function error. You can configure dead-letter queues (DLQs) for both Apache ActiveMQ and RabbitMQ. For RabbitMQ, you can set a per-message TTL to move failed messages to a DLQ.

Since the same event may be received more than once, functions should be designed to be idempotent. This means that receiving the same event multiple times does not change the result beyond the first time the event was received.
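One common way to make a handler idempotent is to record the ID of each message that has been applied and skip repeats. The sketch below illustrates the idea for an Amazon MQ event. It assumes each entry in the event's messages list carries messageID and base64-encoded data fields, and it uses an in-memory set as a stand-in for a durable store (for example, a database table with a conditional write), which a real function would need because in-memory state does not survive across execution environments.

```python
import base64

# Stand-in for a durable store of processed message IDs (assumption for this sketch)
_processed_ids = set()


def apply_once(message_id, body, seen=_processed_ids):
    """Apply the side effect for a message only the first time its ID is seen."""
    if message_id in seen:
        return False  # duplicate delivery: nothing to do
    # ... perform the real side effect with `body` here ...
    seen.add(message_id)
    return True


def lambda_handler(event, context):
    applied = 0
    for message in event.get("messages", []):
        body = base64.b64decode(message["data"]).decode("utf-8")
        if apply_once(message["messageID"], body):
            applied += 1
    return {"applied": applied}
```

Because a retried batch contains the same message IDs, replaying it leaves the result unchanged.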

Conclusion

Lambda supports a number of event sources including message broker services like Amazon MQ and Amazon MSK. This post explains how batching works with the event sources and how messages are sent to the Lambda function.

Previously, you could only control the batch size. The new Batch Window control allows you to set the maximum amount of time, in seconds, that Lambda spends gathering records before invoking the function. This can increase the overall throughput of message processing and reduce Lambda invocations, which may improve cost.

For more serverless learning resources, visit Serverless Land.

How Meshify Built an Insurance-focused IoT Solution on AWS

2022-01-03 Grant Fisher

Post Syndicated from Grant Fisher original https://aws.amazon.com/blogs/architecture/how-meshify-built-an-insurance-focused-iot-solution-on-aws/

The ability to analyze your Internet of Things (IoT) data can help you prevent loss, improve safety, boost productivity, and even develop an entirely new business model. This data is even more valuable, with the ever-increasing number of connected devices. Companies use Amazon Web Services (AWS) IoT services to build innovative solutions, including secure edge device connectivity, ingestion, storage, and IoT data analytics.

This post describes Meshify’s IoT sensor solution, built on AWS, that helps businesses and organizations prevent property damage and avoid loss for the property-casualty insurance industry. The solution uses real-time data insights, which result in fewer claims, better customer experience, and innovative new insurance products.

Through low-power, long-range IoT sensors, and dedicated applications, Meshify can notify customers of potential problems like rapid temperature decreases that could result in freeze damage, or rising humidity levels that could lead to mold. These risks can then be averted, instead of leading to costly damage that can impact small businesses and the insurer’s bottom line.

Architecture building blocks

The three building blocks of this technical architecture are the edge portfolio, data ingestion, and data processing and analytics, shown in Figure 1.

Figure 1. Building blocks of Meshify’s technical architecture

I. Edge portfolio (EP)

Starting with the edge sensors, the Meshify edge portfolio covers two types of sensors:

LoRaWAN (Low power, long range WAN) sensor suite. This sensor provides the long connectivity range (> 1000 feet) and extended battery life (~ 5 years) needed for enterprise environments.
Cellular-based sensors. This sensor is a narrow band/LTE-M device that operates at LTE-M band 2/4/12 radio frequency and uses edge intelligence to conserve battery life.

II. Data ingestion (DI)

For the LoRaWAN solution, aggregated sensor data at the Meshify gateway is sent to AWS using AWS IoT Core and Meshify’s REST service endpoints. AWS IoT Core is a managed cloud platform that lets IoT devices easily and securely connect using multiple protocols like HTTP, MQTT, and WebSockets. It expands its protocol coverage through a new fully managed feature called AWS IoT Core for LoRaWAN. This gives Meshify the ability to connect LoRaWAN wireless devices with the AWS Cloud. AWS IoT Core for LoRaWAN delivers a LoRaWAN network server (LNS) that provides gateway management using the Configuration and Update Server (CUPS) and Firmware Updates Over-The-Air (FUOTA) capabilities.

III. Data processing and analytics (DPA)

Initial processing of the data is done at the ingestion layer, using Meshify REST API endpoints and the Rules Engine of AWS IoT Core. Meshify applies filtering logic to route relevant events to Amazon Managed Streaming for Apache Kafka (Amazon MSK). Amazon MSK is an AWS streaming data service that manages Apache Kafka infrastructure and operations, streamlining the process of running Apache Kafka applications on AWS.

Meshify’s applications then consume the events from Amazon MSK per the configured topic subscription. They enrich the events and correlate them with records stored in a managed service, Amazon Relational Database Service (Amazon RDS). These applications run as scalable containers on another managed service, Amazon Elastic Kubernetes Service (Amazon EKS), which runs container applications.

Bringing it all together – technical workflow

In Figure 2, we illustrate the technical workflow from the ingestion of field events to their processing, enrichment, and persistence. Finally, we use these events to power risk avoidance decision-making.

Figure 2. Technical workflow for Meshify IoT architecture

After installation, Meshify-designed LoRa sensors transmit information to the cloud through Meshify’s gateways. LoRaWAN capabilities create connectivity between the sensors and the gateways. They establish a low power, wide area network protocol that securely transmits data over a long distance, through walls and floors of even the largest buildings.
The Meshify Gateway is a redundant edge system, capable of sending sensor data from various sensors to the Meshify cloud environment. Once the LoRa sensor information is received by the Meshify Gateway, it converts the incoming radio frequency (RF) signals, which supports a faster transfer rate to Meshify’s cloud environment.
Data from the Meshify Gateway and sensors is initially processed at Meshify’s AWS IoT Core and REST service endpoints. These destinations for IoT streaming data help with the initial intake and introduce field data to the Meshify cloud environment. The initial ingestion points can scale automatically based upon the volume of sensor data received. This enables rapid scaling and ease of implementation.
After the data has entered the Meshify cloud environment, Meshify uses Amazon EKS and Amazon MSK to process the incoming data stream. Amazon MSK producer and consumer applications within the EKS systems enrich the data streams for the end users and systems to consume.
Producer applications running on EKS send processed events to the Amazon MSK service. These events include storing and retrieval of raw data, enriched data, and system-level data.
Consumer applications hosted on the EKS pods receive events per the subscribed Amazon MSK topic. Web, mobile, and analytic applications enrich and use these data streams to display data to end users, business teams, and systems operations.
Processed events are persisted in Amazon RDS. The databases are used for reporting, machine learning, and other analytics and processing services.

Building a scalable IoT solution

Meshify first began work on the Meshify sensors and hosted platform in 2012. In the ensuing decade, Meshify has successfully created a platform that auto-scales on demand with steady, predictable performance. This gives Meshify the ability to use only the resources needed while keeping the capacity to handle unexpectedly large volumes of data.

As the platform scaled, so did the volume of sensor data, operations and diagnostics data, and metadata from installations and deployments. Building an end-to-end data pipeline that integrates these different data sources and delivers correlated insights at low latency was time well spent.

Conclusion

In this post, we’ve shown how Meshify is using AWS services to power their suite of IoT sensors, software, and data platforms. Meshify’s most important architectural enhancements have involved the introduction of managed services, notably AWS IoT Core for LoRaWAN and Amazon MSK. These improvements have primarily focused on the data ingestion, data processing, and analytics stages.

Meshify continues to power the data revolution at the intersection of IoT and insurance at the edge, using AWS. Looking ahead, Meshify and HSB are excited at the prospect of scaling the relationship with AWS from cloud computing to the world of edge devices.

Learn more about how emerging startups and large enterprises are using AWS IoT services to build differentiated products.

Meshify is an IoT technology company and subsidiary of HSB, based in Austin, TX. Meshify builds pioneering sensor hardware, software, and data analytics solutions that protect businesses from property and equipment damage.

Offset lag metric for Amazon MSK as an event source for Lambda

2021-11-23 Eric Johnson

Post Syndicated from Eric Johnson original https://aws.amazon.com/blogs/compute/offset-lag-metric-for-amazon-msk-as-an-event-source-for-lambda/

This post is written by Adam Wagner, Principal Serverless Solutions Architect.

Last year, AWS announced support for Amazon Managed Streaming for Apache Kafka (MSK) and self-managed Apache Kafka clusters as event sources for AWS Lambda. Today, AWS adds a new OffsetLag metric to Lambda functions with MSK or self-managed Apache Kafka event sources.

Offset in Apache Kafka is an integer that marks the current position of a consumer. OffsetLag is the difference in offset between the last record written to the Kafka topic and the last record processed by Lambda. Kafka expresses this in the number of records, not a measure of time. This metric provides visibility into whether your Lambda function is keeping up with the records added to the topic it is processing.
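As a small worked example of that definition, the following sketch computes the lag per partition from two sets of offsets. The function name offset_lag and the dictionaries of offsets are assumptions for illustration; Lambda publishes the OffsetLag metric for you, so this only shows the arithmetic behind it.

```python
def offset_lag(last_written, last_processed):
    """Return the lag, in records, for each partition.

    Both arguments map a partition number to an offset: the offset of the last
    record written to the topic and the offset of the last record processed.
    """
    return {
        partition: last_written[partition] - last_processed.get(partition, 0)
        for partition in last_written
    }


if __name__ == "__main__":
    written = {0: 120, 1: 80}
    processed = {0: 112, 1: 80}
    print(offset_lag(written, processed))
```

A lag that stays near zero means the function is keeping up, while a lag that keeps growing means records are arriving faster than they are processed.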

This blog walks through using the OffsetLag metric along with other Lambda and MSK metrics to understand your streaming application and optimize your Lambda function.

Overview

In this example application, a producer writes messages to a topic on the MSK cluster that is an event source for a Lambda function. Each message contains a number and the Lambda function finds the factors of that number. It outputs the input number and results to an Amazon DynamoDB table.

Finding all the factors of a number is fast if the number is small but takes longer for larger numbers. This difference means the size of the number written to the MSK topic influences the Lambda function duration.

Example application architecture

A Kafka client writes messages to a topic in the MSK cluster.
The Lambda event source polls the MSK topic on your behalf for new messages and triggers your Lambda function with batches of messages.
The Lambda function factors the number in each message and then writes the results to DynamoDB.

In this application, several factors can contribute to offset lag. The first is the volume and size of messages. If more messages are coming in, the Lambda function may take longer to process them. Other factors are the number of partitions in the topic, and the number of concurrent Lambda functions processing messages. A full explanation of how Lambda concurrency scales with the MSK event source is in the documentation.

If the average duration of your Lambda function increases, this also tends to increase the offset lag. The longer duration could be caused by latency in a downstream service or by the complexity of the incoming messages. Lastly, if your Lambda function errors, the MSK event source retries the identical set of records until it succeeds. This retry functionality also increases offset lag.

Measuring OffsetLag

To understand how the new OffsetLag metric works, you first need a working MSK topic as an event source for a Lambda function. Follow this blog post to set up an MSK instance.

To find the OffsetLag metric, go to the CloudWatch console, select All Metrics from the left-hand menu. Then select Lambda, followed by By Function Name to see a list of metrics by Lambda function. Scroll or use the search bar to find the metrics for this function and select OffsetLag.

OffsetLag metric example

To make it easier to look at multiple metrics at once, create a CloudWatch dashboard starting with the OffsetLag metric. Select Actions -> Add to Dashboard. Select the Create new button, provide the dashboard a name. Choose Create, keeping the rest of the options at the defaults.

Adding OffsetLag to dashboard

After choosing Add to dashboard, the new dashboard appears. Choose the Add widget button to add the Lambda duration metric from the same function. Add another widget that combines both Lambda errors and invocations for the function. Then add a widget for the BytesInPerSec metric for the MSK topic. Find this metric under AWS/Kafka -> Broker ID, Cluster Name, Topic. Finally, choose Save dashboard.

After a few minutes, you see a steady stream of invocations, as you would expect when consuming from a busy topic.

Data incoming to dashboard

This example is a CloudWatch dashboard showing the Lambda OffsetLag, Duration, Errors, and Invocations, along with the BytesInPerSec for the MSK topic.

In this example, the OffsetLag metric is averaging about eight, indicating that the Lambda function is eight records behind the latest record in the topic. While this is acceptable, there is room for improvement.

The first thing to look for is Lambda function errors, which can drive up offset lag. The metrics show that there are no errors so the next step is to evaluate and optimize the code.

The Lambda handler function loops through the records and calls the process_msg function on each record:

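def lambda_handler(event, context):
    # event['records'] maps each topic-partition key to a list of records
    for batch in event['records'].keys():
        for record in event['records'][batch]:
            try:
                process_msg(record)
            except Exception as err:
                # Log and continue so one bad record does not fail the whole batch
                print("error processing record:", record, err)
    return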

The process_msg function handles base64 decoding, calls a factor function to factor the number, and writes the record to a DynamoDB table:

import base64
import json

# MAX_NUMBER and ddb_table are defined at module level (not shown)

def process_msg(record):
    # Messages are base64 encoded, so decode them here
    msg_value = base64.b64decode(record['value']).decode()
    msg_dict = json.loads(msg_value)
    # Use the number as the hash key in the DynamoDB table
    msg_id = f"{msg_dict['number']}"
    if msg_dict['number'] <= MAX_NUMBER:
        factors = factor(msg_dict['number'])
        print(f"number: {msg_dict['number']} has factors: {factors}")
        item = {'msg_id': msg_id, 'msg': msg_value, 'factors': factors}
        ddb_table.put_item(Item=item)
    else:
        print(f"ERROR: {msg_dict['number']} is > limit of {MAX_NUMBER}")

The heavy computation takes place in the factor function:

def factor(number):
    factors = [1,number]
    for x in range(2, (int(1 + number / 2))):
        if (number % x) == 0:
            factors.append(x)
    return factors

The code loops through every number up to half of the number being factored. Because divisors come in pairs (x and number // x), the loop only needs to run up to the square root of the number, recording both members of each pair:

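import math

def factor(number):
    factors = [1, number]
    # Divisors come in pairs (x, number // x), so stop at the integer square root
    for x in range(2, math.isqrt(number) + 1):
        if (number % x) == 0:
            factors.append(x)
            # For perfect squares the pair is (x, x); avoid adding it twice
            if x != number // x:
                factors.append(number // x)
    return factors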

There are further optimizations and libraries for factoring numbers but this provides a noticeable performance improvement in this example.
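To check that an optimization like this does not change the results, it helps to compare both versions on a range of inputs and look at how many loop iterations each one needs. The sketch below does that. The names factor_naive and factor_sqrt are introduced here for the comparison, and factor_sqrt skips the duplicate divisor that a perfect square would otherwise produce.

```python
import math


def factor_naive(number):
    """Check every candidate divisor up to half of the number."""
    factors = [1, number]
    for x in range(2, int(1 + number / 2)):
        if number % x == 0:
            factors.append(x)
    return factors


def factor_sqrt(number):
    """Check candidates only up to the integer square root, adding divisors in pairs."""
    factors = [1, number]
    for x in range(2, math.isqrt(number) + 1):
        if number % x == 0:
            factors.append(x)
            if x != number // x:  # avoid a duplicate for perfect squares
                factors.append(number // x)
    return factors


def loop_iterations(number):
    """Return the loop lengths of the naive and square-root versions."""
    return len(range(2, int(1 + number / 2))), len(range(2, math.isqrt(number) + 1))


if __name__ == "__main__":
    for n in range(2, 2000):
        assert sorted(factor_naive(n)) == sorted(factor_sqrt(n)), n
    print("iterations for 1,000,000:", loop_iterations(1_000_000))
```

The square-root version returns the factors in a different order, so the comparison sorts both lists first.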

Data after optimization

After deploying the code, refresh the metrics after a while to see the improvements:

The average Lambda duration has dropped to single-digit milliseconds and the OffsetLag is now averaging two.

If you see a noticeable change in the OffsetLag metric, there are several things to investigate. Start with the input side of the system: an increase in messages per second or a significant increase in message size are two likely causes.

Conclusion

This post walks through using the OffsetLag metric to understand the lag between the latest messages in the MSK topic and the records a Lambda function is processing. It also reviews other metrics that help you understand the underlying cause of increases in offset lag. For more information on this topic, refer to the documentation and other MSK Lambda metrics.

For more serverless learning resources, visit Serverless Land.

Introducing mutual TLS authentication for Amazon MSK as an event source

2021-11-19 Julian Wood

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/introducing-mutual-tls-authentication-for-amazon-msk-as-an-event-source/

This post is written by Uma Ramadoss, Senior Specialist Solutions Architect, Integration.

Today, AWS Lambda is introducing mutual TLS (mTLS) authentication for Amazon Managed Streaming for Apache Kafka (Amazon MSK) and self-managed Kafka as an event source.

Many customers use Amazon MSK for streaming data from multiple producers. Multiple subscribers can then consume the streaming data and build data pipelines, analytics, and data integration. To learn more, read Using Amazon MSK as an event source for AWS Lambda.

You can activate any combination of authentication modes (mutual TLS, SASL SCRAM, or IAM access control) on new or existing clusters. This is useful if you are migrating to a new authentication mode or must run multiple authentication modes simultaneously. Lambda natively supports consuming messages from both self-managed Kafka and Amazon MSK through event source mapping.

By default, the TLS protocol only requires a server to authenticate itself to the client. The authentication of the client to the server is managed by the application layer. The TLS protocol also offers the ability for the server to request that the client send an X.509 certificate to prove its identity. This is called mutual TLS as both parties are authenticated via certificates with TLS.
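The difference between one-way and mutual TLS can be illustrated with Python's standard ssl module. This is a generic sketch of the concept, not how Lambda or Amazon MSK are implemented: the function names and the optional file path parameters are assumptions for this example. The server-side context is what makes the handshake mutual, because it is configured to require a client certificate.

```python
import ssl


def server_context_requiring_client_cert(ca_file=None, cert_file=None, key_file=None):
    """Server-side context that asks for, and requires, a client certificate."""
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    ctx.verify_mode = ssl.CERT_REQUIRED  # without this, only the server is authenticated
    if ca_file:
        ctx.load_verify_locations(cafile=ca_file)  # CA that signed the client certificates
    if cert_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)  # the server's own identity
    return ctx


def client_context_with_cert(cert_file=None, key_file=None, password=None):
    """Client-side context that verifies the server and can present its own certificate."""
    ctx = ssl.create_default_context()  # verifies the server certificate and host name
    if cert_file:
        # The password is only needed when the private key is encrypted
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file, password=password)
    return ctx
```

The client-side inputs here (certificate, private key, optional password) correspond to the three items that Lambda needs for mutual TLS.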

Mutual TLS is a commonly used authentication mechanism for business-to-business (B2B) applications. It’s used in standards such as Open Banking, which enables secure open API integrations for financial institutions. It is one of the popular authentication mechanisms for customers using Kafka.

To use mutual TLS authentication for your Kafka-triggered Lambda functions, you provide a signed client certificate, the private key for the certificate, and an optional password if the private key is encrypted. This establishes a trust relationship between Lambda and Amazon MSK or self-managed Kafka. Lambda supports self-signed server certificates or server certificates signed by a private certificate authority (CA) for self-managed Kafka. Lambda trusts the Amazon MSK certificate by default as the certificates are signed by Amazon Trust Services CAs.

This blog post explains how to set up a Lambda function to process messages from an Amazon MSK cluster using mutual TLS authentication.

Overview

Using Amazon MSK as an event source operates in a similar way to using Amazon SQS or Amazon Kinesis. You create an event source mapping by attaching Amazon MSK as event source to your Lambda function.

The Lambda service internally polls for new records from the event source, reading the messages from one or more partitions in batches. It then synchronously invokes your Lambda function, sending each batch as an event payload. Lambda continues to process batches until there are no more messages in the topic.

The Lambda function’s event payload contains an array of records. Each array item contains details of the topic and Kafka partition identifier, together with a timestamp and base64 encoded message.

Kafka event payload
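A minimal sketch of reading such a payload in Python follows. It assumes the event's records field maps each topic-partition key to a list of records with topic, partition, offset, timestamp, and base64-encoded value fields. The function name decode_kafka_event and the sample event are illustrative.

```python
import base64


def decode_kafka_event(event):
    """Flatten a Kafka event payload into a list of decoded records."""
    decoded = []
    for topic_partition, records in event["records"].items():
        for record in records:
            decoded.append({
                "topic": record["topic"],
                "partition": record["partition"],
                "offset": record["offset"],
                "timestamp": record["timestamp"],
                # Message values arrive base64 encoded
                "value": base64.b64decode(record["value"]).decode("utf-8"),
            })
    return decoded


if __name__ == "__main__":
    sample_event = {
        "records": {
            "mytopic-0": [{
                "topic": "mytopic",
                "partition": 0,
                "offset": 15,
                "timestamp": 1637280000000,
                "value": base64.b64encode(b'{"number": 12}').decode("ascii"),
            }]
        }
    }
    print(decode_kafka_event(sample_event))
```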

You store the signed client certificate, the private key for the certificate, and an optional password if the private key is encrypted as a secret in AWS Secrets Manager. You provide the secret in the Lambda event source mapping.

The steps for using mutual TLS authentication for Amazon MSK as event source for Lambda are:

Create a private certificate authority (CA) using AWS Certificate Manager (ACM) Private Certificate Authority (PCA).
Create a client certificate and private key. Store them as a secret in AWS Secrets Manager.
Create an Amazon MSK cluster and a consuming Lambda function using the AWS Serverless Application Model (AWS SAM).
Attach the event source mapping.

This blog walks through these steps in detail.

Prerequisites

Install AWS Command Line Interface (CLI) and AWS SAM CLI.
Install OpenSSL, jq, npm, and Git.

1. Creating a private CA.

To use mutual TLS client authentication with Amazon MSK, create a root CA using AWS ACM Private Certificate Authority (PCA). We recommend using independent ACM PCAs for each MSK cluster when you use mutual TLS to control access. This ensures that TLS certificates signed by PCAs only authenticate with a single MSK cluster.

From the AWS Certificate Manager console, choose Create a Private CA.
In the Select CA type panel, select Root CA and choose Next.

Select Root CA

In the Configure CA subject name panel, provide your certificate details, and choose Next.

Provide your certificate details

From the Configure CA key algorithm panel, choose the key algorithm for your CA and choose Next.

Configure CA key algorithm

From the Configure revocation panel, choose any optional certificate revocation options you require and choose Next.

Configure revocation

Continue through the screens to add any tags required, allow ACM to renew certificates, review your options, and confirm pricing. Choose Confirm and create.
Once the CA is created, choose Install CA certificate to activate your CA. Configure the validity of the certificate and the signature algorithm and choose Next.

Configure certificate

Review the certificate details and choose Confirm and install. Note down the Amazon Resource Name (ARN) of the private CA for the next section.

Review certificate details

2. Creating a client certificate.

You generate a client certificate using the root certificate you previously created, which is used to authenticate the client with the Amazon MSK cluster using mutual TLS. You provide this client certificate and the private key as AWS Secrets Manager secrets to the AWS Lambda event source mapping.

On your local machine, run the following command to create a private key and certificate signing request using OpenSSL. Enter your certificate details. This creates a private key file and a certificate signing request file in the current directory.

openssl req -new -newkey rsa:2048 -days 365 -keyout key.pem -out client_cert.csr -nodes

OpenSSL create a private key and certificate signing request

Use the AWS CLI to sign your certificate request with the private CA previously created. Replace Private-CA-ARN with the ARN of your private CA. The certificate validity value is set to 300 days; change this if necessary. Save the certificate ARN provided in the response.

aws acm-pca issue-certificate --certificate-authority-arn Private-CA-ARN --csr fileb://client_cert.csr --signing-algorithm "SHA256WITHRSA" --validity Value=300,Type="DAYS"

Retrieve the certificate that ACM signed for you. Replace the Private-CA-ARN and Certificate-ARN with the ARN you obtained from the previous commands. This creates a signed certificate file called client_cert.pem.

aws acm-pca get-certificate --certificate-authority-arn Private-CA-ARN --certificate-arn Certificate-ARN | jq -r '.Certificate + "\n" + .CertificateChain' >> client_cert.pem

Create a new file called secret.json with the following structure

{
"certificate":"",
"privateKey":""
}

Copy the contents of client_cert.pem into certificate and the contents of key.pem into privateKey. Ensure that there are no extra spaces added. The file structure looks like this:

Certificate file structure
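Copying multi-line PEM content into JSON by hand is error-prone, so a short script can build the file instead. The sketch below is one possible approach: the function names are hypothetical, and it relies on the JSON encoder to escape the line breaks in the PEM text. It uses only the two keys shown above.

```python
import json


def build_secret_json(certificate_pem, private_key_pem):
    """Return the secret.json content for the client certificate and private key."""
    secret = {
        # strip() removes stray leading or trailing whitespace; inner line breaks are kept
        "certificate": certificate_pem.strip(),
        "privateKey": private_key_pem.strip(),
    }
    return json.dumps(secret)


def write_secret_file(cert_path="client_cert.pem", key_path="key.pem", out_path="secret.json"):
    """Read the PEM files created in the previous steps and write secret.json."""
    with open(cert_path) as cert_file, open(key_path) as key_file:
        content = build_secret_json(cert_file.read(), key_file.read())
    with open(out_path, "w") as out_file:
        out_file.write(content)
```

Calling write_secret_file() in the directory that contains client_cert.pem and key.pem produces a secret.json that can be passed to the create-secret command in the next step.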

Create the secret and save the ARN for the next section.

aws secretsmanager create-secret --name msk/mtls/lambda/clientcert --secret-string file://secret.json

3. Setting up an Amazon MSK cluster with AWS Lambda as a consumer.

Amazon MSK is a highly available service, so it must be configured to run in a minimum of two Availability Zones in your preferred Region. To comply with security best practice, the brokers are usually configured in private subnets in each Availability Zone.

You can use the AWS CLI, AWS Management Console, AWS SDK, or AWS CloudFormation to create the cluster and the Lambda functions. This post uses AWS SAM to create the infrastructure, and the associated code is available in the GitHub repository.

The AWS SAM template creates the following resources:

Amazon Virtual Private Cloud (VPC).
Amazon MSK cluster with mutual TLS authentication.
Lambda function for consuming the records from the Amazon MSK cluster.
IAM roles.
Lambda function for testing the Amazon MSK integration by publishing messages to the topic.

The VPC has public and private subnets in two Availability Zones with the private subnets configured to use a NAT Gateway. You can also set up VPC endpoints with PrivateLink to allow the Amazon MSK cluster to communicate with Lambda. To learn more about different configurations, see this blog post.

The Lambda function requires permission to describe VPCs and security groups, and manage elastic network interfaces to access the Amazon MSK data stream. The Lambda function also needs two Kafka permissions: kafka:DescribeCluster and kafka:GetBootstrapBrokers. The AWS managed policy AWSLambdaMSKExecutionRole includes these permissions. The Lambda function also requires permission to get the secret value from AWS Secrets Manager for the secret you configure in the event source mapping.

  ConsumerLambdaFunctionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaMSKExecutionRole
      Policies:
        - PolicyName: SecretAccess
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action: "SecretsManager:GetSecretValue"
                Resource: "*"
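The inline policy above grants secretsmanager:GetSecretValue on every secret in the account. As one possible tightening, which the sample template does not do, the statement can be scoped to the single secret used by the event source mapping. The sketch below builds such a policy document; the function name secret_access_policy and the example ARN are hypothetical placeholders.

```python
import json


def secret_access_policy(secret_arn: str) -> dict:
    """Build an IAM policy document limited to one Secrets Manager secret."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "secretsmanager:GetSecretValue",
                # Scope access to the client certificate secret only.
                "Resource": secret_arn,
            }
        ],
    }


# Placeholder ARN for illustration; substitute the ARN returned by create-secret.
EXAMPLE_SECRET_ARN = (
    "arn:aws:secretsmanager:us-east-1:111122223333:"
    "secret:msk/mtls/lambda/clientcert-AbCdEf"
)
print(json.dumps(secret_access_policy(EXAMPLE_SECRET_ARN), indent=2))
```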

This release adds two new SourceAccessConfiguration types to the Lambda event source mapping:

1. CLIENT_CERTIFICATE_TLS_AUTH – (Amazon MSK, Self-managed Apache Kafka) The Secrets Manager ARN of the secret containing the certificate chain (PEM), private key (PKCS#8 PEM), and optional private key password used for mutual TLS authentication with your Amazon MSK or Apache Kafka brokers. A private key password is required if the private key is encrypted.

2. SERVER_ROOT_CA_CERTIFICATE – This is only for self-managed Apache Kafka. This contains the Secrets Manager ARN of your secret containing the root CA certificate used by your Apache Kafka brokers in PEM format. This is not applicable for Amazon MSK as Amazon MSK brokers use public AWS Certificate Manager certificates which are trusted by AWS Lambda by default.
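To illustrate how the two types combine, the following sketch builds the list passed as source access configurations. It assumes the Type and URI keys used by the AWS CLI command later in this walkthrough; the function name and the example ARNs are hypothetical.

```python
import json
from typing import Dict, List, Optional


def source_access_configurations(
    client_cert_secret_arn: str,
    server_root_ca_secret_arn: Optional[str] = None,
) -> List[Dict[str, str]]:
    """Build the source access configurations for mutual TLS."""
    configs = [
        {"Type": "CLIENT_CERTIFICATE_TLS_AUTH", "URI": client_cert_secret_arn}
    ]
    # Only self-managed Apache Kafka needs the broker root CA certificate.
    if server_root_ca_secret_arn is not None:
        configs.append(
            {"Type": "SERVER_ROOT_CA_CERTIFICATE", "URI": server_root_ca_secret_arn}
        )
    return configs


# Amazon MSK: the client certificate secret is enough (placeholder ARN).
print(json.dumps(source_access_configurations("arn:aws:secretsmanager:example:client")))
```

For a self-managed cluster, pass the second ARN as well so that Lambda can verify the broker certificates.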

Deploying the resources:

To deploy the example application:

Clone the GitHub repository

git clone https://github.com/aws-samples/aws-lambda-msk-mtls-integration.git

Navigate to the aws-lambda-msk-mtls-integration directory. Copy the client certificate file and the private key file to the producer Lambda function code.

cd aws-lambda-msk-mtls-integration
cp ../client_cert.pem code/producer/client_cert.pem
cp ../key.pem code/producer/client_key.pem

Navigate to the code directory and build the application artifacts using the AWS SAM build command.

cd code
sam build

Run sam deploy to deploy the infrastructure. Provide the stack name, AWS Region, and the ARN of the private CA created in section 1. Provide any additional information that sam deploy requests, then deploy the stack.

sam deploy -g

Running sam deploy -g

The stack deployment takes about 30 minutes to complete. Once complete, note the output values.

Create the event source mapping for the Lambda function. Replace CONSUMER_FUNCTION_NAME and MSK_CLUSTER_ARN with the values from the output of the stack created by the AWS SAM template. Replace SECRET_ARN with the ARN of the AWS Secrets Manager secret created previously.

aws lambda create-event-source-mapping --function-name CONSUMER_FUNCTION_NAME --batch-size 10 --starting-position TRIM_HORIZON --topics exampleTopic --event-source-arn MSK_CLUSTER_ARN --source-access-configurations '[{"Type": "CLIENT_CERTIFICATE_TLS_AUTH","URI": "SECRET_ARN"}]'

Navigate one directory level up and configure the producer function with the Amazon MSK broker details. Replace PRODUCER_FUNCTION_NAME and MSK_CLUSTER_ARN with the values from the output of the stack created by the AWS SAM template.

cd ../
./setup_producer.sh MSK_CLUSTER_ARN PRODUCER_FUNCTION_NAME

Verify that the event source mapping state is enabled before moving on to the next step. Replace UUID with the value returned when you created the event source mapping.

aws lambda get-event-source-mapping --uuid UUID
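The command returns a JSON document whose State field reports the mapping status. As a small sketch, the check can be scripted by parsing that output; the helper name mapping_is_enabled is hypothetical, and the sample response is trimmed to placeholder values.

```python
import json


def mapping_is_enabled(cli_output: str) -> bool:
    """Return True when the event source mapping reports the Enabled state."""
    mapping = json.loads(cli_output)
    return mapping.get("State") == "Enabled"


# Trimmed, illustrative response; a real response contains more fields.
sample_output = '{"UUID": "00000000-0000-0000-0000-000000000000", "State": "Creating"}'
print(mapping_is_enabled(sample_output))
```

A mapping that is still being created reports a different state, so rerun the command until the check passes.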

Publish messages using the producer. Replace PRODUCER_FUNCTION_NAME with the value from the output of the stack created by the AWS SAM template. The following command creates a Kafka topic called exampleTopic and publishes 100 messages to the topic.

./produce.sh PRODUCER_FUNCTION_NAME exampleTopic 100

Verify that the consumer Lambda function receives and processes the messages by checking its Amazon CloudWatch log group. Navigate to the log group by searching for aws/lambda/{stackname}-MSKConsumerLambda in the search bar.

Consumer function log stream
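For reference, a consumer that produces log lines like these can be quite small. The sketch below is not the handler from the sample repository, which may be implemented differently. It assumes the Kafka event shape that Lambda delivers: a records object keyed by topic and partition, where each record carries a base64-encoded value.

```python
import base64


def lambda_handler(event, context):
    """Decode and log every Kafka record in the batch."""
    decoded = []
    # Keys look like "exampleTopic-0"; each maps to a list of records.
    for topic_partition, records in event.get("records", {}).items():
        for record in records:
            payload = base64.b64decode(record["value"]).decode("utf-8")
            print(
                f"{record['topic']}[{record['partition']}] "
                f"offset={record['offset']}: {payload}"
            )
            decoded.append(payload)
    return {"processed": len(decoded)}
```

Anything printed by the handler appears in the function's CloudWatch log stream.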

Conclusion

Lambda now supports mutual TLS authentication for Amazon MSK and self-managed Kafka as an event source. You now have the option to provide a client certificate to establish a trust relationship between Lambda and MSK or self-managed Kafka brokers. It supports configuration via the AWS Management Console, AWS CLI, AWS SDK, and AWS CloudFormation.

To learn more about how to use mutual TLS authentication for your Kafka-triggered AWS Lambda function, visit AWS Lambda with self-managed Apache Kafka and Using AWS Lambda with Amazon MSK.

Now Available: Updated guidance on the Data Analytics Lens for AWS Well-Architected Framework

2021-10-29 Wallace Printz

Post Syndicated from Wallace Printz original https://aws.amazon.com/blogs/big-data/now-available-updated-guidance-on-the-data-analytics-lens-for-aws-well-architected-framework/

Nearly all businesses today require some form of data analytics processing, from auditing user access to generating sales reports. For all your analytics needs, the Data Analytics Lens for AWS Well-Architected Framework provides prescriptive guidance to help you assess your workloads and identify best practices aligned to the AWS Well-Architected Pillars: Operational Excellence, Security, Reliability, Performance Efficiency, and Cost Optimization. Today, we’re pleased to announce a completely revised and updated version of the Data Analytics Lens whitepaper.

Self-assess with Well-Architected design principles

The updated version of the Data Analytics Lens whitepaper has been revised to provide guidance to CxOs as well as all data personas. Within each of the five Well-Architected Pillars, we provide top-level design principles for CxOs to quickly identify areas for teams and fundamental rules that analytics workloads designers should follow. Each design principle is followed by a series of questions and best practices that architects and system designers can use to perform self-assessments. Additionally, the Data Analytics Lens includes suggestions that prescriptively explain steps to implement best practices useful for implementation teams.

For example, the Security Pillar design principle “Control data access” works with the best practice to build user identity solutions that uniquely identify people and systems. The associated suggestion for this best practice is to centralize workforce identities, which details how to use this principle and includes links to more documentation on the suggestion.

“Building Data Analytics platform or workloads is one of the complex architecture patterns. It involves multi-layered approach such as Data Ingestion, Data Landing, Transformation Layer, Analytical/Insight and Reporting. Choices of technology and service for each of these layers are wide. The AWS Well-Architected Analytics Lens helps us to design and validate with great confidence against each of the pillars. Now Cognizant Architects can perform assessments using the Data Analytics Lens to validate and help build secure, scalable and innovative data solutions for customers.”

– Supriyo Chakraborty, Principal Architect & Head of Data Engineering Guild, Cognizant Germany
– Somasundaram Janavikulam, Cloud Enterprise Architect & Well Architected Partner Program Lead, Cognizant

In addition to performing your own assessment, AWS can provide a guided experience through reviewing your workload with a Well-Architected Framework Review engagement. For customers building data analytics workloads with AWS Professional Services, our teams of Data Architects can perform assessments using the Data Analytics Lens during the project engagements. This provides you with an objective assessment of your workloads and guidance on future improvements. The integration is available now for customers of the AWS Data Lake launch offering, with additional Data Analytics offerings coming in 2022. Reach out to your AWS Account Team if you’d like to know more about these guided Reviews.

Updated architectural patterns and scenarios

In this version of the Data Analytics Lens, we have also revised the discussion of data analytics patterns and scenarios to keep up with the industry and modern data analytics practices. Each scenario includes sections on characteristics that help you plan when developing systems for that scenario, a reference architecture to visualize and explain how the components work together, and configuration notes to help you properly configure your solution.

This version covers the following topics:

Building a modern data architecture (formerly Lake House Architecture)
Organize around data domains by delivering data as a product using a data mesh
Efficiently and securely provide batch data processing
Use streaming ingest and stream processing for real-time workloads
Build operational analytics systems to improve business processes and performance
Provide data visualization securely and cost-effectively at scale

Unlike in the first release, the machine learning and tenant analytics scenarios have been moved to the separate Machine Learning Lens and SaaS Lens whitepapers.

Conclusion

We expect this updated version will provide better guidance to validate your existing architectures, as well as provide recommendations for any gaps that are identified.

For more information about building your own Well-Architected systems using the Data Analytics Lens, see the Data Analytics Lens whitepaper.

Special thanks to everyone across the AWS Solution Architecture and Data Analytics communities who contributed. These contributions encompassed diverse perspectives, expertise, and experiences in developing the new AWS Well-Architected Data Analytics Lens.

About the Authors

Wallace Printz is a Senior Solutions Architect based in Austin, Texas. He helps customers across Texas transform their businesses in the cloud. He has a background in semiconductors, R&D, and machine learning.

Indira Balakrishnan is a Senior Solutions Architect in the AWS Analytics Specialist SA Team. She is passionate about helping customers build cloud-based analytics solutions to solve their business problems using data-driven decisions. Outside of work, she volunteers at her kids’ activities and spends time with her family.

Query your Amazon MSK topics interactively using Amazon Kinesis Data Analytics Studio

2021-10-05 Chinmayi Narasimhadevara

Post Syndicated from Chinmayi Narasimhadevara original https://aws.amazon.com/blogs/big-data/query-your-amazon-msk-topics-interactively-using-amazon-kinesis-data-analytics-studio/

Amazon Kinesis Data Analytics Studio makes it easy to analyze streaming data in real time and build stream processing applications powered by Apache Flink using standard SQL, Python, and Scala. With a few clicks on the AWS Management Console, you can launch a serverless notebook to query data streams and get results in seconds. Kinesis Data Analytics reduces the complexity of building and managing Apache Flink applications. Apache Flink is an open-source framework and engine for processing data streams. It’s highly available and scalable, delivering high throughput and low latency for stream processing applications.

If you’re running Apache Flink workloads, you may experience the non-trivial challenge of developing your distributed stream processing applications without having true visibility into the steps your application performs for data processing. Kinesis Data Analytics Studio combines the ease of use of Apache Zeppelin notebooks with the power of the Apache Flink processing engine to provide advanced streaming analytics capabilities in a fully managed offering. This accelerates developing and running stream processing applications that continuously generate real-time insights.

In this post, we introduce you to Kinesis Data Analytics Studio and how to get started querying data interactively from an Amazon Managed Streaming for Apache Kafka (Amazon MSK) cluster using SQL, Python, and Scala. We also demonstrate how to query data across different topics using Kinesis Data Analytics Studio. Kinesis Data Analytics Studio is also compatible with Amazon Kinesis Data Streams, Amazon Simple Storage Service (Amazon S3), and a variety of other data sources supported by Apache Flink.

Prerequisites

To get started, you must have the following prerequisites:

An MSK cluster
A data generator for populating data into the MSK cluster

To follow this guide and interact with your streaming data, you need a data stream with data flowing through.

Create and set up a Kafka cluster

You can create your Kafka cluster using either the Amazon MSK console or the AWS Command Line Interface (AWS CLI). For console instructions, see Getting Started Using Amazon MSK and creating Studio notebook with MSK.

You can either create topics and messages or use existing topics in the MSK cluster.

For this post, we have two topics in the MSK cluster, impressions and clicks, and they have the following fields in JSON format:

impressions – bid_id, campaign_id, country_code, creative_details, i_timestamp
clicks – correlation_id, tracker, c_timestamp

The correlation_id is the click correlation ID for a bid_id, so the field has common values across topics that we use for the join.

For the data in the MSK topic, we use the Amazon MSK Data Generator. Refer to the GitHub repo for setup and usage details. We use the adtech.json sample for this post.

The following are sample JSON records generated for the impressions topic:

{
   "country_code": "KN",
   "creative_details": "orchid",
   "i_timestamp": "Sat Jul 10 05:34:56 GMT 2021",
   "campaign_id": "1443403873",
   "bid_id": "0868262269"
}
{
   "country_code": "BO",
   "creative_details": "mint green",
   "i_timestamp": "Sat Jul 10 05:34:56 GMT 2021",
   "campaign_id": "1788762118",
   "bid_id": "1025543335"
}

The following are sample JSON records generated for the clicks topic:

{
   "c_timestamp": "Sat Jul 10 05:34:55 GMT 2021",
   "correlation_id": "0868262269",
   "tracker": "8q4rcfkbjnmicgo4rbw48xajokcm4xhcft7025ea1mt0htrfcvsgl1rusg8e8ez30p7orsmjx76vtrha2fi9qb3iaw8htd9uri9jauz64zdq8ldz7b0o8vzlkxs640hnwxgikpfvy5nno15c9etgrh79niku8hhtnxg94n03f2zci5ztv05jixu1r3p5yeehgm9kfd7szle9kikgo2xy5mlx09mmtlo9ndwqdznwjyj3yk02ufcwui1yvzveqfn"
}
{
   "c_timestamp": "Sat Jul 10 05:35:01 GMT 2021",
   "correlation_id": "0868262269",
   "tracker": "gfwq09yk0jwirg9mw60rrpu88h98tkd9xr645jsdoo7dwu24f8usha14uimtsfltvjmhl4i5rq24lz0aucqn6ji4da4xbo6db7lfezus7twhkw238dqw0pzdt98rn5lk8vf4tk6smkyyq38rhjaeh2ezsmlcg4v7im39u7knj10ofiint4fny0xcgqwta0uwq426oc21b1t8m446tmc6fyy7ops80xonzbzfc4a1xjd4x56x81uyg80dxyu2g7v"
}
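The timestamp fields in these records are plain strings, so any time-based processing has to parse them first. The following is a minimal sketch of parsing that format in plain Python, outside of Flink. The function name parse_event_time is hypothetical, and the sketch assumes an English locale for the day and month names and that the zone is always GMT, as in the samples.

```python
from datetime import datetime, timezone

# Matches strings such as "Sat Jul 10 05:34:56 GMT 2021".
TIMESTAMP_FORMAT = "%a %b %d %H:%M:%S %Z %Y"


def parse_event_time(raw: str) -> datetime:
    """Parse a generator timestamp string into a timezone-aware datetime."""
    # strptime accepts the literal zone name GMT for %Z but returns a
    # naive datetime, so the UTC zone is attached explicitly.
    return datetime.strptime(raw, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)


served = parse_event_time("Sat Jul 10 05:34:56 GMT 2021")
clicked = parse_event_time("Sat Jul 10 05:35:01 GMT 2021")
print((clicked - served).total_seconds())
```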

Create a Kinesis Data Analytics Studio notebook

You can start interacting with your data stream by following these simple steps:

On the Amazon MSK console, choose Process data in real time.
Choose Apache Flink – Studio Notebook.
Enter the name of your Kinesis Data Analytics Studio notebook and allow the notebook to create an AWS Identity and Access Management (IAM) role.

You can create a custom role for specific use cases on the IAM console.

Choose an AWS Glue database to store the metadata around your sources and destinations, which the notebook uses.
Choose Create Studio notebook.

We keep the default settings for the application and can scale up as needed.

After you create the application, choose Start to start the Apache Flink application.
When it’s complete (after a few minutes), choose Open in Apache Zeppelin.

To connect to an MSK cluster, you must specify the same VPC, subnets, and security groups for the Kinesis Data Analytics Studio notebook as were used to create the MSK cluster. If you chose Process data in real time during your setup, this is already set for you.

The Studio notebook is created with an IAM role for the notebook that grants the necessary access for the AWS Glue Data Catalog and tables.

Example applications

Apache Zeppelin supports the Apache Flink interpreter and allows for the use of Apache Flink directly within Zeppelin for interactive data analysis. Within the Flink interpreter, three languages are supported as of this writing: Scala, Python (PyFlink), and SQL. You must specify one of these languages at the top of each paragraph so that the notebook interprets the code properly:

%flink          - Scala environment 
%flink.pyflink  - Python Environment
%flink.ipyflink - ipython Environment
%flink.ssql     - Streaming SQL Environment
%flink.bsql     - Batch SQL Environment

There are several other predefined variables per interpreter, such as the senv variable in Scala for a StreamExecutionEnvironment, or s_env in Python for the same. You can review the full list of these entry point variables.

In this section, we show the same example code in all three languages to highlight the flexibility Zeppelin affords you for development.

SQL

We use the %flink.ssql(type=update) header to signify to the notebook that this paragraph will be interpreted as Flink SQL. We create two tables from the Kafka topics:

impressions – With bid_id, campaign_id, creative_details, country_code, and i_timestamp columns providing details of impressions in the system
clicks – With correlation_id, tracker, and c_timestamp providing details of the clicks for an impression.

The tables use the Kafka connector to read from the Kafka topics impressions and clicks in the us-east-1 Region, starting from the earliest offset.

As soon as this statement runs within a Zeppelin notebook, AWS Glue Data Catalog tables are created according to the declaration specified in the create statement, and the tables are available immediately for queries from the MSK cluster.

You don’t need to complete this step if your AWS Glue Data Catalog already contains the tables.

%flink.ssql(type=update)
CREATE TABLE impressions (
bid_id VARCHAR,
creative_details VARCHAR(10),
campaign_id VARCHAR,
country_code VARCHAR(5),
i_timestamp VARCHAR,
serve_time as TO_TIMESTAMP (`i_timestamp`, 'EEE MMM dd HH:mm:ss z yyyy'),
 WATERMARK FOR serve_time AS serve_time -INTERVAL '5' SECOND
)
PARTITIONED BY (bid_id)
WITH (
'connector'= 'kafka',
'topic' = 'impressions',
'properties.bootstrap.servers' = '<bootstrap servers shown in the MSK client info dialog>',
'format' = 'json',
'properties.group.id' = 'testGroup1',
'scan.startup.mode'= 'earliest-offset',
'json.timestamp-format.standard'= 'ISO-8601'
);

CREATE TABLE clicks (
correlation_id VARCHAR,
tracker VARCHAR(100),
c_timestamp VARCHAR,
click_time as TO_TIMESTAMP (`c_timestamp`, 'EEE MMM dd HH:mm:ss z yyyy'),
 WATERMARK FOR click_time AS click_time -INTERVAL '5' SECOND
)
PARTITIONED BY (correlation_id)
WITH (
'connector'= 'kafka',
'topic' = 'clicks',
'properties.bootstrap.servers' = '<bootstrap servers shown in the MSK client info dialog>',
'format' = 'json',
'properties.group.id' = 'testGroup1',
'scan.startup.mode'= 'earliest-offset',
'json.timestamp-format.standard'= 'ISO-8601'
);

The following screenshot is the AWS Glue Data Catalog view, which shows the tables that represent MSK topics.

In the preceding tables, WATERMARK FOR serve_time AS serve_time - INTERVAL '5' SECOND means that we can tolerate out-of-order delivery of events in the timeframe of 5 seconds and still produce correct results.
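To make the watermark idea concrete, the following plain-Python sketch models a simplified version of it: the watermark trails the largest event time seen so far by a fixed delay, and an event counts as late only if it arrives behind the watermark. This is an illustration under simplifying assumptions, not Flink's implementation; the class name is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


class BoundedOutOfOrdernessWatermark:
    """Track a watermark that lags the maximum event time by max_delay."""

    def __init__(self, max_delay: timedelta):
        self.max_delay = max_delay
        self.max_event_time: Optional[datetime] = None

    def current(self) -> Optional[datetime]:
        """Return the current watermark, or None before any event is seen."""
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.max_delay

    def observe(self, event_time: datetime) -> bool:
        """Record an event; return True if it arrived behind the watermark."""
        watermark = self.current()
        late = watermark is not None and event_time < watermark
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        return late


wm = BoundedOutOfOrdernessWatermark(timedelta(seconds=5))
start = datetime(2021, 7, 10, 5, 34, 56)
for offset in (0, 10, 7, 2):
    # Offsets 7 and 2 arrive out of order; only offset 2 is more than
    # 5 seconds behind the newest event seen so far.
    print(offset, wm.observe(start + timedelta(seconds=offset)))
```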

After you create the tables, run a query that calculates the number of impressions within a tumbling window of 60 seconds broken down by campaign_id and creative_details:

%flink.ssql(type=update)
SELECT 
 campaign_id, 
 creative_details, 
 TUMBLE_ROWTIME(serve_time, INTERVAL '60' SECOND) 
   AS window_end, COUNT(*) AS c
FROM impressions
GROUP BY 
  TUMBLE(serve_time, INTERVAL '60' SECOND), 
  campaign_id, 
  creative_details
ORDER BY window_end, c DESC;

The results from this query appear as soon as results are available.
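The grouping logic behind a tumbling window can be sketched in plain Python: each event falls into exactly one fixed-size, non-overlapping window, determined by rounding its timestamp down to a multiple of the window size. The sketch below keys results by window start rather than by the window end that the query returns, and the function names are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Iterable, Tuple

EPOCH = datetime(1970, 1, 1)


def window_start(event_time: datetime, size: timedelta) -> datetime:
    """Return the start of the fixed-size window that contains event_time."""
    return event_time - (event_time - EPOCH) % size


def count_impressions(
    events: Iterable[Tuple[datetime, str, str]],
    size: timedelta = timedelta(seconds=60),
) -> Counter:
    """Count events per (window start, campaign_id, creative_details)."""
    counts: Counter = Counter()
    for serve_time, campaign_id, creative_details in events:
        key = (window_start(serve_time, size), campaign_id, creative_details)
        counts[key] += 1
    return counts


sample_events = [
    (datetime(2021, 7, 10, 5, 34, 56), "1443403873", "orchid"),
    (datetime(2021, 7, 10, 5, 34, 59), "1443403873", "orchid"),
    (datetime(2021, 7, 10, 5, 35, 1), "1443403873", "orchid"),
]
for key, count in sorted(count_impressions(sample_events).items()):
    print(key, count)
```

The first two events share the window starting at 05:34:00, while the third falls into the next window.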

Additionally, we want to see the clickthrough rate of the impressions:

SELECT 
  bid_id, 
  campaign_id,
  country_code, 
  creative_details, 
  CAST(serve_time AS TIMESTAMP) AS serveTime, 
  tracker,
  CAST(click_time AS TIMESTAMP) AS clickTime,
  CASE
     WHEN `click_time` IS NULL THEN FALSE
     WHEN `click_time` IS NOT NULL THEN TRUE
  END AS clicked
FROM  impressions 
LEFT OUTER JOIN clicks 
  ON bid_id = correlation_id AND
  click_time BETWEEN  serve_time AND 
  serve_time + INTERVAL '2' MINUTE ;

This query produces one row for each impression and matches it with a click (if any) that was observed within 2 minutes after serving the ad. This is essentially performing a join operation across the topics to get this information.
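The join condition can be illustrated with a small in-memory sketch in plain Python. It mirrors the inclusive BETWEEN bound and the left outer semantics: an impression without a qualifying click still produces a row, and, as in SQL, an impression with several qualifying clicks would produce one row per click. The function name and tracker values are hypothetical placeholders; the IDs and times come from the sample records.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, List


def join_clicks(
    impressions: List[Dict[str, Any]],
    clicks: List[Dict[str, Any]],
    max_delay: timedelta = timedelta(minutes=2),
) -> List[Dict[str, Any]]:
    """Left outer join impressions to clicks seen within max_delay."""
    rows = []
    for imp in impressions:
        matches = [
            click
            for click in clicks
            if click["correlation_id"] == imp["bid_id"]
            and imp["serve_time"] <= click["click_time"] <= imp["serve_time"] + max_delay
        ]
        if not matches:
            # No qualifying click: keep the impression with empty click fields.
            rows.append({**imp, "tracker": None, "click_time": None, "clicked": False})
        for click in matches:
            rows.append(
                {
                    **imp,
                    "tracker": click["tracker"],
                    "click_time": click["click_time"],
                    "clicked": True,
                }
            )
    return rows


sample_impressions = [
    {"bid_id": "0868262269", "serve_time": datetime(2021, 7, 10, 5, 34, 56)},
    {"bid_id": "1025543335", "serve_time": datetime(2021, 7, 10, 5, 34, 56)},
]
sample_clicks = [
    # This click precedes the impression, so it does not qualify.
    {"correlation_id": "0868262269", "tracker": "tracker-a",
     "click_time": datetime(2021, 7, 10, 5, 34, 55)},
    {"correlation_id": "0868262269", "tracker": "tracker-b",
     "click_time": datetime(2021, 7, 10, 5, 35, 1)},
]
for row in join_clicks(sample_impressions, sample_clicks):
    print(row["bid_id"], row["clicked"], row["tracker"])
```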

You can insert this data back into an existing Kafka topic using the following code:

INSERT INTO clickthroughrate 
SELECT 
  bid_id, 
  campaign_id,
  country_code, 
  creative_details, 
  CAST(serve_time AS TIMESTAMP WITHOUT TIME ZONE) AS serveTime, 
  tracker,
  CAST(click_time AS TIMESTAMP WITHOUT TIME ZONE) AS clickTime,
  CASE
     WHEN `click_time` IS NULL THEN FALSE
     WHEN `click_time` IS NOT NULL THEN TRUE
  END AS clicked
FROM  impressions 
LEFT OUTER JOIN clicks 
  ON bid_id = correlation_id AND
  click_time BETWEEN  serve_time AND 
  serve_time + INTERVAL '2' MINUTE ;

Create the corresponding table for the Kafka topic in the Data Catalog if it doesn’t exist already. After you run the preceding query, you can see data in your Amazon MSK topic (see the following sample):

1095810839,1911670336,KH,"mint green","2021-06-15 01:08:00","ainhpsm6vxgs4gvyl52v13s173gntd7jyitlq328qmam37rpbs2tj1il049dlyb2vgwx89dbvwezl2vkcynqvlqfql7pxp8blg6807yxy1y54eedwff2nuhrbqhce36j00mbxdh72fpjmztymobq79y1g3xoyr6f09rgwqna1kbejkjw4nfddmm0d56g3mkd8obrrzo81z0ktu934a00b04e9q0h1krapotnon76rk0pmw6gr8c24wydp0b2yls","2021-06-15 01:08:07",true
0946058105,1913684520,GP,magenta,"2021-06-15 01:07:56","7mlkc1qm9ntazr7znfn9msew75xs9tf2af96ys8638l745t2hxwnmekaft735xdcuq4xtynpxr68orw5gmbrhr9zyevhawjwfbvzhlmziao3qs1grsb5rdzysvr5663qg2eqi5p7braruyb6rhyxkf4x3q5djo7e1jd5t91ybop0cxu4zqmwkq7x8l7c4y33kd4gwd4g0jmm1hy1df443gdq5tnj8m1qaymr0q9gatqt7jg61cznql0z6ix8pyr","2021-06-15 01:08:07",true
0920672086,0888784120,CK,silver,"2021-06-15 01:08:03","gqr76xyhu2dmtwpv9k3gxihvmn7rluqblh39gcrfyejt0w8jwwliq24okxkho1zuyxdw9mp4vzwi0nd4s5enhvm2d74eydtqnmf7fm4jsyuhauhh3d32esc8gzpbwkgs8yymlp22ih6kodrpjj2bayh4bjebcoeb42buzb43ii1e0zv19bxb8suwg17ut2mdhj4vmf8g9jl02p2tthe9w3rpv7w9w16d14bstiiviy4wcf86adfpz378a49f36q","2021-06-15 01:08:16",true

This is the CSV data from the preceding query, which shows the clickthrough rate for the impressions. You can use this mechanism to write data from Flink back into Kafka for persistent storage.

Scala

We use the %flink header to signify that this code block will be interpreted via the Scala Flink interpreter, and create a table identical to the one from the SQL example. However, in this example, we use the Scala interpreter’s built-in streaming table environment variable, stenv, to run a SQL DDL statement. If the table already exists in the AWS Glue Data Catalog, this statement issues an error stating that the table already exists.

%flink 
stenv.executeSql("""CREATE TABLE impressions (
  bid_id VARCHAR,
  creative_details VARCHAR(10),
  campaign_id VARCHAR,
  country_code VARCHAR(5),
  i_timestamp VARCHAR,
  serve_time as TO_TIMESTAMP (`i_timestamp`, 'EEE MMM dd HH:mm:ss z yyyy'),
  WATERMARK FOR serve_time AS serve_time -INTERVAL '5' SECOND
  )
  WITH (
  'connector'= 'kafka',
  'topic' = 'impressions',
  'properties.bootstrap.servers' = '< Bootstrap Servers shown in the MSK client info dialog >',
  'format' = 'json',
  'properties.group.id' = 'testGroup1',
  'scan.startup.mode'= 'earliest-offset',
  'json.timestamp-format.standard'= 'ISO-8601'
  )""")

stenv.executeSql("""
 CREATE TABLE clicks (
 correlation_id VARCHAR,
 tracker VARCHAR(100),
 c_timestamp VARCHAR,
 click_time as TO_TIMESTAMP (`c_timestamp`, 'EEE MMM dd HH:mm:ss z yyyy'),
 WATERMARK FOR click_time AS click_time -INTERVAL '5' SECOND
 )
 WITH (
 'connector'= 'kafka',
 'topic' = 'clicks',
 'properties.bootstrap.servers' = '< Bootstrap Servers shown in the MSK client info dialog >',
 'format' = 'json',
 'properties.group.id' = 'testGroup1',
 'scan.startup.mode'= 'earliest-offset',
 'json.timestamp-format.standard'= 'ISO-8601'
 )""")

Performing a tumbling window in the Scala table API first requires the definition of an in-memory reference to the table we created. We use the stenv variable to define this table using the from function and referencing the table name. After this is created, we can create a windowed aggregation over 1 minute of data according to the serve_time column. See the following code:

%flink
val inputTable: Table = stenv.from("impressions")
val tumblingWindowTable = inputTable.window(Tumble over 1.minute on $"serve_time" as $"oneMinuteWindow")
.groupBy( $"oneMinuteWindow", $"campaign_id",$"creative_details")
.select($"campaign_id", $"creative_details", $"oneMinuteWindow".rowtime as "window_end",$"creative_details".count as "c")

Use the ZeppelinContext to visualize the Scala table aggregation within the notebook:

%flink
z.show(tumblingWindowTable, streamType="update")

The following screenshot shows our results.

Additionally, we want to see the clickthrough rate of the impressions by joining with the clicks:

val left:Table = stenv.from("impressions").select("bid_id,campaign_id,country_code,creative_details,serve_time")
val right:Table = stenv.from("clicks").select("correlation_id,tracker,click_time")
val result:Table = left.leftOuterJoin(right).where($"bid_id" === $"correlation_id" && $"click_time" < ( $"serve_time" + 2.minutes) && $"click_time" > $"serve_time").select($"bid_id", $"campaign_id", $"country_code",$"creative_details",$"tracker",$"serve_time".cast(Types.SQL_TIMESTAMP) as "s_time", $"click_time".cast(Types.SQL_TIMESTAMP) as "c_time" , $"click_time".isNull.?("false","true") as "clicked" )

Use the ZeppelinContext to visualize the Scala table aggregation within the notebook.

z.show(result, streamType="update")

The following screenshot shows our results.

Python

We use the %flink.pyflink header to signify that this code block will be interpreted via the Python Flink interpreter, and create a table identical to the one from the SQL and Scala examples. In this example, we use the Python interpreter’s built-in streaming table environment variable, st_env, to run a SQL DDL statement. If the table already exists in the AWS Glue Data Catalog, this statement issues an error stating that the table already exists.

%flink.pyflink
st_env.execute_sql("""
 CREATE TABLE impressions (
 bid_id VARCHAR,
 creative_details VARCHAR(10),
 campaign_id VARCHAR,
 country_code VARCHAR(5),
 i_timestamp VARCHAR,
 serve_time as TO_TIMESTAMP (`i_timestamp`, 'EEE MMM dd HH:mm:ss z yyyy'),
 WATERMARK FOR serve_time AS serve_time -INTERVAL '5' SECOND
 )
 WITH (
 'connector'= 'kafka',
 'topic' = 'impressions',
 'properties.bootstrap.servers' = '< Bootstrap Servers shown in the MSK client info dialog >',
 'format' = 'json',
 'properties.group.id' = 'testGroup1',
 'scan.startup.mode'= 'earliest-offset',
 'json.timestamp-format.standard'= 'ISO-8601'
 )""")
 
st_env.execute_sql("""
 CREATE TABLE clicks (
 correlation_id VARCHAR,
 tracker VARCHAR(100),
 c_timestamp VARCHAR,
 click_time as TO_TIMESTAMP (`c_timestamp`, 'EEE MMM dd HH:mm:ss z yyyy'),
 WATERMARK FOR click_time AS click_time -INTERVAL '5' SECOND
 )
 WITH (
 'connector'= 'kafka',
 'topic' = 'clicks',
 'properties.bootstrap.servers' = '< Bootstrap Servers shown in the MSK client info dialog >',
 'format' = 'json',
 'properties.group.id' = 'testGroup1',
 'scan.startup.mode'= 'earliest-offset',
 'json.timestamp-format.standard'= 'ISO-8601'
 )""")

Performing a tumbling window in the Python table API first requires the definition of an in-memory reference to the table we created. We use the st_env variable to define this table using the from_path function and referencing the table name. After this is created, we can create a windowed aggregation over 1 minute of data according to the serve_time column. See the following code:

%flink.pyflink

input_table = st_env.from_path("impressions")
tumbling_window_table =(input_table.window(Tumble.over("1.minute").on("serve_time").alias("one_minute_window"))
.group_by( "one_minute_window, campaign_id, creative_details")
.select("campaign_id, creative_details, one_minute_window.end as window_end, creative_details.count as c"))

Use the ZeppelinContext to visualize the Python table aggregation within the notebook:

%flink.pyflink

z.show(tumbling_window_table, stream_type="update")

The following screenshot shows our results.

Additionally, we want to see the clickthrough rate of the impressions by joining with the clicks:

impressions = st_env.from_path("impressions").select("bid_id,campaign_id,country_code,creative_details,serve_time")
clicks = st_env.from_path("clicks").select("correlation_id,tracker,click_time")
results = impressions.left_outer_join(clicks).where("bid_id == correlation_id && click_time < (serve_time + 2.minutes) && click_time > serve_time").select("bid_id, campaign_id, country_code, creative_details, tracker, serve_time.cast(STRING) as s_time, click_time.cast(STRING) as c_time, (click_time.isNull).?('false','true') as clicked")

Scaling

A Studio notebook consists of one or more tasks. You can split a Studio notebook task into several parallel instances to run, where each parallel instance processes a subset of the task’s data. The number of parallel instances of a task is called its parallelism, and adjusting that helps run your tasks more efficiently.

On creation, Studio notebooks are given four parallel Kinesis Processing Units (KPUs), which make up the application parallelism. To increase that parallelism, navigate to the Kinesis Data Analytics console, choose your application name, and choose the Configuration tab.

From this page, in the Scaling section, choose Edit and modify the Parallelism entry. We don’t recommend increasing the Parallelism Per KPU setting higher than 1 unless your application is I/O bound.

Choose Save changes to increase or decrease your application’s parallelism.
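The relationship between these two settings and the number of KPUs can be sketched as simple arithmetic. The sketch assumes that the allocated KPUs equal Parallelism divided by Parallelism Per KPU, rounded up; it does not model billing details, and the function name required_kpus is hypothetical.

```python
import math


def required_kpus(parallelism: int, parallelism_per_kpu: int = 1) -> int:
    """Estimate the KPUs needed for a given application parallelism."""
    if parallelism < 1 or parallelism_per_kpu < 1:
        raise ValueError("parallelism settings must be positive integers")
    return math.ceil(parallelism / parallelism_per_kpu)


# Default notebook settings described above: parallelism 4, one task per KPU.
print(required_kpus(4))
# Doubling parallelism while packing two tasks per KPU keeps the KPU count.
print(required_kpus(8, parallelism_per_kpu=2))
```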

Clean up

You may want to clean up the demo environment when you are done. To do so, stop the Studio notebook and delete the resources created for the Data Generator and the Amazon MSK cluster (if you created a new cluster).

Summary

Kinesis Data Analytics Studio makes developing stream processing applications using Apache Flink much faster, with rich visualizations, a scalable and user-friendly interface to develop pipelines, and the flexibility of language choice to make any streaming workload performant and powerful. You can run paragraphs from within the notebook or promote your Studio notebook to a Kinesis Data Analytics for Apache Flink application with a durable state, as shown in the SQL example in this post.


About the Author

Chinmayi Narasimhadevara is a Solutions Architect focused on Big Data and Analytics at Amazon Web Services. Chinmayi has over 15 years of experience in information technology. She helps AWS customers build advanced, highly scalable and performant solutions.

Introducing Amazon MSK Connect – Stream Data to and from Your Apache Kafka Clusters Using Managed Connectors

2021-09-17 Danilo Poccia

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/introducing-amazon-msk-connect-stream-data-to-and-from-your-apache-kafka-clusters-using-managed-connectors/

Apache Kafka is an open-source platform for building real-time streaming data pipelines and applications. At re:Invent 2018, we announced Amazon Managed Streaming for Apache Kafka, a fully managed service that makes it easy to build and run applications that use Apache Kafka to process streaming data.

When you use Apache Kafka, you capture real-time data from sources such as IoT devices, database change events, and website clickstreams, and deliver it to destinations such as databases and persistent storage.

Kafka Connect is an open-source component of Apache Kafka that provides a framework for connecting with external systems such as databases, key-value stores, search indexes, and file systems. However, manually running Kafka Connect clusters requires you to plan and provision the required infrastructure, deal with cluster operations, and scale it in response to load changes.

Today, we’re announcing a new capability that makes it easier to manage Kafka Connect clusters. MSK Connect allows you to configure and deploy a connector using Kafka Connect with just a few clicks. MSK Connect provisions the required resources and sets up the cluster. It continuously monitors the health and delivery state of connectors, patches and manages the underlying hardware, and auto-scales connectors to match changes in throughput. As a result, you can focus your resources on building applications rather than managing infrastructure.

MSK Connect is fully compatible with Kafka Connect, which means you can migrate your existing connectors without code changes. You don’t need an MSK cluster to use MSK Connect. It supports Amazon MSK, Apache Kafka, and Apache Kafka compatible clusters as sources and sinks. These clusters can be self-managed or managed by AWS partners and third parties, as long as MSK Connect can privately connect to the clusters.

Using MSK Connect with Amazon Aurora and Debezium
To test MSK Connect, I want to use it to stream data change events from one of my databases. To do so, I use Debezium, an open-source distributed platform for change data capture built on top of Apache Kafka.

I use a MySQL-compatible Amazon Aurora database as the source and the Debezium MySQL connector with the setup described in this architectural diagram:

To use my Aurora database with Debezium, I need to turn on binary logging in the DB cluster parameter group. I follow the steps in the How do I turn on binary logging for my Amazon Aurora MySQL cluster article.

Next, I have to create a custom plugin for MSK Connect. A custom plugin is a set of JAR files that contain the implementation of one or more connectors, transforms, or converters. Amazon MSK will install the plugin on the workers of the connect cluster where the connector is running.

From the Debezium website, I download the MySQL connector plugin for the latest stable release. Because MSK Connect accepts custom plugins in ZIP or JAR format, I convert the downloaded archive to ZIP format and keep the JAR files in the main directory:

$ tar xzf debezium-connector-mysql-1.6.1.Final-plugin.tar.gz
$ cd debezium-connector-mysql
$ zip -9 ../debezium-connector-mysql-1.6.1.zip *
$ cd ..
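
If you prefer to script the repackaging, the same idea can be sketched with the Python standard library. This is an illustration rather than part of the original walkthrough: the function name `repackage_plugin` is hypothetical, and it flattens every file in the archive into the root of the ZIP file, which matches the shell commands above only when the archive has a single top-level directory.

```python
import tarfile
import zipfile
from pathlib import PurePosixPath

def repackage_plugin(tar_gz_path, zip_path):
    """Copy every regular file from a .tar.gz archive into the root of a ZIP file.

    Returns the list of file names written to the ZIP archive.
    """
    written = []
    with tarfile.open(tar_gz_path, "r:gz") as tar, zipfile.ZipFile(
        zip_path, "w", zipfile.ZIP_DEFLATED, compresslevel=9
    ) as zf:
        for member in tar.getmembers():
            if not member.isfile():
                continue  # skip directories and links
            flat_name = PurePosixPath(member.name).name  # drop the directory part
            zf.writestr(flat_name, tar.extractfile(member).read())
            written.append(flat_name)
    return written
```

For example, `repackage_plugin("debezium-connector-mysql-1.6.1.Final-plugin.tar.gz", "debezium-connector-mysql-1.6.1.zip")` would produce a ZIP file with the JAR files at the top level.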

Then, I use the AWS Command Line Interface (CLI) to upload the custom plugin to an Amazon Simple Storage Service (Amazon S3) bucket in the same AWS Region I am using for MSK Connect:

$ aws s3 cp debezium-connector-mysql-1.6.1.zip s3://my-bucket/path/

On the Amazon MSK console there is a new MSK Connect section. I look at the connectors and choose Create connector. Then, I create a custom plugin and browse my S3 buckets to select the custom plugin ZIP file I uploaded before.

I enter a name and a description for the plugin and then choose Next.

Now that the configuration of the custom plugin is complete, I start the creation of the connector. I enter a name and a description for the connector.

I have the option to use a self-managed Apache Kafka cluster or one that is managed by MSK. I select one of my MSK clusters that is configured to use IAM authentication. The MSK cluster I select is in the same virtual private cloud (VPC) as my Aurora database. To connect, the MSK cluster and Aurora database use the default security group for the VPC. For simplicity, I use a cluster configuration with auto.create.topics.enable set to true.

In Connector configuration, I use the following settings:

connector.class=io.debezium.connector.mysql.MySqlConnector
tasks.max=1
database.hostname=<aurora-database-writer-instance-endpoint>
database.port=3306
database.user=my-database-user
database.password=my-secret-password
database.server.id=123456
database.server.name=ecommerce-server
database.include.list=ecommerce
database.history.kafka.topic=dbhistory.ecommerce
database.history.kafka.bootstrap.servers=<bootstrap servers>
database.history.consumer.security.protocol=SASL_SSL
database.history.consumer.sasl.mechanism=AWS_MSK_IAM
database.history.consumer.sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required;
database.history.consumer.sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler
database.history.producer.security.protocol=SASL_SSL
database.history.producer.sasl.mechanism=AWS_MSK_IAM
database.history.producer.sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required;
database.history.producer.sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler
include.schema.changes=true

Some of these settings are generic and should be specified for any connector. For example:

connector.class is the Java class of the connector.
tasks.max is the maximum number of tasks that should be created for this connector.

Other settings are specific to the Debezium MySQL connector:

The database.hostname contains the writer instance endpoint of my Aurora database.
The database.server.name is a logical name of the database server. It is used for the names of the Kafka topics created by Debezium.
The database.include.list contains the list of databases hosted by the specified server.
The database.history.kafka.topic is a Kafka topic used internally by Debezium to track database schema changes.
The database.history.kafka.bootstrap.servers contains the bootstrap servers of the MSK cluster.
The eight database.history.consumer.* and database.history.producer.* settings enable IAM authentication to access the database history topic.
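
The consumer and producer settings for the history topic repeat the same four IAM client properties under two prefixes. The following sketch shows one way to generate them instead of typing them twice. The helper names `history_iam_settings` and `to_properties` are assumptions for illustration; the property names and values are the ones from the configuration above.

```python
# The four Kafka client properties that enable IAM authentication,
# copied from the connector configuration above.
IAM_CLIENT_SETTINGS = {
    "security.protocol": "SASL_SSL",
    "sasl.mechanism": "AWS_MSK_IAM",
    "sasl.jaas.config": "software.amazon.msk.auth.iam.IAMLoginModule required;",
    "sasl.client.callback.handler.class":
        "software.amazon.msk.auth.iam.IAMClientCallbackHandler",
}

def history_iam_settings():
    """Return the IAM settings under both database.history.* prefixes."""
    return {
        f"database.history.{role}.{key}": value
        for role in ("consumer", "producer")
        for key, value in IAM_CLIENT_SETTINGS.items()
    }

def to_properties(settings):
    """Render a dict as key=value lines, the format the console expects."""
    return "\n".join(f"{key}={value}" for key, value in settings.items())

print(to_properties(history_iam_settings()))
```

Generating both groups from a single dictionary keeps the consumer and producer settings from drifting apart when one of them is edited later.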

In Connector capacity, I can choose between autoscaled or provisioned capacity. For this setup, I choose Autoscaled and leave all other settings at their defaults.

With autoscaled capacity, I can configure these parameters:

MSK Connect Unit (MCU) count per worker – Each MCU provides 1 vCPU of compute and 4 GB of memory.
The minimum and maximum number of workers.
Autoscaling utilization thresholds – The upper and lower target utilization thresholds on MCU consumption in percentage to trigger auto scaling.

There is a summary of the minimum and maximum MCUs, memory, and network bandwidth for the connector.
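
The MCU and memory figures in that summary follow directly from the per-MCU sizes stated above. Here is a minimal sketch of the arithmetic, assuming a hypothetical helper named `capacity_range`; network bandwidth is left out because this post does not give a per-MCU figure for it.

```python
VCPU_PER_MCU = 1        # from the post: each MCU provides 1 vCPU
MEMORY_GB_PER_MCU = 4   # from the post: each MCU provides 4 GB of memory

def capacity_range(mcu_per_worker, min_workers, max_workers):
    """Compute the minimum and maximum totals for an autoscaled connector."""
    if mcu_per_worker < 1 or not 1 <= min_workers <= max_workers:
        raise ValueError("expected mcu_per_worker >= 1 and 1 <= min_workers <= max_workers")

    def totals(workers):
        mcus = mcu_per_worker * workers
        return {
            "mcus": mcus,
            "vcpus": mcus * VCPU_PER_MCU,
            "memory_gb": mcus * MEMORY_GB_PER_MCU,
        }

    return {"min": totals(min_workers), "max": totals(max_workers)}
```

For instance, one MCU per worker with one to two workers (illustrative values, not necessarily the console defaults) gives a range of 1 to 2 MCUs and 4 to 8 GB of memory.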

For Worker configuration, you can use the default one provided by Amazon MSK or provide your own configuration. In my setup, I use the default one.

In Access permissions, I create an IAM role. In the trusted entities, I add kafkaconnect.amazonaws.com to allow MSK Connect to assume the role.

The role is used by MSK Connect to interact with the MSK cluster and other AWS services. For my setup, I add:

Permissions to write logs to an Amazon CloudWatch log group I created earlier.
Permissions to authenticate to my MSK cluster through IAM.

The Debezium connector needs access to the cluster configuration to find the replication factor to use to create the history topic. For this reason, I add to the permissions policy the kafka-cluster:DescribeClusterDynamicConfiguration action (equivalent to Apache Kafka’s DESCRIBE_CONFIGS cluster ACL).

Depending on your configuration, you might need to add more permissions to the role (for example, in case the connector needs access to other AWS resources such as an S3 bucket). If that is the case, you should add permissions before creating the connector.
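
As an illustration of the role described above, the following sketch builds the trust policy and a small part of the permissions policy as Python dictionaries and prints them as JSON. It is deliberately incomplete: it covers only the service principal and the two cluster-level actions (connecting with IAM and describing the dynamic configuration), and it leaves out the CloudWatch Logs and topic-level permissions a real connector needs. The function names and the placeholder cluster ARN are assumptions.

```python
import json

def trust_policy():
    """Trust policy that lets MSK Connect assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "kafkaconnect.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }

def cluster_permissions(cluster_arn):
    """Cluster-level permissions only; topic and log permissions are omitted."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": [
                "kafka-cluster:Connect",
                "kafka-cluster:DescribeClusterDynamicConfiguration",
            ],
            "Resource": cluster_arn,
        }],
    }

# Placeholder ARN: replace every <...> part with your own values.
example_arn = "arn:aws:kafka:<region>:<account-id>:cluster/<cluster-name>/<cluster-uuid>"
print(json.dumps(trust_policy(), indent=2))
print(json.dumps(cluster_permissions(example_arn), indent=2))
```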

In Security, the settings for authentication and encryption in transit are taken from the MSK cluster.

In Logs, I choose to deliver logs to CloudWatch Logs to have more information on the execution of the connector. By using CloudWatch Logs, I can easily manage retention and interactively search and analyze my log data with CloudWatch Logs Insights. I enter the log group ARN (it’s the same log group I used before in the IAM role) and then choose Next.

I review the settings and then choose Create connector. After a few minutes, the connector is running.

Testing MSK Connect with Amazon Aurora and Debezium
Now let’s test the architecture I just set up. I start an Amazon Elastic Compute Cloud (Amazon EC2) instance to update the database and start a couple of Kafka consumers to see Debezium in action. To be able to connect to both the MSK cluster and the Aurora database, I use the same VPC and assign the default security group. I also add another security group that gives me SSH access to the instance.

I download a binary distribution of Apache Kafka and extract the archive in the home directory:

$ tar xvf kafka_2.13-2.7.1.tgz

To use IAM to authenticate with the MSK cluster, I follow the instructions in the Amazon MSK Developer Guide to configure clients for IAM access control. I download the latest stable release of the Amazon MSK Library for IAM:

$ wget https://github.com/aws/aws-msk-iam-auth/releases/download/1.1.0/aws-msk-iam-auth-1.1.0-all.jar

In the ~/kafka_2.13-2.7.1/config/ directory I create a client-config.properties file to configure a Kafka client to use IAM authentication:

# Sets up TLS for encryption and SASL for authN.
security.protocol = SASL_SSL

# Identifies the SASL mechanism to use.
sasl.mechanism = AWS_MSK_IAM

# Binds SASL client implementation.
sasl.jaas.config = software.amazon.msk.auth.iam.IAMLoginModule required;

# Encapsulates constructing a SigV4 signature based on extracted credentials.
# The SASL client bound by "sasl.jaas.config" invokes this class.
sasl.client.callback.handler.class = software.amazon.msk.auth.iam.IAMClientCallbackHandler

I add a few lines to my Bash profile to:

Add Kafka binaries to the PATH.
Add the MSK Library for IAM to the CLASSPATH.
Create the BOOTSTRAP_SERVERS environment variable to store the bootstrap servers of my MSK cluster.

$ cat >> ~/.bash_profile
export PATH=~/kafka_2.13-2.7.1/bin:$PATH
export CLASSPATH=/home/ec2-user/aws-msk-iam-auth-1.1.0-all.jar
export BOOTSTRAP_SERVERS=<bootstrap servers>

Then, I open three terminal connections to the instance.

In the first terminal connection, I start a Kafka consumer for a topic with the same name as the database server (ecommerce-server). This topic is used by Debezium to stream schema changes (for example, when a new table is created).

$ cd ~/kafka_2.13-2.7.1/
$ kafka-console-consumer.sh --bootstrap-server $BOOTSTRAP_SERVERS \
                            --consumer.config config/client-config.properties \
                            --topic ecommerce-server --from-beginning

In the second terminal connection, I start another Kafka consumer for a topic with a name built by concatenating the database server (ecommerce-server), the database (ecommerce), and the table (orders). This topic is used by Debezium to stream data changes for the table (for example, when a new record is inserted).

$ cd ~/kafka_2.13-2.7.1/
$ kafka-console-consumer.sh --bootstrap-server $BOOTSTRAP_SERVERS \
                            --consumer.config config/client-config.properties \
                            --topic ecommerce-server.ecommerce.orders --from-beginning
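
The topic naming rule described above can be written down as a small helper. This is a sketch for illustration; the function name `data_change_topic` is an assumption, and the rule it encodes is the one stated in this post (server name, database, and table joined with dots).

```python
def data_change_topic(server_name, database, table):
    """Build the name of the topic that carries data changes for one table."""
    parts = (server_name, database, table)
    if not all(parts):
        raise ValueError("server name, database, and table must be non-empty")
    return ".".join(parts)

# The schema change topic simply uses the logical server name.
print(data_change_topic("ecommerce-server", "ecommerce", "orders"))
```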

In the third terminal connection, I install a MySQL client using the MariaDB package and connect to the Aurora database:

$ sudo yum install mariadb
$ mysql -h <aurora-database-writer-instance-endpoint> -u <database-user> -p

From this connection, I create the ecommerce database and a table for my orders:

CREATE DATABASE ecommerce;

USE ecommerce

CREATE TABLE orders (
       order_id VARCHAR(255),
       customer_id VARCHAR(255),
       item_description VARCHAR(255),
       price DECIMAL(6,2),
       order_date DATETIME DEFAULT CURRENT_TIMESTAMP
);

These database changes are captured by the Debezium connector managed by MSK Connect and are streamed to the MSK cluster. In the first terminal, consuming the topic with schema changes, I see the information on the creation of database and table:

Struct{source=Struct{version=1.6.1.Final,connector=mysql,name=ecommerce-server,ts_ms=1629202831473,db=ecommerce,server_id=1980402433,file=mysql-bin-changelog.000003,pos=9828,row=0},databaseName=ecommerce,ddl=CREATE DATABASE ecommerce,tableChanges=[]}
Struct{source=Struct{version=1.6.1.Final,connector=mysql,name=ecommerce-server,ts_ms=1629202878811,db=ecommerce,table=orders,server_id=1980402433,file=mysql-bin-changelog.000003,pos=10002,row=0},databaseName=ecommerce,ddl=CREATE TABLE orders ( order_id VARCHAR(255), customer_id VARCHAR(255), item_description VARCHAR(255), price DECIMAL(6,2), order_date DATETIME DEFAULT CURRENT_TIMESTAMP ),tableChanges=[Struct{type=CREATE,id="ecommerce"."orders",table=Struct{defaultCharsetName=latin1,primaryKeyColumnNames=[],columns=[Struct{name=order_id,jdbcType=12,typeName=VARCHAR,typeExpression=VARCHAR,charsetName=latin1,length=255,position=1,optional=true,autoIncremented=false,generated=false}, Struct{name=customer_id,jdbcType=12,typeName=VARCHAR,typeExpression=VARCHAR,charsetName=latin1,length=255,position=2,optional=true,autoIncremented=false,generated=false}, Struct{name=item_description,jdbcType=12,typeName=VARCHAR,typeExpression=VARCHAR,charsetName=latin1,length=255,position=3,optional=true,autoIncremented=false,generated=false}, Struct{name=price,jdbcType=3,typeName=DECIMAL,typeExpression=DECIMAL,length=6,scale=2,position=4,optional=true,autoIncremented=false,generated=false}, Struct{name=order_date,jdbcType=93,typeName=DATETIME,typeExpression=DATETIME,position=5,optional=true,autoIncremented=false,generated=false}]}}]}

Then, I go back to the database connection in the third terminal to insert a few records in the orders table:

INSERT INTO orders VALUES ("123456", "123", "A super noisy mechanical keyboard", "50.00", "2021-08-16 10:11:12");
INSERT INTO orders VALUES ("123457", "123", "An extremely wide monitor", "500.00", "2021-08-16 11:12:13");
INSERT INTO orders VALUES ("123458", "123", "A too sensible microphone", "150.00", "2021-08-16 12:13:14");

In the second terminal, I see the information on the records inserted into the orders table:

Struct{after=Struct{order_id=123456,customer_id=123,item_description=A super noisy mechanical keyboard,price=50.00,order_date=1629108672000},source=Struct{version=1.6.1.Final,connector=mysql,name=ecommerce-server,ts_ms=1629202993000,db=ecommerce,table=orders,server_id=1980402433,file=mysql-bin-changelog.000003,pos=10464,row=0},op=c,ts_ms=1629202993614}
Struct{after=Struct{order_id=123457,customer_id=123,item_description=An extremely wide monitor,price=500.00,order_date=1629112333000},source=Struct{version=1.6.1.Final,connector=mysql,name=ecommerce-server,ts_ms=1629202993000,db=ecommerce,table=orders,server_id=1980402433,file=mysql-bin-changelog.000003,pos=10793,row=0},op=c,ts_ms=1629202993621}
Struct{after=Struct{order_id=123458,customer_id=123,item_description=A too sensible microphone,price=150.00,order_date=1629115994000},source=Struct{version=1.6.1.Final,connector=mysql,name=ecommerce-server,ts_ms=1629202993000,db=ecommerce,table=orders,server_id=1980402433,file=mysql-bin-changelog.000003,pos=11114,row=0},op=c,ts_ms=1629202993630}
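
Note that order_date arrives as a number rather than as the DATETIME string I inserted. The values are milliseconds since the Unix epoch, which the following sketch converts back to a readable timestamp. The helper name `millis_to_utc` is an assumption, and the conversion treats the value as UTC.

```python
from datetime import datetime, timezone

def millis_to_utc(epoch_millis):
    """Convert milliseconds since the Unix epoch to a UTC datetime."""
    return datetime.fromtimestamp(epoch_millis / 1000, tz=timezone.utc)

# order_date values copied from the three change events above.
for order_date in (1629108672000, 1629112333000, 1629115994000):
    print(millis_to_utc(order_date).strftime("%Y-%m-%d %H:%M:%S"))
```

The three values map back to 2021-08-16 10:11:12, 11:12:13, and 12:13:14, the timestamps used in the INSERT statements.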

My change data capture architecture is up and running and the connector is fully managed by MSK Connect.

Availability and Pricing
MSK Connect is available in the following AWS Regions: Asia Pacific (Mumbai), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Canada (Central), EU (Frankfurt), EU (Ireland), EU (London), EU (Paris), EU (Stockholm), South America (Sao Paulo), US East (N. Virginia), US East (Ohio), US West (N. California), US West (Oregon). For more information, see the AWS Regional Services List.

With MSK Connect you pay for what you use. The resources used by your connectors can be scaled automatically based on your workload. For more information, see the Amazon MSK pricing page.

Simplify the management of your Apache Kafka connectors today with MSK Connect.

— Danilo

Power your Kafka Streams application with Amazon MSK and AWS Fargate

2021-08-10 Karen Grygoryan

Post Syndicated from Karen Grygoryan original https://aws.amazon.com/blogs/big-data/power-your-kafka-streams-application-with-amazon-msk-and-aws-fargate/

Today, companies of all sizes across all verticals design and build event-driven architectures centered around real-time streaming and stream processing. Amazon Managed Streaming for Apache Kafka (Amazon MSK) is a fully managed service that makes it easy for you to build and run applications that use Apache Kafka to process streaming and event data. Apache Kafka is an open-source platform for building real-time streaming data pipelines and applications. With Amazon MSK, you can continue to use native Apache Kafka APIs to build event-driven architectures, stream changes to and from databases, and power machine learning and analytics applications.

You can apply streaming in a wide range of industries and organizations, such as to capture and analyze data from IoT devices, track and monitor vehicles or shipments, monitor patients in medical facilities, or monitor financial transactions.

In this post, we walk through how to build a real-time stream processing application using Amazon MSK, AWS Fargate, and the Apache Kafka Streams API. The Kafka Streams API is a client library that simplifies development of stream applications. Behind the scenes, the Kafka Streams library is an abstraction over the standard Kafka Producer and Kafka Consumer APIs. When you build applications with the Kafka Streams library, your data streams are automatically made fault tolerant, and are transparently and elastically distributed over the instances of the applications. Kafka Streams applications are supported by Amazon MSK. Fargate is a serverless compute engine for containers that works with AWS container orchestration services like Amazon Elastic Container Service (Amazon ECS), which allows you to easily run, scale, and secure containerized applications.

We have chosen to run our Kafka Streams application on Fargate, because Fargate makes it easy for you to focus on building your applications. Fargate removes the need to provision and manage servers, lets you specify and pay for resources per application, and improves security through application isolation by design. Fargate allocates the right amount of compute, eliminating the need to choose instances and scale cluster capacity. You only pay for the resources required to run your containers, so there is no over-provisioning and paying for additional servers. Fargate runs each task or pod in its own kernel providing the tasks and pods their own isolated compute environment. This enables your application to have workload isolation and improved security by design.

Architecture overview

Our streaming application architecture consists of a stream producer, which connects to the Twitter Stream API, reads tweets, and publishes them to Amazon MSK. A Kafka Streams processor consumes these messages, performs window aggregation, writes the results to an output topic, and prints them to the logs. Both apps are hosted on Fargate.

The stream producer application connects to the Twitter API (a stream of sample tweets), reads the stream of tweets, extracts only hashtags, and publishes them to the MSK topic. The following is a code snippet from the application:

    var configs = new AppConfig();
    var kafkaService = new KafkaService(configs.kafkaProducer());
    var twitterService = new TwitterService(kafkaService, configs.httpClient());
    if (null != BEARER_TOKEN) {
      twitterService.connectStream(BEARER_TOKEN);
    } else {
      LOG.error(
          "There was a problem getting your bearer token. Please make sure you set the BEARER_TOKEN environment variable");
    }

The MSK cluster is spread across three Availability Zones, with one broker per Availability Zone. We use the AWS-recommended (as of this writing) version of Apache Kafka 2.6.1. Apache Kafka topics have a replication factor of three and three partitions, to take advantage of parallelism and resiliency.

Our consumer streaming app counts the Twitter hashtags that are longer than one character and have been mentioned more than four times in a 20-second window:

private static final TimeWindows WINDOW_20_SEC = of(ofSeconds(20)).grace(ofMillis(0));
private static final int MIN_MENTIONED_IN_WINDOW = 4;
private static final int MIN_CHAR_LENGTH = 1;
…
var tweetStream =
    paragraphStream
        .filter(
            (k, v) -> v.length() > MIN_CHAR_LENGTH) // keep hashtags longer than MIN_CHAR_LENGTH characters
        .mapValues((ValueMapper<String, String>) String::toLowerCase) // lowercase hashtags
        .mapValues(String::trim) // remove leading and trailing spaces
        .selectKey((k, v) -> v) // select hashtag as a key
        .groupByKey()
        .windowedBy(WINDOW_20_SEC) // apply 20 seconds window aggregation
        .count(with(String(), Long())) // count hashtags, materialized in state store as String & Long
        .suppress(untilWindowCloses(unbounded())) // emit only the final result per window; the buffer is unbounded in size (not recommended for production)
        .toStream()
        .map((k, v) -> new KeyValue<>(k.key(), v))
        .filter(
            (k, v) -> v > MIN_MENTIONED_IN_WINDOW); // keep hashtags mentioned more than MIN_MENTIONED_IN_WINDOW times
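
To see what this topology computes without running Kafka, the following plain-Python sketch applies the same filters and a 20-second tumbling window to a finite list of events. It is a batch approximation, not the Kafka Streams implementation: there is no state store, grace period, or suppression, and the function name `trending_hashtags` and the (timestamp, hashtag) input layout are assumptions.

```python
from collections import Counter

WINDOW_SECONDS = 20
MIN_MENTIONED_IN_WINDOW = 4
MIN_CHAR_LENGTH = 1

def trending_hashtags(events):
    """Count hashtags per tumbling window and keep the frequent ones.

    events: iterable of (timestamp_in_seconds, hashtag) pairs.
    Returns {(window_start, hashtag): count} for counts above the threshold.
    """
    counts = Counter()
    for timestamp, hashtag in events:
        if len(hashtag) <= MIN_CHAR_LENGTH:
            continue  # same test as the first filter in the topology
        key = hashtag.lower().strip()
        # Tumbling windows: each event belongs to exactly one window.
        window_start = int(timestamp // WINDOW_SECONDS) * WINDOW_SECONDS
        counts[(window_start, key)] += 1
    return {k: n for k, n in counts.items() if n > MIN_MENTIONED_IN_WINDOW}
```

A hashtag seen five times inside one window is reported, while five mentions split across two adjacent windows are not, because each window is counted independently.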

Prerequisites

Make sure to complete the following steps as prerequisites:

Create an AWS account. For this post, you configure the required AWS resources in the us-east-1 or us-west-2 Region. If you haven’t signed up, complete the following tasks:
1. Create an account. For instructions, see Sign Up for AWS.
2. Create an AWS Identity and Access Management (IAM) user. For instructions, see Create an IAM User.
Have a Bearer Token associated with your Twitter app. To create a developer account, see Get started with the Twitter developer platform.
Install Docker on your local machine.

Solution overview

To implement this solution, we complete the following steps:

Set up an MSK cluster and Amazon Elastic Container Registry (Amazon ECR).
Build and upload application JAR files to Amazon ECR.
Create an ECS cluster with a Fargate task and service definitions.
Run our streaming application.

Set up an MSK cluster and Amazon ECR

Use the provided AWS CloudFormation template to create the VPC (with other required network components), security groups, MSK cluster with required Kafka topics (twitter_input and twitter_output), and two Amazon ECR repositories, one per application.

Build and upload application JAR files to Amazon ECR

To build and upload the JAR files to Amazon ECR, complete the following steps:

Download the application code from the GitHub repo.
Build the applications by running the following command in the root of the project:

./gradlew clean build

Create your Docker images (kafka-streams-msk and twitter-stream-producer):

docker-compose build

Retrieve an authentication token and authenticate your Docker client to your registry. Use the following AWS Command Line Interface (AWS CLI) code:

aws ecr get-login-password --region <<region>> | docker login --username AWS --password-stdin <<account_id>>.dkr.ecr.<<region>>.amazonaws.com

Tag your images for the Amazon ECR repositories:

docker tag kafka-streams-msk:latest <<account_id>>.dkr.ecr.<<region>>.amazonaws.com/kafka-streams-msk:latest
docker tag twitter-stream-producer:latest <<account_id>>.dkr.ecr.<<region>>.amazonaws.com/twitter-stream-producer:latest

Run the following command to push images to your Amazon ECR repositories:

docker push <<account_id>>.dkr.ecr.<<region>>.amazonaws.com/kafka-streams-msk:latest 
docker push <<account_id>>.dkr.ecr.<<region>>.amazonaws.com/twitter-stream-producer:latest

Now you should see images in your Amazon ECR repository (see the following screenshot).

Create an ECS cluster with a Fargate task and service definitions

Use the provided CloudFormation template to create your ECS cluster, Fargate task, and service definitions. Make sure to have your Twitter API Bearer Token ready.

Run the streaming application

When the CloudFormation stack is complete, it automatically deploys your applications. After approximately 10 minutes, all your apps should be up and running, aggregating, and producing results. You can see the result in Amazon CloudWatch logs or by navigating to the Logs tab of the Fargate task.

Improvements, considerations, and best practices

Consider the following when implementing this solution:

Fargate enables you to run and maintain a specified number of instances of a task definition simultaneously in a cluster. If any of your tasks should fail or stop for any reason, the Fargate scheduler launches another instance of your task definition to replace it in order to maintain the desired number of tasks in the service. Fargate is not recommended for workloads requiring privileged Docker permissions or workloads requiring more than 4 vCPU or 30 GB of memory (consider whether you can break up your workload into more, smaller containers that each use fewer resources).
Kafka Streams resiliency and availability are provided by state stores. These state stores can either be an in-memory hash map (as used in this post), or another convenient data structure (for example, a RocksDB database, which is recommended for production). The Kafka Streams application may embed more than one local state store that can be accessed via APIs to store and query data required for processing. In addition, Kafka Streams makes sure that the local state stores are robust to failures. For each state store, it maintains a replicated changelog Kafka topic in which it tracks any state updates. If your app restarts after a crash, it replays the changelog Kafka topic and recreates an in-memory state store.
The AWS Glue Schema Registry is out of scope for this post, but should be considered in order to centrally discover, validate, and control the evolution of streaming data using registered Apache Avro schemas. Some of the benefits that come with it are data policy enforcement, data discovery, controlled schema evolution, and fault-tolerant streaming (data) pipelines.
To improve availability, enable three (the maximum as of this writing) Availability Zone replications within a Region. Amazon MSK continuously monitors cluster health, and if a component fails, Amazon MSK automatically replaces it.
When you enable three Availability Zones for your MSK cluster, you not only improve availability, but also improve cluster performance. You spread the load between a larger number of brokers, and can add more partitions per topic.
We highly encourage you to enable encryption at rest, TLS encryption in transit (client-to-broker, broker-to-broker), TLS-based certificate authentication, and SASL/SCRAM authentication, which can be secured by AWS Secrets Manager.
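
The state store recovery mentioned in the considerations above (replaying the changelog topic after a restart) can be illustrated with a few lines of plain Python. This is a conceptual sketch, not Kafka Streams code: the changelog is modeled as a list of (key, value) pairs, a value of None stands for a delete marker, and the function name `restore_state` is an assumption.

```python
def restore_state(changelog):
    """Rebuild an in-memory key-value store by replaying a changelog in order.

    changelog: iterable of (key, value) pairs; a value of None deletes the key.
    """
    state = {}
    for key, value in changelog:
        if value is None:
            state.pop(key, None)  # delete marker: forget the key
        else:
            state[key] = value    # later updates overwrite earlier ones
    return state

# Replaying the updates in order leaves only the latest value per key.
print(restore_state([("#aws", 1), ("#aws", 2), ("#msk", 1), ("#msk", None)]))
```

Because only the latest value per key matters, replaying the full log always converges to the state the application had before it stopped.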

Clean up

To clean up your resources, delete the CloudFormation stacks that you launched as part of this post. You can delete these resources via the AWS CloudFormation console or via the AWS Command Line Interface (AWS CLI).

Conclusion

In this post, we demonstrated how to build a scalable and resilient real-time stream processing application. We built the solution using the Kafka Streams API, Amazon MSK, and Fargate. We also discussed improvements, considerations, and best practices. You can use this architecture as a reference in your migrations or new workloads. Try it out and share your experience in the comments!

About the Author

Karen Grygoryan, Data Architect, AWS ProServe

Secure connectivity patterns to access Amazon MSK across AWS Regions

2021-08-03 Sam Mokhtari

Post Syndicated from Sam Mokhtari original https://aws.amazon.com/blogs/big-data/secure-connectivity-patterns-to-access-amazon-msk-across-aws-regions/

AWS customers often segment their workloads across accounts and Amazon Virtual Private Cloud (Amazon VPC) to streamline access management while being able to expand their footprint. As a result, in some scenarios you, as an AWS customer, need to make an Amazon Managed Streaming for Apache Kafka (Amazon MSK) cluster accessible to Apache Kafka clients not only in the same Amazon VPC as the cluster but also in a remote Amazon VPC. A guest post by Goldman Sachs presented cross-account connectivity patterns to an MSK cluster using AWS PrivateLink. Inspired by the work of Goldman Sachs, this post demonstrates additional connectivity patterns that can support both cross-account and cross-Region connectivity to an MSK cluster. We also developed sample code that supports the automation of the creation of resources for the connectivity pattern based on AWS PrivateLink.

Overview

Amazon MSK makes it easy to run Apache Kafka clusters on AWS. It’s a fully managed streaming service that automatically configures and maintains Apache Kafka clusters and Apache ZooKeeper nodes for you. Amazon MSK lets you focus on building your streaming solutions, supports familiar Apache Kafka ecosystem tools (such as MirrorMaker, Kafka Connect, and Kafka Streams), and helps you avoid the challenges of managing the Apache Kafka infrastructure and operations.

If you have workloads segmented across several VPCs and AWS accounts, there may be scenarios in which you need to make an Amazon MSK cluster accessible to Apache Kafka clients across VPCs. To provide secure connections between resources across multiple VPCs, AWS provides several networking constructs. Let’s get familiar with these before discussing the different connectivity patterns:

Amazon VPC peering is the simplest networking construct that enables bidirectional connectivity between two VPCs. You can use this connection type to enable VPCs across accounts and AWS Regions to communicate with each other using private IP addresses.
AWS Transit Gateway provides a highly available and scalable design for connecting VPCs. Unlike VPC peering, which can connect VPCs across Regions directly, AWS Transit Gateway is a regional service, but you can use inter-Region peering between transit gateways to route traffic across Regions.

AWS PrivateLink is an AWS networking service that provides private access to a specific service instead of all resources within a VPC and without traversing the public internet. You can use this service to expose your own application in a VPC to other users or applications in another VPC via an AWS PrivateLink-powered service (referred to as an endpoint service). Other AWS principals can then create a connection from their VPC to your endpoint service using an interface VPC endpoint.

Amazon MSK networking

When you create an MSK cluster, either via the AWS Management Console or AWS Command Line Interface (AWS CLI), it’s deployed into a managed VPC with brokers in private subnets (one per Availability Zone) as shown in the following diagram. Amazon MSK also creates the Apache ZooKeeper nodes in the same private subnets.

The brokers in the cluster are made accessible to clients in the customer VPC through elastic network interfaces (ENIs) that appear in the customer account. The security groups on the ENIs dictate the source and type of ingress and egress traffic allowed on the brokers.

IP addresses from the customer VPC are attached to the ENIs, and all network traffic stays within the AWS network and is not accessible to the internet.

Connections between clients and an MSK cluster are always private.

This blog demonstrates four connectivity patterns to securely access an MSK cluster from a remote VPC. The following table lists these patterns and their key characteristics. Each pattern aligns with the networking constructs discussed earlier.

	VPC Peering	AWS Transit Gateway	AWS PrivateLink with a single NLB	AWS PrivateLink with multiple NLBs
Bandwidth	Limited by instance network performance and flow limits.	Up to 50 Gbps	10 Gbps per AZ	10 Gbps per AZ
Pricing	Data transfer charge (free if data transfer is within AZs)	Data transfer charge + hourly charge per attachment	Data transfer charge + interface endpoint charge + Network load balancer charge	Data transfer charge + interface endpoint charge + Network load balancer charge
Scalability	Recommended for smaller number of VPCs	No limit on number of VPCs	No limit on number of VPCs	No limit on number of VPCs

Let’s explore these connectivity options in more detail.

VPC peering

To access an MSK cluster from a remote VPC, the first option is to create a peering connection between the two VPCs.

Let’s say you use Account A to provision an MSK cluster in the us-east-1 Region, as shown in the following diagram. Now, you have an Apache Kafka client in the customer VPC in Account B that needs to access this MSK cluster. To enable this connectivity, you just need to create a peering connection between the VPC in Account A and the VPC in Account B. You should also consider implementing fine-grained network access controls with security groups to make sure that only specific resources are accessible between the peered VPCs.

Because VPC peering works across Regions, you can extend this architecture to provide access to Apache Kafka clients in another Region. As shown in the following diagram, to provide access to Kafka clients in the VPC of Account C, you just need to create another peering connection between the VPC in Account C and the VPC in Account A. The same networking principles apply to make sure only specific resources are reachable. In the following diagram, a solid line indicates a direct connection from the Kafka client to the MSK cluster, whereas a dotted line indicates a connection flowing via VPC peering.

VPC peering has the following benefits:

Simplest connectivity option.
Low latency.
No bandwidth limits (it is just limited by instance network performance and flow limits).
Lower overall cost compared to other VPC-to-VPC connectivity options.

However, it has some drawbacks:

VPC peering doesn’t support transitive peering, which means that only directly peered VPCs can communicate with each other.
You can’t use this connectivity pattern when there are overlapping IPv4 or IPv6 CIDR blocks in the VPCs.
Managing access can become challenging as the number of peered VPCs grows.

You can use VPC peering when the number of VPCs to be peered is less than 10.
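The overlapping-CIDR restriction listed among the drawbacks can be checked before you request a peering connection. The following is a minimal sketch using Python's standard `ipaddress` module; the VPC names and CIDR blocks are hypothetical and only stand in for the accounts shown in the diagrams.

```python
import ipaddress
from itertools import combinations


def overlapping_cidrs(vpc_cidrs):
    """Return pairs of VPC names whose CIDR blocks overlap."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in vpc_cidrs.items()}
    clashes = []
    for (name_a, net_a), (name_b, net_b) in combinations(networks.items(), 2):
        # Compare only networks of the same IP version (IPv4 with IPv4, IPv6 with IPv6).
        if net_a.version == net_b.version and net_a.overlaps(net_b):
            clashes.append((name_a, name_b))
    return clashes


# Hypothetical CIDR blocks; replace with the blocks of the VPCs you plan to peer.
vpcs = {
    "account-a-msk": "10.0.0.0/16",
    "account-b-client": "10.1.0.0/16",
    "account-c-client": "10.0.128.0/20",
}
print(overlapping_cidrs(vpcs))
```

Any pair reported here cannot be peered directly, which is one of the cases where the AWS PrivateLink patterns described later may be a better fit.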

AWS Transit Gateway

AWS Transit Gateway can provide scalable connectivity to MSK clusters. The following diagram demonstrates how to use this service to provide connectivity to an MSK cluster. Let’s again consider a VPC in Account A running an MSK cluster, and an Apache Kafka client in a remote VPC in Account B that needs to connect to this MSK cluster. You set up AWS Transit Gateway to connect these VPCs and use route tables on the transit gateway to control the routing.

To extend this architecture to support access from a VPC in another Region, you need to use another transit gateway because this service can’t span Regions. In other words, for the Apache Kafka client in Account C in us-west-2 to connect to the MSK cluster, you need to peer another transit gateway in us-west-2 with the transit gateway in us-east-1 and work with the route tables to manage access to the MSK cluster. If you need to connect another account in us-west-2, you don’t need an additional transit gateway. The Apache Kafka clients in the new account (Account D) simply require a connection to the existing transit gateway in us-west-2 and the appropriate route tables.

The hub and spoke model for AWS Transit Gateway simplifies management at scale because VPCs only need to connect to one transit gateway per Region to gain access to the MSK cluster in the attached VPCs. However, this setup has some drawbacks:

Unlike VPC peering in which you only pay for data transfer charges, Transit Gateway has an hourly charge per attachment in addition to the data transfer fee.
This connectivity pattern doesn’t support transitive routing.
Unlike VPC peering, Transit Gateway adds a hop between VPCs, which may increase latency.
The maximum bandwidth (burst) per Availability Zone per VPC connection is 50 Gbps.

You can use AWS Transit Gateway when you need to provide scalable access to the MSK cluster.

AWS PrivateLink

To provide private, unidirectional access from an Apache Kafka client to an MSK cluster across VPCs, you can use AWS PrivateLink. This also eliminates the need to expose the entire VPC or subnet and prevents issues like having to deal with overlapping CIDR blocks between the VPC that hosts the MSK cluster ENIs and the remote Apache Kafka client VPC.

Let’s do a quick recap of the architecture as explained in the blog post How Goldman Sachs builds cross-account connectivity to their Amazon MSK clusters with AWS PrivateLink.

Let’s assume Account A has a VPC with three private subnets and an MSK cluster with three broker nodes in a 3-AZ deployment. You have three ENIs, one per subnet, each representing a broker node. Each ENI gets a private IPv4 address from its subnet’s CIDR block and an MSK broker DNS endpoint. To expose the MSK cluster in Account A to other accounts via AWS PrivateLink, you have to create a VPC endpoint service in Account A. The VPC endpoint service requires the entity, in this case the MSK cluster, to be fronted by a Network Load Balancer (NLB).

You can choose from two patterns using AWS PrivateLink to provide cross-account access to Amazon MSK: with a single NLB or multiple NLBs.

AWS PrivateLink connectivity pattern with a single NLB

The following diagram illustrates access to an MSK cluster via an AWS PrivateLink connectivity pattern with a single NLB.

In this pattern, you have a single dedicated internal NLB in Account A. The NLB has a separate listener for each MSK broker. Because this pattern has a single NLB endpoint, each listener needs to listen on a unique port. In the preceding diagram, the ports are depicted as 8443, 8444, and 8445. Correspondingly, for each listener, you have a unique target group, each of which has a single registered target: the IP address of an MSK broker ENI. Because the ports are different from the advertised listeners defined in the MSK cluster for each of the broker nodes, the advertised listeners configuration for each of the broker nodes in the cluster should be updated. Additionally, one target group has all the broker ENI IPs as targets and a corresponding listener (on port 9094), which means a request coming to the NLB on port 9094 can be routed to any of the MSK brokers.
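To make the listener layout concrete, here is a minimal sketch that maps each NLB listener port to the broker ENI IPs registered in its target group. The port numbers come from the diagram described above; the function name and the ENI addresses are hypothetical, and the code only plans the layout rather than calling any AWS API.

```python
SHARED_PORT = 9094        # listener whose target group contains every broker
FIRST_BROKER_PORT = 8443  # first per-broker listener port shown in the diagram


def single_nlb_plan(broker_eni_ips):
    """Map each NLB listener port to the broker ENI IPs in its target group."""
    plan = {}
    for index, ip in enumerate(broker_eni_ips):
        # One listener and one single-target target group per broker.
        plan[FIRST_BROKER_PORT + index] = [ip]
    # One extra listener that can route to any broker.
    plan[SHARED_PORT] = list(broker_eni_ips)
    return plan


# Hypothetical ENI addresses, one per private subnet.
brokers = ["10.0.1.10", "10.0.2.10", "10.0.3.10"]
for port, targets in single_nlb_plan(brokers).items():
    print(port, targets)
```

With three brokers this yields four listeners and four target groups on the one NLB, which is the extra bookkeeping the single NLB pattern asks you to maintain.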

In Account B, you need to create a corresponding VPC endpoint for the VPC endpoint service in Account A. Apache Kafka clients in Account B can connect to the MSK cluster in Account A by directing their requests to the VPC endpoint. For Transport Layer Security (TLS) to work, you also need an Amazon Route 53 private hosted zone with the domain name kafka.<region of the amazon msk cluster>.amazonaws.com, with alias resource record sets for each of the broker endpoints pointing to the VPC endpoint in Account B.
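The alias record sets described above can be sketched as a Route 53 change batch. The dictionary below follows the shape that the `ChangeBatch` parameter of the Route 53 `change_resource_record_sets` API accepts; every DNS name and the hosted zone ID are hypothetical placeholders, and the sketch only builds the request body without sending it.

```python
def alias_change_batch(broker_dns_names, endpoint_dns_name, endpoint_zone_id):
    """Build one alias A record per broker, all pointing at the VPC endpoint."""
    changes = []
    for name in broker_dns_names:
        changes.append({
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": name,
                "Type": "A",
                "AliasTarget": {
                    # Hosted zone ID and DNS name of the interface VPC endpoint.
                    "HostedZoneId": endpoint_zone_id,
                    "DNSName": endpoint_dns_name,
                    "EvaluateTargetHealth": False,
                },
            },
        })
    return {"Changes": changes}


# Hypothetical broker and endpoint names, for illustration only.
broker_names = [
    "b-1.example.abc123.c2.kafka.us-east-1.amazonaws.com",
    "b-2.example.abc123.c2.kafka.us-east-1.amazonaws.com",
    "b-3.example.abc123.c2.kafka.us-east-1.amazonaws.com",
]
batch = alias_change_batch(
    broker_names,
    "vpce-0123456789abcdef0-example.vpce-svc-0123456789abcdef0.us-east-1.vpce.amazonaws.com",
    "Z0EXAMPLEZONEID",
)
print(len(batch["Changes"]), "alias records")
```

Because the broker names stay unchanged, the TLS certificates presented by the brokers still match the host names that clients in Account B resolve.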

In this pattern, for the Apache Kafka clients local to the VPC with the Amazon MSK broker ENIs in Account A to connect to the MSK cluster, you need to set up a Route 53 private hosted zone, similar to Account B, with alias resource record sets for each of the broker endpoints pointing to the NLB endpoint. This is because the ports in the advertised.listeners configuration have been changed for the brokers and the default Amazon MSK broker endpoints won’t work.

To extend this connectivity pattern and provide access to Apache Kafka clients in a remote Region, you need to create a peering connection (which can be via VPC peering or AWS Transit Gateway) between the VPC in Account B and the VPC in the remote Region. The same networking principles apply to make sure only specific intended resources are reachable.

AWS PrivateLink connectivity pattern with multiple NLBs

In the second pattern, you don’t share one VPC endpoint service or NLB across multiple MSK brokers. Instead, you have an independent set for each broker. Each NLB has only one listener listening on the same port (9094) for requests to each Amazon MSK broker. Correspondingly, you have a separate VPC endpoint service for each NLB and each broker. Just like in the first pattern, in Account B, you need a Route 53 private hosted zone to alias broker DNS endpoints to VPC endpoints. In this case, each broker endpoint is aliased to its own specific VPC endpoint.

This pattern has the advantage of not having to modify the advertised listeners configuration in the MSK cluster. However, there is an additional cost of deploying more NLBs, one for each broker. Furthermore, in this pattern, Apache Kafka clients that are local to the VPC with the MSK broker ENIs in Account A can connect to the cluster as usual with no additional setup needed. The following diagram illustrates this setup.
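The trade-off between the two AWS PrivateLink patterns can be summarized by counting the resources each one needs in Account A. This is a rough sketch based only on the descriptions above; the function and key names are hypothetical, and the target group count for the multiple NLB pattern is inferred from each NLB having a single listener.

```python
def privatelink_resources(pattern, broker_count):
    """Count the Account A resources implied by each PrivateLink pattern."""
    if pattern == "single-nlb":
        return {
            "nlbs": 1,
            # One listener and target group per broker, plus the shared one on 9094.
            "listeners": broker_count + 1,
            "target_groups": broker_count + 1,
            "endpoint_services": 1,
            "update_advertised_listeners": True,
        }
    if pattern == "multiple-nlb":
        return {
            "nlbs": broker_count,
            "listeners": broker_count,
            "target_groups": broker_count,
            "endpoint_services": broker_count,
            "update_advertised_listeners": False,
        }
    raise ValueError(f"unknown pattern: {pattern}")


for name in ("single-nlb", "multiple-nlb"):
    print(name, privatelink_resources(name, broker_count=3))
```

For a three-broker cluster, the single NLB pattern trades two fewer NLBs for a change to the cluster's advertised listeners, while the multiple NLB pattern leaves the cluster configuration untouched.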

To extend this connectivity pattern and provide access to Apache Kafka clients in a remote Region, you need to create a peering connection between the VPC in Account B and the VPC in the remote Region.

You can use the sample code provided on GitHub to set up the AWS PrivateLink connectivity pattern with multiple NLBs for an MSK cluster. The intent of the code is to automate the creation of multiple resources instead of wiring them manually.

These patterns have the following benefits:

They are scalable solutions and do not limit the number of consumer VPCs.
AWS PrivateLink allows for VPC CIDR ranges to overlap.
You don’t need path definitions or a route table (access only to the MSK cluster), so it’s easier to manage.

The drawbacks are as follows:

The VPC endpoint and service must be in the same Region.
The VPC endpoints support IPv4 traffic only.
The endpoints can’t be transferred from one VPC to another.

You can use either connectivity pattern when you need your solution to scale to a large number of Amazon VPCs that can consume each service. You can also use either pattern when the cluster and client VPCs have overlapping IP addresses and when you want to restrict access to only the MSK cluster instead of the VPC itself. The single NLB pattern adds complexity to the architecture because you need to maintain an additional target group and listener that has all brokers registered, as well as keep the advertised.listeners property up to date. You can offset that complexity with the multiple NLB pattern but at an additional cost for the increased number of NLBs.

Conclusion

In this post, we explored different secure connectivity patterns to access an MSK cluster from a remote VPC. We also discussed the advantages, challenges, and limitations of each connectivity pattern. You can use this post as guidance to help you identify an appropriate connectivity pattern to address your requirements for accessing an MSK cluster. You can also use a combination of connectivity patterns to address your use case.

References

To read more about the solutions that inspired this post, see How Goldman Sachs builds cross-account connectivity to their Amazon MSK clusters with AWS PrivateLink and the webinar Cross-Account Connectivity Options for Amazon MSK.

About the Authors

Dr. Sam Mokhtari is a Senior Solutions Architect in AWS. His main area of depth is data and analytics, and he has published more than 30 influential articles in this field. He is also a respected data and analytics advisor who led several large-scale implementation projects across different industries including energy, health, telecom, and transport.

Pooja Chikkala is a Solutions Architect in AWS. Big data and analytics is her area of interest. She has 13 years of experience leading large-scale engineering projects with expertise in designing and managing both on-premises and cloud-based infrastructures.

Rajeev Chakrabarti is a Principal Developer Advocate with the Amazon MSK team. He has worked for many years in the big data and data streaming space. Before joining the Amazon MSK team, he was a Streaming Specialist SA helping customers build streaming pipelines.

Imtiaz (Taz) Sayed is the WW Tech Leader for Analytics at AWS. He enjoys engaging with the community on all things data and analytics, and can be reached on LinkedIn.

Field Notes: Deliver Messages Using an IoT Rule Action to Amazon Managed Streaming for Apache Kafka

2021-07-30 Siddhesh Keluskar

Post Syndicated from Siddhesh Keluskar original https://aws.amazon.com/blogs/architecture/field-notes-deliver-messages-using-an-iot-rule-action-to-amazon-managed-streaming-for-apache-kafka/

With IoT devices scaling up rapidly, real-time data integration and data processing have become a major challenge. This is why customers often choose Message Queuing Telemetry Transport (MQTT) for message ingestion, and Apache Kafka to build a real-time streaming data pipeline. AWS IoT Core now supports a new IoT rule action to deliver messages from your devices directly to your Amazon MSK or self-managed Apache Kafka clusters for data analysis and visualization, without you having to write a single line of code.

In this post, you learn how to set up a real-time streaming data pipeline for IoT data using an AWS IoT Core rule and Amazon Managed Streaming for Apache Kafka. The audience for this post is architects and developers creating solutions that ingest sensor data and high-volume, high-frequency streaming data, and process it using a Kafka cluster. This post also describes the SASL_SSL (user name and password) method to access your Kafka cluster.

Overview of solution

Figure 1 represents an IoT data ingestion pipeline where multiple IoT devices connect to AWS IoT Core. These devices can send messages to AWS IoT Core over MQTT or HTTPS protocol. AWS IoT Core rule for Kafka is configured to intercept messages from the desired topic and route them to the Apache Kafka cluster. These messages can then be received by multiple consumers connected to the Kafka cluster. In this post, we will use AWS Python SDK to represent IoT devices and publish messages.

Figure 1 – Architecture representing an IoT ingestion pipeline

Prerequisites

Familiarity with MQTT protocol and Kafka command line interface (CLI)
AWS account with console access

Walkthrough

I will show you how to stream AWS IoT data on an Amazon MSK cluster using AWS IoT Core rules and SASL_SSL SCRAM-SHA-512 mechanism of authentication. Following are the steps for this walkthrough:

Create an Apache Kafka cluster using Amazon MSK.
Configure an Apache Kafka cluster for SASL_SSL authentication.
Set up a Kafka producer and consumer on AWS Cloud9 to test the setup.
Configure an IoT Rule action to send a message to Kafka.

1. Create an Apache Kafka cluster using Amazon MSK

The first step is to create an Apache Kafka cluster. Open the service page for Amazon MSK by signing in to your AWS account.
Choose Create Cluster, and select Custom Create. AWS IoT Core supports SSL and SASL_SSL based authentication for Amazon MSK. We are using custom settings to configure these authentication methods.

Figure 2 – Screenshot showing how to create an MSK cluster.

Assign a cluster name, and select the Apache Kafka version of your choice. For this walkthrough, we are using 2.6.1.
Keep the configuration as Amazon MSK default configuration. Choose your Networking components: VPC, number of Availability Zones (a minimum of two is required for high availability), and subnets.
Choose SASL/SCRAM authentication (default selection is None).

Use the encryption settings as shown in the following screenshot:

Figure 3 – Screenshot showing Encryption Settings

Keep the monitoring settings as Basic Monitoring, and choose Create Cluster.
It takes approximately 15–20 minutes for the cluster to be created.

2. Configure an Apache Kafka cluster for SASL_SSL authentication

When the Apache Kafka cluster is available, we must then configure authentication for producers and consumers.
Open AWS Secrets Manager, choose Store a new secret, and then choose Other type of secrets.
Enter user name and password as two keys, and assign the user name and password values of your choice.

Figure 4 – Screenshot showing how to store a new secret

Next, select Add new key link.
Note: Do not select DefaultEncryptionKey! A secret created with the default key cannot be used with an Amazon MSK cluster. Only a Customer managed key can be used as an encryption key for an Amazon MSK–compatible secret.
To add a new key, select Create key, select Symmetric key, and choose Next.
Type an Alias, and choose Next.
Select appropriate users as Key administrators, and choose Next.
Review the configuration, and select Finish.

Figure 5 – Select the newly-created Customer Managed Key as the encryption key

Figure 6 – Specify the key value pair to be stored in this secret

Select the newly-created Customer Managed Key as the encryption key, and choose Next.
Provide a Secret name (Secret name must start with AmazonMSK_ for Amazon MSK cluster to recognize it), for example, AmazonMSK_SECRET_NAME.
Choose Next twice, and then choose Store.

Figure 7 – Storing a new secret

Open the Amazon MSK service page, and select your Amazon MSK cluster. Choose Associate Secrets, and then select Choose secrets (this will only be available after the cluster is created and in Active Status).
Choose the secret we created in the previous step, and choose Associate secrets. Only the secret name starting with AmazonMSK_ will be visible.

3. Set up Kafka producer and consumer on AWS Cloud9 to test the setup

To test whether the cluster and authentication are correctly set up, we use the Kafka CLI tools in the AWS Cloud9 IDE.
Choose Create environment, and follow the console to create a new AWS Cloud9 environment. You can also use an existing AWS Cloud9 environment, or an environment that already has a Kafka consumer and producer configured.
This blog requires Java 8 or later.
Verify your version of Java with the command: java -version. Next, add your AWS Cloud9 instance Security Group to inbound rules of your Kafka cluster.
Open the Amazon MSK page and select your cluster, then choose Security groups applied.

Figure 8 – Selecting Security Groups Applied

Next, choose Inbound rules, and then choose Edit inbound rules.
Choose Add rule, and add Custom TCP ports 2181 and 9096 with Security Group of your AWS Cloud9 instance.
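If you prefer to script this step, the two inbound rules can be expressed in the `IpPermissions` shape used by the Amazon EC2 `authorize_security_group_ingress` API. This is a minimal sketch that only builds the rule list; the function name and the security group ID are hypothetical, and the ports are the ones named in the step above.

```python
def msk_ingress_rules(client_security_group_id, ports=(2181, 9096)):
    """Build inbound TCP rules that allow a client security group to reach the cluster."""
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            # Reference the client's security group instead of an IP range.
            "UserIdGroupPairs": [{"GroupId": client_security_group_id}],
        }
        for port in ports
    ]


# Hypothetical security group ID of the AWS Cloud9 instance.
rules = msk_ingress_rules("sg-0123456789abcdef0")
for rule in rules:
    print(rule["FromPort"], rule["UserIdGroupPairs"][0]["GroupId"])
```

Port 2181 is the ZooKeeper connection and port 9096 is the SASL/SCRAM bootstrap port used later in this walkthrough.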

Figure 9 – Screenshot showing rules applied

The Security Group for your AWS Cloud9 can be found in the Environment details section of your AWS Cloud9 instance.

Figure 10 – Screenshot showing Edit Inbound Rules added

Use Port range values as per the client information section of your Bootstrap server and Zookeeper connection.

Figure 11 – Screenshot showing where to access ‘View client information’

Figure 12 – Screenshot showing client integration information

Invoke the following commands on AWS Cloud9 console to download and extract Kafka CLI tools:

wget https://archive.apache.org/dist/kafka/2.2.1/kafka_2.12-2.2.1.tgz
tar -xzf kafka_2.12-2.2.1.tgz
cd kafka_2.12-2.2.1/
mkdir client && cd client

Next, create a file users_jaas.conf, and add the user name and password that you added in Secrets Manager:

sudo nano users_jaas.conf

Paste the following configuration and save. Verify that the user name and password match the values saved in Secrets Manager.

KafkaClient {
   org.apache.kafka.common.security.scram.ScramLoginModule required
   username="hello"
   password="world";
};

Invoke the following commands:

export KAFKA_OPTS=-Djava.security.auth.login.config=$PWD/users_jaas.conf

Create a new file with name client_sasl.properties.

sudo nano client_sasl.properties

Copy the following content to file:

security.protocol=SASL_SSL
sasl.mechanism=SCRAM-SHA-512
ssl.truststore.location=<path-to-keystore-file>/kafka.client.truststore.jks

<path-to-keystore-file> can be retrieved by running following command:

cd ~/environment/kafka_2.12-2.2.1/client
echo $PWD

Next, copy the cacerts file from your Java lib folder to client folder. The path of Java lib folder might be different based on your version of Java.

cd ~/environment/kafka_2.12-2.2.1/client
cp /usr/lib/jvm/java-11-amazon-corretto.x86_64/lib/security/cacerts kafka.client.truststore.jks
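As an alternative to editing the two client files by hand, they could be generated from a short script. The following is a minimal sketch; the function names are hypothetical, the file contents mirror the snippets above, and the sketch assumes the user name and password contain no double quotes. It writes into a temporary directory so that it does not touch your Kafka folder.

```python
import tempfile
from pathlib import Path


def render_jaas(username, password):
    """Return the JAAS configuration for SCRAM authentication."""
    return (
        "KafkaClient {\n"
        "   org.apache.kafka.common.security.scram.ScramLoginModule required\n"
        f'   username="{username}"\n'
        f'   password="{password}";\n'
        "};\n"
    )


def render_client_properties(truststore_path):
    """Return the client properties for SASL_SSL with SCRAM-SHA-512."""
    return (
        "security.protocol=SASL_SSL\n"
        "sasl.mechanism=SCRAM-SHA-512\n"
        f"ssl.truststore.location={truststore_path}\n"
    )


def write_client_files(client_dir, username, password):
    """Write users_jaas.conf and client_sasl.properties into client_dir."""
    client_dir = Path(client_dir)
    client_dir.mkdir(parents=True, exist_ok=True)
    # The truststore itself is still copied separately, as shown above.
    truststore = client_dir / "kafka.client.truststore.jks"
    (client_dir / "users_jaas.conf").write_text(render_jaas(username, password))
    (client_dir / "client_sasl.properties").write_text(render_client_properties(truststore))
    return sorted(path.name for path in client_dir.iterdir())


with tempfile.TemporaryDirectory() as scratch:
    print(write_client_files(scratch, "hello", "world"))
```

To use it for the walkthrough, you would pass your kafka_2.12-2.2.1/client folder and the credentials stored in Secrets Manager instead of the temporary directory and the sample values.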

Figure 13 – Screenshot showing client integration information

Save the previous endpoints as BOOTSTRAP_SERVER and ZOOKEEPER_STRING.

export BOOTSTRAP_SERVER=b-2.iot-demo-cluster.slu5to.c13.kafka.us-east-1.amazonaws.com:9096,b-1.iot-demo-cluster.slu5to.c13.kafka.us-east-1.amazonaws.com:9096
export ZOOKEEPER_STRING=z-1.iot-demo-cluster.slu5to.c13.kafka.us-east-1.amazonaws.com:2181,z-3.iot-demo-cluster.slu5to.c13.kafka.us-east-1.amazonaws.com:2181,z-2.iot-demo-cluster.slu5to.c13.kafka.us-east-1.amazonaws.com:2181

Save the Topic name in an environment variable.

TOPIC="AWSKafkaTutorialTopic"

Next, create a new Topic using the Zookeeper String.

cd ~/environment/kafka_2.12-2.2.1
bin/kafka-topics.sh --create --zookeeper $ZOOKEEPER_STRING --replication-factor 2 --partitions 1 --topic $TOPIC

Confirm that you receive the message: Created topic AWSKafkaTutorialTopic.
Start Kafka producer by running this command in your Kafka folder:

cd ~/environment/kafka_2.12-2.2.1

bin/kafka-console-producer.sh --broker-list $BOOTSTRAP_SERVER --topic $TOPIC --producer.config client/client_sasl.properties

Next, open a new Terminal by pressing the + button, and initiate the following commands to configure the environment variables:

export BOOTSTRAP_SERVER=b-2.iot-demo-cluster.slu5to.c13.kafka.us-east-1.amazonaws.com:9096,b-1.iot-demo-cluster.slu5to.c13.kafka.us-east-1.amazonaws.com:9096
TOPIC="AWSKafkaTutorialTopic"

cd ~/environment/kafka_2.12-2.2.1/client
export KAFKA_OPTS=-Djava.security.auth.login.config=$PWD/users_jaas.conf

cd ~/environment/kafka_2.12-2.2.1/
bin/kafka-console-consumer.sh --bootstrap-server $BOOTSTRAP_SERVER --topic $TOPIC --from-beginning --consumer.config client/client_sasl.properties

Now that you have a Kafka consumer and producer opened side-by-side, you can type in producer terminal and verify it from the consumer terminal.

Figure 14 – Screenshot showing Kafka consumer and producer opened side-by-side
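The same SASL_SSL settings used by the console tools can also be expressed for a Python client. The sketch below only builds a settings dictionary; the keyword names are those of the third-party kafka-python library (`KafkaProducer` and `KafkaConsumer`), which is an assumption of this example and is not used elsewhere in this post. The broker names and credentials are placeholders.

```python
def scram_client_kwargs(bootstrap_servers, username, password, ca_file=None):
    """Build SASL_SSL / SCRAM-SHA-512 settings in kafka-python keyword form."""
    kwargs = {
        # BOOTSTRAP_SERVER in this post is a comma-separated string.
        "bootstrap_servers": bootstrap_servers.split(","),
        "security_protocol": "SASL_SSL",
        "sasl_mechanism": "SCRAM-SHA-512",
        "sasl_plain_username": username,
        "sasl_plain_password": password,
    }
    if ca_file:
        # Optional PEM bundle; a JKS truststore cannot be passed here directly.
        kwargs["ssl_cafile"] = ca_file
    return kwargs


# Placeholder broker names and the sample credentials from the JAAS file.
settings = scram_client_kwargs(
    "b-1.example.kafka.us-east-1.amazonaws.com:9096,b-2.example.kafka.us-east-1.amazonaws.com:9096",
    "hello",
    "world",
)
print(settings["bootstrap_servers"])
```

With kafka-python installed, these settings could be passed as `KafkaProducer(**settings)` or `KafkaConsumer("AWSKafkaTutorialTopic", **settings)` from a host that can reach the brokers.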

4. Configure an IoT Rule action to send a message to Kafka

Create an AWS Identity and Access Management (IAM) role with SecretsManager permissions to allow IoT rule to access Kafka KeyStore in AWS Secrets Manager.
Sign in to IAM, select Policies from the left-side panel, choose Create policy.
Select Choose a service, and search for AWS KMS.
In Actions, choose All AWS KMS actions. Select All resources in the Resources section, and choose Next.
Name the policy KMSfullAccess, and choose Create policy.
Select Roles from the left-side panel, choose Create Role, then select EC2 from Choose a use case, and choose Next:Permissions.
Assign the policy SecretsManagerReadWrite. Note: if you do not select EC2, SecretsManager Policy will be unavailable.
Search for and select SecretsManagerReadWrite and KMSfullAccess Policy.

Add tags, type Role name as kafkaSASLRole, and choose Create Role.

After the Role is created, search the newly-created Role name to view the Summary of the role.
Choose the Trust relationships tab, and choose Edit trust relationship.

Enter the following trust relationship:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "iot.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
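A common cause of rule failures is a role whose trust relationship does not name the AWS IoT service. As a minimal sketch, the following helper checks a trust policy document for that statement; the function name is hypothetical, and the policy is the one shown above.

```python
import json

TRUST_POLICY = """
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {"Service": "iot.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }
  ]
}
"""


def allows_service(policy, service):
    """Return True if the trust policy lets the given service assume the role."""
    for statement in policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        principal = statement.get("Principal", {})
        services = principal.get("Service", []) if isinstance(principal, dict) else []
        if isinstance(services, str):
            services = [services]
        if "sts:AssumeRole" in actions and service in services:
            return True
    return False


policy = json.loads(TRUST_POLICY)
print(allows_service(policy, "iot.amazonaws.com"))
print(allows_service(policy, "ec2.amazonaws.com"))
```

A role created with the EC2 use case, as in the earlier step, trusts ec2.amazonaws.com by default, which is why the trust relationship has to be edited before the IoT rule can use the role.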

Choose Update Trust Policy.
Next, create a new AWS IoT Core rule by signing in to the AWS IoT Core service. Choose Act from the left side-menu, and select Rules.
Choose Create. Insert details for Name, Description, and Rule query statement, and then choose Add action. The following query is used for this post:
SELECT * FROM 'iot/topic'
Select Send a message to an Apache Kafka cluster. Next, choose Configure action.

Figure 15 – Screenshot to create a rule

Create a VPC destination (if you do not already have one).

Figure 16 – How to Create a VPC destination

Select the VPC ID of your Kafka cluster. Select a Security Group with access to Kafka cluster Security Group.
Choose security group settings of the EC2 instance we created, or the security group of Kafka cluster.
Choose Create Role, and then select Create Destination. It takes approximately 5–10 minutes for the Destination to be Enabled. After the status is Enabled, navigate back to the Rule creation page and select the VPC Destination.
Enter AWSKafkaTutorialTopic as Kafka topic (confirm there is no extra space after the topic name, or you will get an error). Do not update Key and Partition boxes.

Figure 17 – Screenshot showing how to enter the AWSKafkaTutorialTopic
Verify the Security Group of your VPC destination is added to the inbound list for your Kafka cluster.

Figure 18 – Showing Security Group for Kafka Cluster

Figure 19 – Screenshot showing Inbound rules

The first two Custom TCP entries are for AWS Cloud9 security group. The last two entries are for VPC endpoint.

Set the Client properties as follows:

bootstrap.servers = The TLS bootstrap string for Kafka cluster

security.protocol = SASL_SSL

ssl.truststore = EMPTY for Amazon MSK, enter SecretBinary template for self-managed Kafka

ssl.truststore.password = EMPTY for Amazon MSK, enter truststore password for self-managed Kafka

sasl.mechanism = SCRAM-SHA-512

Replace the secret name with your stored secret name starting with AmazonMSK_, replace the IAM role ARN with your IAM role ARN.
The secret and IAM role are created in previous steps of this post. Enter the following template in the sasl.scram.username field to retrieve username from Secrets Manager.

${get_secret('AmazonMSK_cluster_secret','SecretString','username','arn:aws:iam::318219976534:role/kafkaSASLRole')}

Perform a similar step for sasl.scram.password field:

${get_secret('AmazonMSK_cluster_secret','SecretString','password','arn:aws:iam::318219976534:role/kafkaSASLRole')}
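The client properties and the two get_secret templates can be assembled programmatically, which helps avoid quoting mistakes when you paste them into the console. This is a minimal sketch; the function names, the broker names, and the account ID in the role ARN are placeholders, and the template format is copied from the examples above.

```python
def get_secret_template(secret_name, key, role_arn, secret_type="SecretString"):
    """Render the substitution template that reads one key from a secret."""
    return "${get_secret('%s','%s','%s','%s')}" % (secret_name, secret_type, key, role_arn)


def kafka_action_client_properties(bootstrap_servers, secret_name, role_arn):
    """Build the client properties for the Kafka rule action with SASL/SCRAM."""
    if not secret_name.startswith("AmazonMSK_"):
        # Amazon MSK only recognizes secrets whose names carry this prefix.
        raise ValueError("secret name must start with AmazonMSK_")
    return {
        "bootstrap.servers": bootstrap_servers,
        "security.protocol": "SASL_SSL",
        "sasl.mechanism": "SCRAM-SHA-512",
        "sasl.scram.username": get_secret_template(secret_name, "username", role_arn),
        "sasl.scram.password": get_secret_template(secret_name, "password", role_arn),
    }


# Placeholder broker name and role ARN.
properties = kafka_action_client_properties(
    "b-1.example.kafka.us-east-1.amazonaws.com:9096",
    "AmazonMSK_cluster_secret",
    "arn:aws:iam::123456789012:role/kafkaSASLRole",
)
for name, value in properties.items():
    print(f"{name} = {value}")
```

The truststore properties are left out because this post leaves them empty for Amazon MSK.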

Choose Add action.
Choose Create rule.

Testing the data pipeline

Open MQTT test client from AWS IoT Core page.
Publish the message to the MQTT topic that you configured while creating the rule.
Keep the consumer session active (created in earlier step). You will see data published on the MQTT topic being streamed to Kafka consumer.
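Instead of the MQTT test client, a device can be simulated with the AWS SDK for Python. The following is a minimal sketch under a few assumptions: boto3 is installed, your credentials allow iot:Publish, the rule listens on iot/topic as in the query above, and the payload fields are invented for illustration. boto3 is imported inside `publish` so that the payload helper can be run without it.

```python
import json
import time


def build_payload(device_id, temperature_c, timestamp=None):
    """Serialize a sample sensor reading as a JSON string."""
    reading = {
        "deviceId": device_id,
        "temperature": temperature_c,
        "timestamp": int(time.time()) if timestamp is None else timestamp,
    }
    return json.dumps(reading)


def publish(topic, payload, region="us-east-1"):
    """Publish one message to AWS IoT Core over HTTPS."""
    import boto3  # imported lazily; only needed when actually publishing

    client = boto3.client("iot-data", region_name=region)
    client.publish(topic=topic, qos=1, payload=payload)


message = build_payload("sensor-1", 21.5, timestamp=1627600000)
print(message)
# publish("iot/topic", message)  # uncomment to send the message from your account
```

Each published message should appear in the Kafka consumer terminal shortly after the rule forwards it to AWSKafkaTutorialTopic.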

Figure 20 – Screenshot showing testing the data pipeline

Common troubleshooting checks

Confirm that your:

AWS Cloud9 Security Group is added to Amazon MSK Security Group Inbound rule
VPC endpoint Security Group is added to Amazon MSK Security Group Inbound rule
Topic is created in the Kafka cluster
IAM role has Secrets Manager and KMS permissions
Environment variables are correctly configured in terminal
Folder paths have been correctly followed

Cleaning up

To avoid incurring future charges, delete the following resources:

Amazon MSK cluster
AWS IoT Core rule
IAM role
Secrets Manager Secret
AWS Cloud9 instance

Conclusion

In this post, I showed you how to configure an IoT Rule Action to deliver messages to an Apache Kafka cluster using AWS IoT Core and Amazon MSK. You can now build a real-time streaming data pipeline by securely delivering MQTT messages to a highly scalable, durable, and reliable system using Apache Kafka.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Amazon MSK Replicator and MirrorMaker2: Choosing the right replication strategy for Apache Kafka disaster recovery and migrations

2021-06-17 Anusha Dharmalingam

Post Syndicated from Anusha Dharmalingam original https://aws.amazon.com/blogs/big-data/amazon-msk-replicator-and-mirrormaker2-choosing-the-right-replication-strategy-for-apache-kafka-disaster-recovery-and-migrations/

Customers need to replicate their Apache Kafka data for a variety of reasons, such as compliance requirements, cluster migrations, and disaster recovery (DR) implementations. However, the right replication strategy can vary depending on the application context. In this post, we walk through the different considerations for using Amazon MSK Replicator over Apache Kafka’s MirrorMaker 2, and help you choose the right replication solution for your use case. We also discuss how to make applications using Amazon Managed Streaming for Apache Kafka (Amazon MSK) resilient to disasters with a multi-Region Kafka architecture built on MSK Replicator.

Challenges with choosing DR strategies

Customers create business continuity plans and DR strategies to maximize resiliency for their applications, because downtime or data loss can result in losing revenue or halting operations. DR planning helps the business continue running in the event of a disaster impacting a subset of their application architecture. For customers using Kafka as a core streaming and messaging service in their applications, planning for DR for their Kafka infrastructure is an essential part of meeting goals for their application Recovery Time Objective (RTO) and Recovery Point Objective (RPO).

Amazon MSK is a fully managed service that makes it straightforward to build and run Kafka to process streaming data. Amazon MSK provides high availability by offering multi-AZ configurations to distribute brokers across multiple Availability Zones within an AWS Region. A single MSK cluster deployment provides message durability through intra-cluster data replication. Data replication with a replication factor of 3 and min-ISR value of 2 along with the producer setting acks=all provides the strongest availability guarantees, because it makes sure other brokers in the cluster acknowledge receiving the data before the leader broker responds to the producer. This design provides robust protection against single broker failure as well as single-AZ failure.
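The interaction between the replication factor and min-ISR can be made concrete with a small calculation. With acks=all, a partition keeps accepting writes as long as the number of in-sync replicas is at least min.insync.replicas, so the number of replicas that can be unavailable is the difference between the two settings. The function name below is hypothetical.

```python
def tolerated_replica_failures(replication_factor, min_insync_replicas):
    """Replicas that can be lost while acks=all producers can still write."""
    if not 1 <= min_insync_replicas <= replication_factor:
        raise ValueError("min.insync.replicas must be between 1 and the replication factor")
    return replication_factor - min_insync_replicas


# The configuration recommended above: replication factor 3, min-ISR 2.
print(tolerated_replica_failures(3, 2))
# Setting min-ISR equal to the replication factor leaves no headroom.
print(tolerated_replica_failures(3, 3))
```

With replicas spread across three Availability Zones, a tolerance of one replica is what lets the cluster keep accepting writes through a single broker or single-AZ failure.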

For enhanced resilience within a single Region, Amazon MSK also offers Express brokers, which significantly improve Kafka cluster reliability, throughput, and recovery times. Express brokers include pay-as-you-go storage, automatic best-practice reliability configurations, no maintenance windows, and faster broker scaling and recovery times. This architecture reduces recovery time, minimizes the chance of errors with misconfigurations, and increases throughput, making your Kafka clusters more resilient across Availability Zones.

However, if an unlikely issue is impacting your applications or infrastructure across more than one Availability Zone, the architecture outlined in this post can help you prepare, respond, and recover from it.

For companies that can withstand a longer RTO but require a lower RPO on Amazon MSK, backing up data to Amazon Simple Storage Service (Amazon S3) is sufficient as a DR plan. This approach requires you to think through how to handle restarting the application after a DR failover. In this approach, you build a system to recover the data from Amazon S3 to Kafka topics (as described in Back up and restore Kafka topic data using Amazon MSK Connect). Depending on the volume of data being restored, it might take a long time to recover in this scenario. Additionally, you must consider how to handle consumer group offsets, and whether to allow applications to consume from the latest offset in the restored Kafka topics. Due to the high RTO, as well as the complexity and challenges associated with this approach, most streaming use cases rely on the availability of the MSK cluster itself for their business continuity plan. In these cases, setting up MSK clusters in multiple Regions and configuring data replication between clusters provides the required business resilience and continuity.
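To get a feel for the restore time mentioned above, a back-of-the-envelope estimate can help. The function below is a hypothetical helper, and the data volume and throughput are made-up figures; real restore throughput depends on the connector, cluster size, and network.

```python
def estimate_restore_hours(data_gb: float, restore_mb_per_s: float) -> float:
    """Hours needed to replay data_gb into Kafka at a sustained restore_mb_per_s."""
    if restore_mb_per_s <= 0:
        raise ValueError("restore throughput must be positive")
    seconds = (data_gb * 1024) / restore_mb_per_s  # GB -> MB, then divide by MB/s
    return seconds / 3600


# Hypothetical example: 10 TB of topic data restored at a sustained 100 MB/s.
hours = estimate_restore_hours(data_gb=10 * 1024, restore_mb_per_s=100)
print(f"Estimated restore time: {hours:.1f} hours")
```

An estimate like this only covers data movement. The time to provision the cluster, reconcile consumer group offsets, and restart applications adds to the effective RTO.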

Choosing the right replication solution: MSK Replicator vs MirrorMaker 2

AWS recommends two primary solutions for cross-Region Kafka replication: MSK Replicator and MirrorMaker 2. Understanding when to use each solution is crucial for designing an effective DR strategy.

MSK Replicator: For most MSK cluster replications in the same account

MSK Replicator is a fully managed, serverless Kafka replication service that makes it straightforward to reliably replicate data across MSK clusters in different Regions or within the same Region. MSK Replicator is the recommended solution for application scenarios replicating data within the same AWS account. MSK Replicator has the following benefits:

Replication between MSK clusters – It supports replicating between MSK clusters in the same AWS account (including active-active or active-passive DR architectures for Amazon MSK)
No infrastructure management – It’s fully serverless with automatic scaling and straightforward setup through the AWS Management Console, AWS Command Line Interface (AWS CLI), or APIs
Built-in monitoring – It’s integrated with Amazon CloudWatch metrics and logging
Built-in high availability – As a managed service, it offers built-in fault tolerance across Availability Zones

MirrorMaker 2: For migrations and complex and hybrid scenarios

MirrorMaker 2 (MM2) remains the preferred solution for specific use cases that require more flexibility or involve non-Amazon MSK environments. MM2 is a utility bundled as part of Kafka that helps replicate data between Kafka clusters using the Kafka Connect framework.

We recommend MirrorMaker 2 for the following use cases:

Cross-account replication – Replicating data between MSK clusters in different AWS accounts
Migrations to Amazon MSK – Migrating from existing Kafka clusters on premises, in other clouds, or on self-managed Amazon Elastic Compute Cloud (Amazon EC2) deployments
Cross-cloud or hybrid cloud scenarios – Replicating between Kafka running on-premises or on different cloud providers and Amazon MSK for disaster recovery or data analytics use cases
Using mTLS or SASL/SCRAM authentication – When you need mutual TLS certificate-based or SASL/SCRAM authentication and can’t enable AWS Identity and Access Management (IAM) authentication in your MSK cluster (for replication from one MSK cluster to another in these scenarios, you can still use MSK Replicator by enabling IAM authentication in addition to existing authentication methods)
Custom replication policies – Advanced topic naming or transformation requirements
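The selection criteria in the two lists above can be summarized as a small decision helper. This is only a sketch of the guidance in this post; the class, field names, and function are hypothetical, and real decisions usually involve more factors.

```python
from dataclasses import dataclass


@dataclass
class ReplicationScenario:
    source_is_msk: bool
    target_is_msk: bool
    same_aws_account: bool
    iam_auth_possible: bool  # IAM auth can be enabled on the MSK clusters
    needs_custom_replication_policy: bool = False


def recommend(scenario: ReplicationScenario) -> str:
    """Return the replication tool suggested by the criteria in this post."""
    if not (scenario.source_is_msk and scenario.target_is_msk):
        return "MirrorMaker 2"  # migrations, hybrid, or cross-cloud replication
    if not scenario.same_aws_account:
        return "MirrorMaker 2"  # cross-account replication
    if not scenario.iam_auth_possible:
        return "MirrorMaker 2"  # mTLS or SASL/SCRAM only
    if scenario.needs_custom_replication_policy:
        return "MirrorMaker 2"  # advanced topic naming or transformations
    return "MSK Replicator"


print(recommend(ReplicationScenario(True, True, True, True)))
print(recommend(ReplicationScenario(False, True, True, True)))
```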

In the following sections, we discuss the architecture and deployment approaches for use cases where MSK Replicator and MirrorMaker 2 are the appropriate choices.

MSK Replicator solution overview

The following diagram illustrates the architecture for using MSK Replicator.

We create two MSK clusters – one in the primary Region, the other in the secondary Region as a standby cluster for disaster recovery. We deploy MSK Replicator in the secondary Region to replicate topics, ACLs, data, and consumer group offsets from the primary cluster. In this solution, we showcase a single-direction replication for active-passive disaster recovery. This solution can also be extended for active-active disaster recovery scenarios. Our Kafka clients connect to the primary cluster and can be configured to connect to the secondary cluster in the event of a disaster recovery failover.
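One simple way to make clients switchable during a failover is to resolve the bootstrap servers from configuration instead of hard-coding them. The following is a minimal sketch of that idea; the endpoint strings, the ACTIVE_KAFKA_CLUSTER environment variable, and the function name are all hypothetical.

```python
import os

# Hypothetical bootstrap strings; replace them with your clusters' endpoints.
CLUSTERS = {
    "primary": "b-1.primary.example.com:9098,b-2.primary.example.com:9098",
    "secondary": "b-1.secondary.example.com:9098,b-2.secondary.example.com:9098",
}


def select_bootstrap_servers(active=None, clusters=None):
    """Return the bootstrap servers for the cluster that is currently active.

    If `active` is not given, it is read from the (hypothetical)
    ACTIVE_KAFKA_CLUSTER environment variable and defaults to "primary".
    """
    clusters = CLUSTERS if clusters is None else clusters
    active = active or os.environ.get("ACTIVE_KAFKA_CLUSTER", "primary")
    if active not in clusters:
        raise ValueError(f"unknown cluster: {active!r}")
    return clusters[active]


print(select_bootstrap_servers("primary"))
print(select_bootstrap_servers("secondary"))  # after a DR failover
```

With this pattern, failing over the clients becomes a configuration change followed by a restart, not a code change.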

For details on implementation steps, refer to Introducing Amazon MSK Replicator – Fully Managed Replication across MSK Clusters in Same or Different AWS Regions. For details on disaster recovery scenarios, refer to Use replication to increase the resiliency of a Kafka streaming application across Regions. These resources provide the following benefits:

Full deployment steps – Step-by-step deployment process for MSK Replicator between Regions
Comprehensive examples – Multiple deployment scenarios and configurations
Failover process – Key steps in executing a disaster recovery failover when using MSK Replicator

MirrorMaker 2 solution overview

The following diagram illustrates the architecture for using MirrorMaker 2.

We create an MSK cluster in the primary Region, with the existing Kafka cluster on premises. The same approach applies to Kafka clusters running in other clouds or self-managed on Amazon EC2. In this solution, we showcase a single-direction replication for cluster migration scenarios. Our Kafka clients interact with the on-premises Kafka cluster and can be migrated to run on AWS to interact with the MSK cluster.
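For orientation, a single-direction MirrorMaker 2 flow can be described with a handful of properties. The sketch below renders them in the properties-file form used by Kafka's standalone connect-mirror-maker.sh driver. The cluster aliases (source, target), the bootstrap addresses, and the helper function are assumptions for illustration; a Kafka Connect based deployment expresses equivalent settings in connector configurations instead.

```python
def render_mm2_properties(source_bootstrap, target_bootstrap,
                          topics=".*", replication_factor=3):
    """Render a one-way (source -> target) MirrorMaker 2 properties file."""
    lines = [
        "clusters = source, target",
        f"source.bootstrap.servers = {source_bootstrap}",
        f"target.bootstrap.servers = {target_bootstrap}",
        # Replicate in one direction only: on-premises -> Amazon MSK.
        "source->target.enabled = true",
        f"source->target.topics = {topics}",
        "target->source.enabled = false",
        # Replication factor for topics that MM2 creates on the target.
        f"replication.factor = {replication_factor}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical endpoints for the on-premises cluster and the MSK cluster.
print(render_mm2_properties("onprem-kafka.example.com:9092",
                            "b-1.msk.example.com:9098"))
```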

Rather than manually configuring each component, we recommend using the automated deployment resources available in the following GitHub repository. For a step-by-step walkthrough of deploying MirrorMaker 2 on Amazon ECS with Fargate using auto scaling, refer to Amazon MSK Migration Workshop: Modernizing with Express Brokers. These resources provide the following benefits:

Infrastructure as code – Terraform for MSK clusters and supporting infrastructure
Containerized Kafka Connect – Docker images optimized for AWS
Amazon ECS with AWS Fargate deployment – Scalable, serverless container deployment using Amazon Elastic Container Service (Amazon ECS) with AWS Fargate
Auto scaling configuration – Automatic scaling based on workload demands
Comprehensive examples – Multiple deployment scenarios and configurations
Migration process – Key steps in executing a Kafka migration using MM2

Conclusion

Choosing the right replication solution depends on your specific requirements. We recommend using MSK Replicator when replicating from one MSK cluster to another and you want a fully managed solution for disaster recovery. MirrorMaker 2 is recommended for migrations to Amazon MSK, hybrid environments, or when you need complex custom replication policies.

For MSK Replicator deployments, refer to Introducing Amazon MSK Replicator – Fully Managed Replication across MSK Clusters in Same or Different AWS Regions and Use replication to increase the resiliency of a Kafka streaming application across Regions.

For MirrorMaker 2 deployments, refer to the GitHub repository and Amazon MSK Migration Workshop to implement production-ready solutions with automated deployment, monitoring, and scaling capabilities.

These approaches provide a customizable set of options for data redundancy and business continuity capabilities needed to meet regulatory compliance and disaster recovery requirements, while minimizing operational overhead through automation and best practices.

Introducing Amazon Kinesis Data Analytics Studio – Quickly Interact with Streaming Data Using SQL, Python, or Scala

2021-05-27 Danilo Poccia

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/introducing-amazon-kinesis-data-analytics-studio-quickly-interact-with-streaming-data-using-sql-python-or-scala/

The best way to get timely insights and react quickly to new information you receive from your business and your applications is to analyze streaming data. This is data that must usually be processed sequentially and incrementally on a record-by-record basis or over sliding time windows, and can be used for a variety of analytics including correlations, aggregations, filtering, and sampling.

To make it easier to analyze streaming data, today we are pleased to introduce Amazon Kinesis Data Analytics Studio.

Now, from the Amazon Kinesis console you can select a Kinesis data stream and with a single click start a Kinesis Data Analytics Studio notebook powered by Apache Zeppelin and Apache Flink to interactively analyze data in the stream. Similarly, you can select a cluster in the Amazon Managed Streaming for Apache Kafka console to start a notebook to analyze data in Apache Kafka streams. You can also start a notebook from the Kinesis Data Analytics Studio console and connect to custom sources.

In the notebook, you can interact with streaming data and get results in seconds using SQL queries and Python or Scala programs. When you are satisfied with your results, with a few clicks you can promote your code to a production stream processing application that runs reliably at scale with no additional development effort.

For new projects, we recommend that you use the new Kinesis Data Analytics Studio over Kinesis Data Analytics for SQL Applications. Kinesis Data Analytics Studio combines ease of use with advanced analytical capabilities, which makes it possible to build sophisticated stream processing applications in minutes. Let’s see how that works in practice.

Using Kinesis Data Analytics Studio to Analyze Streaming Data
I want to get a better understanding of the data sent by some sensors to a Kinesis data stream.

To simulate the workload, I use this random_data_generator.py Python script. You don’t need to know Python to use Kinesis Data Analytics Studio. In fact, I am going to use SQL in the following steps. Also, you can avoid any coding and use the Amazon Kinesis Data Generator user interface (UI) to send test data to Kinesis Data Streams or Kinesis Data Firehose. I am using a Python script to have finer control over the data that is being sent.

import datetime
import json
import random
import boto3

STREAM_NAME = "my-input-stream"


def get_random_data():
    current_temperature = round(10 + random.random() * 170, 2)
    if current_temperature > 160:
        status = "ERROR"
    elif current_temperature > 140 or random.randrange(1, 100) > 80:
        status = random.choice(["WARNING","ERROR"])
    else:
        status = "OK"
    return {
        'sensor_id': random.randrange(1, 100),
        'current_temperature': current_temperature,
        'status': status,
        'event_time': datetime.datetime.now().isoformat()
    }


def send_data(stream_name, kinesis_client):
    while True:
        data = get_random_data()
        partition_key = str(data["sensor_id"])
        print(data)
        kinesis_client.put_record(
            StreamName=stream_name,
            Data=json.dumps(data),
            PartitionKey=partition_key)


if __name__ == '__main__':
    kinesis_client = boto3.client('kinesis')
    send_data(STREAM_NAME, kinesis_client)

This script sends random records to my Kinesis data stream as JSON and prints each record (as a Python dictionary) before sending it. For example:

{'sensor_id': 77, 'current_temperature': 93.11, 'status': 'OK', 'event_time': '2021-05-19T11:20:00.978328'}
{'sensor_id': 47, 'current_temperature': 168.32, 'status': 'ERROR', 'event_time': '2021-05-19T11:20:01.110236'}
{'sensor_id': 9, 'current_temperature': 140.93, 'status': 'WARNING', 'event_time': '2021-05-19T11:20:01.243881'}
{'sensor_id': 27, 'current_temperature': 130.41, 'status': 'OK', 'event_time': '2021-05-19T11:20:01.371191'}

From the Kinesis console, I select a Kinesis data stream (my-input-stream) and choose Process data in real time from the Process drop-down. In this way, the stream is configured as a source for the notebook.

Then, in the following dialog box, I create an Apache Flink – Studio notebook.

I enter a name (my-notebook) and a description for the notebook. The AWS Identity and Access Management (IAM) permissions to read from the Kinesis data stream I selected earlier (my-input-stream) are automatically attached to the IAM role assumed by the notebook.

I choose Create to open the AWS Glue console and create an empty database. Back in the Kinesis Data Analytics Studio console, I refresh the list and select the new database. It will define the metadata for my sources and destinations. From here, I can also review the default Studio notebook settings. Then, I choose Create Studio notebook.

Now that the notebook has been created, I choose Run.

When the notebook is running, I choose Open in Apache Zeppelin to get access to the notebook and write code in SQL, Python, or Scala to interact with my streaming data and get insights in real time.

In the notebook, I create a new note and call it Sensors. Then, I create a sensor_data table describing the format of the data in the stream:

%flink.ssql

CREATE TABLE sensor_data (
    sensor_id INTEGER,
    current_temperature DOUBLE,
    status VARCHAR(6),
    event_time TIMESTAMP(3),
    WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
)
PARTITIONED BY (sensor_id)
WITH (
    'connector' = 'kinesis',
    'stream' = 'my-input-stream',
    'aws.region' = 'us-east-1',
    'scan.stream.initpos' = 'LATEST',
    'format' = 'json',
    'json.timestamp-format.standard' = 'ISO-8601'
)

The first line in the previous command tells Apache Zeppelin to provide a stream SQL environment (%flink.ssql) for the Apache Flink interpreter. I can also interact with the streaming data using a batch SQL environment (%flink.bsql), or Python (%flink.pyflink) or Scala (%flink) code.

The first part of the CREATE TABLE statement is familiar to anyone who has used SQL with a database. A table is created to store the sensor data in the stream. The WATERMARK option is used to measure progress in the event time, as described in the Event Time and Watermarks section of the Apache Flink documentation.
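To make the WATERMARK clause more concrete, here is a small Python model of what "event_time - INTERVAL '5' SECOND" expresses: the watermark trails the largest event time seen so far by 5 seconds, and events at or behind the watermark count as late. The class is hypothetical and simplified; in Flink, watermarks are emitted periodically and what happens to late events depends on the operator.

```python
from datetime import datetime, timedelta

MAX_OUT_OF_ORDERNESS = timedelta(seconds=5)  # mirrors INTERVAL '5' SECOND


class WatermarkTracker:
    """Tracks a watermark that trails the maximum observed event time."""

    def __init__(self):
        self.max_event_time = None

    def observe(self, event_time):
        """Record an event and return the watermark after seeing it."""
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        return self.max_event_time - MAX_OUT_OF_ORDERNESS

    def is_late(self, event_time):
        """True if the event is at or behind the current watermark."""
        if self.max_event_time is None:
            return False
        return event_time <= self.max_event_time - MAX_OUT_OF_ORDERNESS


tracker = WatermarkTracker()
t0 = datetime(2021, 5, 19, 11, 20, 0)
print(tracker.observe(t0))                          # watermark trails t0 by 5 seconds
print(tracker.observe(t0 + timedelta(seconds=10)))  # watermark advances
print(tracker.is_late(t0 + timedelta(seconds=3)))   # behind the watermark
```

Events that arrive out of order by less than 5 seconds are still assigned to their windows, which is why the delay is a trade-off between completeness and latency.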

The second part of the CREATE TABLE statement describes the connector used to receive data in the table (for example, kinesis or kafka), the name of the stream, the AWS Region, the overall data format of the stream (such as json or csv), and the syntax used for timestamps (in this case, ISO 8601). I can also choose the starting position for processing the stream; I am using LATEST to start reading from the most recent records.

When the table is ready, I find it in the AWS Glue Data Catalog database I selected when I created the notebook:

Now I can run SQL queries on the sensor_data table and use sliding or tumbling windows to get a better understanding of what is happening with my sensors.

For an overview of the data in the stream, I start with a simple SELECT to get all the content of the sensor_data table:

%flink.ssql(type=update)

SELECT * FROM sensor_data;

This time the first line of the command has a parameter (type=update) so that the output of the SELECT, which is more than one row, is continuously updated when new data arrives.

On the terminal of my laptop, I start the random_data_generator.py script:

$ python3 random_data_generator.py

At first I see a table that contains the data as it comes. To get a better understanding, I select a bar graph view. Then, I group the results by status to see their average current_temperature, as shown here:

As expected from the way I generate the data, I have different average temperatures depending on the status (OK, WARNING, or ERROR). The higher the temperature, the greater the probability that something is not working correctly with my sensors.
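The ordering of the averages follows directly from the generator logic. The sketch below repeats the status rules from random_data_generator.py offline, without Kinesis, and averages the temperature per status. The helper names and the fixed seed are my own additions.

```python
import random
from collections import defaultdict


def get_status(current_temperature, rng):
    """Same status rules as get_random_data() in random_data_generator.py."""
    if current_temperature > 160:
        return "ERROR"
    if current_temperature > 140 or rng.randrange(1, 100) > 80:
        return rng.choice(["WARNING", "ERROR"])
    return "OK"


def average_by_status(samples, seed=0):
    """Generate `samples` readings and return the mean temperature per status."""
    rng = random.Random(seed)
    totals = defaultdict(lambda: [0.0, 0])  # status -> [sum, count]
    for _ in range(samples):
        temperature = round(10 + rng.random() * 170, 2)
        status = get_status(temperature, rng)
        totals[status][0] += temperature
        totals[status][1] += 1
    return {status: total / count for status, (total, count) in totals.items()}


for status, average in sorted(average_by_status(20000).items()):
    print(f"{status}: {average:.1f}")
```

OK readings are limited to temperatures up to 140, WARNING readings mix that range with the 140 to 160 band, and ERROR readings also include everything above 160, so the averages rise in that order.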

I can run the aggregated query explicitly using SQL. This time, I want the result computed on a sliding window of 1 minute with results updated every 10 seconds. To do so, I am using the HOP function in the GROUP BY section of the SELECT statement. To add the time to the output of the select, I use the HOP_ROWTIME function. For more information, see how group window aggregations work in the Apache Flink documentation.

%flink.ssql(type=update)

SELECT sensor_data.status,
       COUNT(*) AS num,
       AVG(sensor_data.current_temperature) AS avg_current_temperature,
       HOP_ROWTIME(event_time, INTERVAL '10' second, INTERVAL '1' minute) as hop_time
  FROM sensor_data
 GROUP BY HOP(event_time, INTERVAL '10' second, INTERVAL '1' minute), sensor_data.status;
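To see what the HOP grouping in the query above does, the following Python sketch lists the sliding windows that a single event belongs to, with a 10-second slide and a 1-minute size. It illustrates the windowing concept with window starts aligned to the epoch; it is not Flink's implementation, and the function name is hypothetical.

```python
from datetime import datetime, timedelta

SLIDE = timedelta(seconds=10)  # INTERVAL '10' second
SIZE = timedelta(minutes=1)    # INTERVAL '1' minute
EPOCH = datetime(1970, 1, 1)


def hop_windows(event_time, slide=SLIDE, size=SIZE):
    """Return (start, end) of every sliding window [start, end) containing event_time."""
    # Start of the latest window that begins at or before the event.
    start = event_time - (event_time - EPOCH) % slide
    windows = []
    while start > event_time - size:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


for start, end in hop_windows(datetime(2021, 5, 19, 11, 20, 3)):
    print(start.time(), "->", end.time())
```

Because the window size is six times the slide, each event contributes to six overlapping windows, which is why every record influences six consecutive result rows per status.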

This time, I look at the results in table format:

To send the result of the query to a destination stream, I create a table and connect the table to the stream. First, I need to give permissions to the notebook to write into the stream.

In the Kinesis Data Analytics Studio console, I select my-notebook. Then, in the Studio notebooks details section, I choose Edit IAM permissions. Here, I can configure the sources and destinations used by the notebook and the IAM role permissions are updated automatically.

In the Included destinations in IAM policy section, I choose the destination and select my-output-stream. I save changes and wait for the notebook to be updated. I am now ready to use the destination stream.

In the notebook, I create a sensor_state table connected to my-output-stream.

%flink.ssql

CREATE TABLE sensor_state (
    status VARCHAR(6),
    num INTEGER,
    avg_current_temperature DOUBLE,
    hop_time TIMESTAMP(3)
)
WITH (
    'connector' = 'kinesis',
    'stream' = 'my-output-stream',
    'aws.region' = 'us-east-1',
    'scan.stream.initpos' = 'LATEST',
    'format' = 'json',
    'json.timestamp-format.standard' = 'ISO-8601'
);

I now use this INSERT INTO statement to continuously insert the result of the select into the sensor_state table.

%flink.ssql(type=update)

INSERT INTO sensor_state
SELECT sensor_data.status,
    COUNT(*) AS num,
    AVG(sensor_data.current_temperature) AS avg_current_temperature,
    HOP_ROWTIME(event_time, INTERVAL '10' second, INTERVAL '1' minute) as hop_time
FROM sensor_data
GROUP BY HOP(event_time, INTERVAL '10' second, INTERVAL '1' minute), sensor_data.status;

The data is also sent to the destination Kinesis data stream (my-output-stream) so that it can be used by other applications. For example, the data in the destination stream can be used to update a real-time dashboard, or to monitor the behavior of my sensors after a software update.

I am satisfied with the result. I want to deploy this query and its output as a Kinesis Analytics application. To do so, I need to provide an S3 location to store the application executable.

In the configuration section of the console, I edit the Deploy as application configuration settings. There, I choose a destination bucket in the same Region and save changes.

I wait for the notebook to be ready after the update. Then, I create a SensorsApp note in my notebook and copy the statements that I want to execute as part of the application. The tables have already been created, so I just copy the INSERT INTO statement above.

From the menu at the top right of my notebook, I choose Build SensorsApp and export to Amazon S3 and confirm the application name.

When the export is ready, I choose Deploy SensorsApp as Kinesis Analytics application in the same menu. After that, I fine-tune the configuration of the application. I set parallelism to 1 because I have only one shard in my input Kinesis data stream and not a lot of traffic. Then, I run the application, without having to write any code.

From the Kinesis Data Analytics applications console, I choose Open Apache Flink dashboard to get more information about the execution of my application.

Availability and Pricing
You can use Amazon Kinesis Data Analytics Studio today in all AWS Regions where Kinesis Data Analytics is generally available. For more information, see the AWS Regional Services List.

In Kinesis Data Analytics Studio, we run the open-source versions of Apache Zeppelin and Apache Flink, and we contribute changes upstream. For example, we have contributed bug fixes for Apache Zeppelin, and we have contributed to AWS connectors for Apache Flink, such as those for Kinesis Data Streams and Kinesis Data Firehose. Also, we are working with the Apache Flink community to contribute availability improvements, including automatic classification of errors at runtime to understand whether errors are in user code or in application infrastructure.

With Kinesis Data Analytics Studio, you pay based on the average number of Kinesis Processing Units (KPU) per hour, including those used by your running notebooks. One KPU comprises 1 vCPU of compute, 4 GB of memory, and associated networking. You also pay for running application storage and durable application storage. For more information, see the Kinesis Data Analytics pricing page.

Start using Kinesis Data Analytics Studio today to get better insights from your streaming data.

— Danilo

Amazon MSK backup for Archival, Replay, or Analytics

2021-02-19 Rohit Yadav

Post Syndicated from Rohit Yadav original https://aws.amazon.com/blogs/architecture/amazon-msk-backup-for-archival-replay-or-analytics/

Amazon MSK is a fully managed service that helps you build and run applications that use Apache Kafka to process streaming data. Apache Kafka is an open-source platform for building real-time streaming data pipelines and applications. With Amazon MSK, you can use native Apache Kafka APIs to populate data lakes. You can also stream changes to and from databases, and power machine learning and analytics applications.

Amazon MSK simplifies the setup, scaling, and management of clusters running Apache Kafka. MSK manages the provisioning, configuration, and maintenance of resources for highly available Kafka clusters. It is fully compatible with Apache Kafka and supports familiar community-built tools such as MirrorMaker 2.0, Kafka Connect, and Kafka Streams.

Introduction

In the past few years, the volume of data that companies must ingest has increased significantly. Information comes from various sources, like transactional databases, system logs, SaaS platforms, mobile, and IoT devices. Businesses want to act as soon as the data arrives. This has resulted in increased adoption of scalable real-time streaming solutions. These solutions scale horizontally to provide the needed throughput to process data in real time, with milliseconds of latency. Customers have adopted Amazon MSK as a top choice of streaming platforms. Amazon MSK gives you the flexibility to retain topic data for the longer term (the default retention is 7 days). This supports replay, analytics, and machine learning-based use cases. When IT and business systems are producing and processing terabytes of data per hour, it can become expensive to store, manage, and retrieve data. This has led to legacy data archival processes moving towards cheaper, reliable, and long-term storage solutions like Amazon Simple Storage Service (Amazon S3).

Following are some of the benefits of archiving Amazon MSK topic data to Amazon S3:

Reduced Cost – You need to retain data in the cluster only for as long as your Recovery Point Objective (RPO) requires. Any historical data can be archived in Amazon S3 and replayed if necessary.
Integration with Enterprise Data Lake – Since your data is available in S3, you can now integrate with other data analytics services, such as Amazon EMR, AWS Glue, and Amazon Athena, to run data aggregation and analytics. For example, you can build reports to visualize month over month changes.
Optimize Machine Learning Workloads – Machine learning applications will be able to train new models and improve predictions using historical streams of data available in Amazon S3. This also enables better integration with Amazon Machine Learning services.
Compliance – Long-term data archival for regulatory and security compliance.
Backloading data to other systems – Ability to rebuild data into other application environments such as pre-prod, testing, and more.

There are many benefits to using Amazon S3 as long-term storage for Amazon MSK topics. Let’s dive deeper into the recommended architecture for this pattern. We will present an architecture to back up Amazon MSK topics to Amazon S3 in real time. In addition, we’ll demonstrate some of the use cases previously mentioned.

Architecture

The following diagram illustrates the architecture for building a real-time archival pipeline to archive Amazon MSK topics to S3. This architecture uses an AWS Lambda function to process records from your Amazon MSK cluster when the cluster is configured as an event source. With Lambda as the consumer, you don’t need to worry about infrastructure management or scaling. You only pay for what you consume, so you don’t pay for over-provisioned infrastructure.

To create an event source mapping, you can add your Amazon MSK cluster in a Lambda function trigger. The Lambda service internally polls for new records or messages from the event source, and then synchronously invokes the target Lambda function. Lambda reads the messages in batches from one or more partitions and provides these to your function as an event payload. The function then processes records, and sends the payload to an Amazon Kinesis Data Firehose delivery stream. We use a Kinesis Data Firehose delivery stream because it can natively batch, compress, transform, and encrypt your events before loading to S3.
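The function described above could look roughly like the following. This is a minimal sketch, not the code of the AWS solution: the delivery stream name and the helper names are hypothetical, and it assumes the MSK event payload groups records under a "records" key with base64-encoded values. A production function should also inspect FailedPutCount in the PutRecordBatch response and retry failed records.

```python
import base64

FIREHOSE_MAX_BATCH = 500                # PutRecordBatch accepts up to 500 records per call
DELIVERY_STREAM = "msk-archive-stream"  # hypothetical delivery stream name


def extract_values(event):
    """Decode the base64 record values from an MSK event payload."""
    values = []
    # Records are grouped by "topic-partition" keys in the event payload.
    for records in event.get("records", {}).values():
        for record in records:
            values.append(base64.b64decode(record["value"]))
    return values


def forward_to_firehose(values, firehose, stream_name=DELIVERY_STREAM):
    """Send values to Firehose in batches; return the number of records sent."""
    sent = 0
    for i in range(0, len(values), FIREHOSE_MAX_BATCH):
        # A trailing newline keeps one record per line in the archived files.
        batch = [{"Data": value + b"\n"} for value in values[i:i + FIREHOSE_MAX_BATCH]]
        firehose.put_record_batch(DeliveryStreamName=stream_name, Records=batch)
        sent += len(batch)
    return sent


def lambda_handler(event, context):
    import boto3  # imported lazily so the helpers can be tested without AWS access
    return forward_to_firehose(extract_values(event), boto3.client("firehose"))
```

Passing the Firehose client as an argument keeps the batching logic testable with a stub, while lambda_handler wires in the real client.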

In this architecture, Kinesis Data Firehose delivers the records received from Lambda as Gzip files to Amazon S3. Kinesis Data Firehose partitions these files using a Hive-style prefix format:

data/year=yyyy/month=MM/day=dd/hour=HH
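As a small illustration of this layout, the following helper builds the Hive-style prefix for a given timestamp. The function name and the data root are assumptions for this sketch; Firehose generates the prefix itself, so code like this is mainly useful for locating objects later.

```python
from datetime import datetime, timezone


def hive_prefix(ts, root="data"):
    """Build the Hive-style S3 prefix for the hour that contains ts."""
    return f"{root}/year={ts:%Y}/month={ts:%m}/day={ts:%d}/hour={ts:%H}/"


print(hive_prefix(datetime(2021, 2, 19, 7, 30, tzinfo=timezone.utc)))
```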

Figure 1. Archival Architecture

Let’s review some of the possible solutions that can be built on this archived data.

Integration with Enterprise Data Lake

The following architecture diagram shows how you can integrate the archived data in Amazon S3 with your Enterprise Data Lake. Since the data files are prefixed in Hive-style format, you can register the data as a partitioned table in the AWS Glue Data Catalog. With partitioning in place, you can perform optimizations like partition pruning, which enables predicate pushdown for improved performance of your analytics queries. You can also use AWS Data Analytics services like Amazon EMR and AWS Glue for batch analytics. Amazon Athena can be used to run serverless, interactive SQL queries on the data and to support visualization.
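Partition pruning can be pictured as filtering object keys by their partition values before any file is read. The sketch below shows the idea on a list of keys; the object keys and function names are hypothetical, and query engines such as Athena do this through the Data Catalog, not by scanning key names.

```python
import re

PARTITION_RE = re.compile(r"(\w+)=([^/]+)")


def partition_values(key):
    """Extract Hive-style partition values (name=value path segments) from a key."""
    return dict(PARTITION_RE.findall(key))


def prune(keys, **wanted):
    """Keep only the keys whose partition values match every requested value."""
    return [
        key for key in keys
        if all(partition_values(key).get(name) == value
               for name, value in wanted.items())
    ]


keys = [
    "data/year=2021/month=02/day=19/hour=07/part-0001.gz",
    "data/year=2021/month=03/day=01/hour=00/part-0002.gz",
    "data/year=2020/month=02/day=19/hour=07/part-0003.gz",
]
print(prune(keys, year="2021", month="02"))
```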

Data is currently stored in JSON files. The following are some of the services and tools that can be integrated with your archive for reporting, analytics, visualization, and machine learning requirements.

Figure 2. Analytics Architecture

Cloning data into other application environments

There are use cases where you would want to use this archive to clone data into other application environments, such as new Amazon MSK clusters.

These clusters could be used for testing or debugging purposes. You could decide to use only a subset of your data from the archive. Let’s say you want to debug an issue beyond the configured retention period, but not replicate all the data to your testing environment. With archived data in S3, you can build downstream jobs to filter data that can be loaded into a new Amazon MSK cluster. The following diagram highlights this pattern:
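A downstream filter job of this kind might start from something like the following sketch. It assumes each archived object is a gzipped file with one JSON record per line, which depends on how the records were written to Firehose, and the field names in the example are hypothetical. Producing the filtered records to the new cluster is left out.

```python
import gzip
import json


def filter_archived_records(gz_bytes, predicate):
    """Yield JSON records from a gzipped, newline-delimited archive object
    for which predicate(record) is true."""
    text = gzip.decompress(gz_bytes).decode("utf-8")
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        if predicate(record):
            yield record


# Hypothetical archive content: keep only the records of one customer.
archive = gzip.compress(
    b'{"customer": "a", "amount": 10}\n'
    b'{"customer": "b", "amount": 20}\n'
    b'{"customer": "a", "amount": 30}\n'
)
subset = list(filter_archived_records(archive, lambda r: r["customer"] == "a"))
print(len(subset), "records selected for replay")
```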

Figure 3. Replay Architecture

Ready for a Test Drive

To help you get started, we would like to introduce an AWS Solution: AWS Streaming Data Solution for Amazon MSK (scroll down and see Option 3 tab). There is a single-click AWS CloudFormation template, which can assist you in quickly provisioning resources. This will get your real-time archival pipeline for Amazon MSK up and running quickly. This solution shortens your development time by removing or reducing the need for you to:

Model and provision resources using AWS CloudFormation
Set up Amazon CloudWatch alarms, dashboards, and logging
Manually implement streaming data best practices in AWS

This solution is data and logic agnostic, enabling you to start with boilerplate code and start customizing quickly. After deployment, use this solution’s monitoring capabilities to transition easily to production.

Conclusion

In this post, we explained the architecture to build a scalable, highly available real-time archival of Amazon MSK topics to long-term storage in Amazon S3. The architecture was built using Amazon MSK, AWS Lambda, Amazon Kinesis Data Firehose, and Amazon S3. The architecture also illustrates how you can integrate your Amazon MSK streaming data in S3 with your Enterprise Data Lake.

Validate, evolve, and control schemas in Amazon MSK and Amazon Kinesis Data Streams with AWS Glue Schema Registry

2021-01-13 Brian Likosar

Post Syndicated from Brian Likosar original https://aws.amazon.com/blogs/big-data/validate-evolve-and-control-schemas-in-amazon-msk-and-amazon-kinesis-data-streams-with-aws-glue-schema-registry/

Data streaming technologies like Apache Kafka and Amazon Kinesis Data Streams capture and distribute data generated by thousands or millions of applications, websites, or machines. These technologies serve as a highly available transport layer that decouples the data-producing applications from data processors. However, the sheer number of applications producing, processing, routing, and consuming data can make it hard to coordinate and evolve data schemas, like adding or removing a data field, without introducing data quality issues and downstream application failures. Developers often build complex tools, write custom code, or rely on documentation, change management, and Wikis to protect against schema changes. This is quite error prone because it relies too heavily on human oversight. A common solution with data streaming technologies is a schema registry that provides for validation of schema changes to allow for safe evolution as business needs adjust over time.

AWS Glue Schema Registry, a serverless feature of AWS Glue, enables you to validate and reliably evolve streaming data against Apache Avro schemas at no additional charge. Through Apache-licensed serializers and deserializers, the Glue Schema Registry integrates with Java applications developed for Apache Kafka, Amazon Managed Streaming for Apache Kafka (Amazon MSK), Kinesis Data Streams, Apache Flink, Amazon Kinesis Data Analytics for Apache Flink, and AWS Lambda.

This post explains the benefits of using the Glue Schema Registry and provides examples of how to use it with both Apache Kafka and Kinesis Data Streams.

With the Glue Schema Registry, you can eliminate defensive coding and cross-team coordination, improve data quality, reduce downstream application failures, and use a registry that is integrated across multiple AWS services. Each schema can be versioned within the guardrails of a compatibility mode, providing developers the flexibility to reliably evolve schemas. Additionally, the Glue Schema Registry can serialize data into a compressed format, helping you save on data transfer and storage costs.

Although there are many ways to leverage the Glue Schema Registry (including using the API to build your own integrations), in this post, we show two use cases. The Schema Registry is a free feature that can significantly improve data quality and developer productivity. If you use Avro schemas, you should be using the Schema Registry to supplement your solutions built on Apache Kafka (including Amazon MSK) or Kinesis Data Streams. The following diagram illustrates this architecture.

AWS Glue Schema Registry features

Glue Schema Registry has the following features:

Schema discovery – When a producer registers a schema change, metadata can be applied as a key-value pair to provide searchable information for administrators or developers. This metadata can indicate the original source of the data (source=MSK_west), the team name to contact (owner=DataEngineering), or AWS tags (environment=Production). You could potentially encrypt a field in your data on the producing client and use metadata to specify to potential consumer clients which public key fingerprint to use for decryption.
Schema compatibility – The versioning of each schema is governed by a compatibility mode. If a new version of a schema is requested to be registered that breaks the specified compatibility mode, the request fails and an exception is thrown. Compatibility checks enable developers building downstream applications to have a bounded set of scenarios to build applications against, which helps to prepare for the changes without issue. Commonly used modes are FORWARD, BACKWARD, and FULL. For more information about mode definitions, see Schema Versioning and Compatibility.
Schema validation – Glue Schema Registry serializers work to validate that the schema used during data production is compatible. If it isn’t, the data producer receives an exception from the serializer. This ensures that potentially breaking changes are found earlier in development cycles, and can also help prevent unintentional schema changes due to human error.
Auto-registration of schemas – If configured to do so, the producer of data can auto-register schema changes as they flow in the data stream. This is especially useful for use cases where the source of the data is change data capture from a database.
IAM support – Thanks to integrated AWS Identity and Access Management (IAM) support, only authorized producers can change certain schemas. Furthermore, only those consumers authorized to read the schema can do so. Schema changes are typically performed deliberately and with care, so it’s important to use IAM to control who performs these changes. Additionally, access control to schemas is important in situations where you might have sensitive information included in the schema definition itself. In the examples that follow, IAM roles are inferred via the AWS SDK for Java, so they are inherited from the Amazon Elastic Compute Cloud (Amazon EC2) instance’s role that the application runs in. IAM roles can also be applied to any other AWS service that could contain this code, such as containers or Lambda functions.
Integrations and other support – The provided serializers and deserializers are currently for Java clients using Apache Avro for data serialization. The GitHub repo also contains support for Apache Kafka Streams, Apache Kafka Connect, and Apache Flink—all licensed using the Apache License 2.0. We’re already working on additional language and data serialization support, but we need your feedback on what you’d like to see next.
Secondary deserializer – If you have already registered schemas in another schema registry, there’s an option for specifying a secondary deserializer when performing schema lookups. This allows for migrations from other schema registries without having to start anew. If the schema ID being used isn’t known to the Glue Schema Registry, it’s looked for in the secondary deserializer.
Compression – Using the Avro format already reduces message size due to its compact, binary format. Using a schema registry can further reduce data payload by no longer needing to send and receive schemas with each message. Glue Schema Registry libraries also provide an option for zlib compression, which can reduce data requirements even further by compressing the payload of the message. This varies by use case, but compression can reduce the size of the message significantly.
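
To get a rough feel for the compression point above, the following Node.js sketch deflates a small batch of repetitive customer records with the built-in zlib module. It is an illustration only: the Glue Schema Registry serializers compress Avro binary payloads inside the Java library, whereas this sketch uses JSON text, and the sample records are assumptions rather than measurements from the service.

```javascript
const zlib = require('zlib');

// Hypothetical sample payload: repeated customer records, encoded as JSON
// purely for illustration (the registry serializers work on Avro binary).
const records = [];
for (let i = 0; i < 100; i++) {
  records.push({ first_name: 'Ada', last_name: 'Lovelace' });
}
const payload = Buffer.from(JSON.stringify(records));

// Compress with zlib (deflate), then restore the original bytes.
const compressed = zlib.deflateSync(payload);
const restored = zlib.inflateSync(compressed);

console.log('original bytes:  ', payload.length);
console.log('compressed bytes:', compressed.length);
console.log('round trip ok:   ', restored.equals(payload));
```

How much you save depends on how repetitive the payload is, so it is worth measuring with your own messages.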

Example schema

For this post, we use the following schema to begin each of our use cases:

{
 "namespace": "Customer.avro",
 "type": "record",
 "name": "Customer",
 "fields": [
 {"name": "first_name", "type": "string"},
 {"name": "last_name", "type": "string"}
 ]
}
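
As a rough illustration of what it means for a record to conform to this schema, the sketch below checks plain JavaScript objects against the field list. The validate helper is hypothetical and handles only string fields; real Avro libraries do considerably more.

```javascript
// The example schema from this post, as a plain object.
const customerSchema = {
  namespace: 'Customer.avro',
  type: 'record',
  name: 'Customer',
  fields: [
    { name: 'first_name', type: 'string' },
    { name: 'last_name', type: 'string' },
  ],
};

// Hypothetical helper: returns a list of problems, empty when the record fits.
// Only the "string" type is handled, which is all this schema needs.
function validate(schema, record) {
  const problems = [];
  const known = new Set(schema.fields.map((f) => f.name));
  for (const field of schema.fields) {
    if (!(field.name in record)) {
      problems.push(`missing field: ${field.name}`);
    } else if (field.type === 'string' && typeof record[field.name] !== 'string') {
      problems.push(`wrong type for field: ${field.name}`);
    }
  }
  for (const name of Object.keys(record)) {
    if (!known.has(name)) problems.push(`unknown field: ${name}`);
  }
  return problems;
}

console.log(validate(customerSchema, { first_name: 'Ada', last_name: 'Lovelace' }));
console.log(validate(customerSchema, { first_name: 'Ada', full_name: 'Ada Lovelace' }));
```

The second call reports both a missing field and an unknown one, which is the kind of mismatch a schema registry is meant to surface before it reaches consumers.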

Using AWS Glue Schema Registry with Amazon MSK and Apache Kafka

You can use the following Apache Kafka producer code to produce Apache Avro formatted messages to a topic with the preceding schema:

package com.amazon.gsrkafka;

import com.amazonaws.services.schemaregistry.serializers.avro.AWSKafkaAvroSerializer;
import com.amazonaws.services.schemaregistry.utils.AWSSchemaRegistryConstants;
import org.apache.avro.Schema;
import org.apache.avro.Schema.Parser;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.SerializationException;
import org.apache.kafka.common.serialization.StringSerializer;
import software.amazon.awssdk.services.glue.model.Compatibility;
import java.io.File;
import java.io.IOException;
import java.util.Properties;

public class gsrkafka {
    private static final Properties properties = new Properties();
    private static final String topic = "test";

    public static void main(final String[] args) throws IOException {
        // Set the default synchronous HTTP client to UrlConnectionHttpClient
        System.setProperty("software.amazon.awssdk.http.service.impl", "software.amazon.awssdk.http.urlconnection.UrlConnectionSdkHttpService");
        properties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Keys are plain strings; only the value is serialized as Avro through the registry
        properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, AWSKafkaAvroSerializer.class.getName());
        properties.put(AWSSchemaRegistryConstants.AWS_REGION, "us-east-2");
        properties.put(AWSSchemaRegistryConstants.REGISTRY_NAME, "liko-schema-registry");
        properties.put(AWSSchemaRegistryConstants.SCHEMA_NAME, "customer");
        properties.put(AWSSchemaRegistryConstants.COMPATIBILITY_SETTING, Compatibility.FULL);
        properties.put(AWSSchemaRegistryConstants.SCHEMA_AUTO_REGISTRATION_SETTING, true);

        // Customer.avsc must hold the schema version being produced. The full_name
        // field set below exists only in the evolved schema shown later in this post;
        // with the initial two-field schema, remove the full_name lines, because
        // putting a field that is not in the schema throws an exception.
        Schema schema_customer = new Parser().parse(new File("Customer.avsc"));
        GenericRecord customer = new GenericData.Record(schema_customer);

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<String, GenericRecord>(properties)) {
            // The record wraps the same customer object, which is updated before each send
            final ProducerRecord<String, GenericRecord> record = new ProducerRecord<String, GenericRecord>(topic, customer);
            customer.put("first_name", "Ada");
            customer.put("last_name", "Lovelace");
            customer.put("full_name", "Ada Lovelace");
            producer.send(record);
            System.out.println("Sent message");
            Thread.sleep(1000L);

            customer.put("first_name", "Sue");
            customer.put("last_name", "Black");
            customer.put("full_name", "Sue Black");
            producer.send(record);
            System.out.println("Sent message");
            Thread.sleep(1000L);

            customer.put("first_name", "Anita");
            customer.put("last_name", "Borg");
            customer.put("full_name", "Anita Borg");
            producer.send(record);
            System.out.println("Sent message");
            Thread.sleep(1000L);

            customer.put("first_name", "Grace");
            customer.put("last_name", "Hopper");
            customer.put("full_name", "Grace Hopper");
            producer.send(record);
            System.out.println("Sent message");
            Thread.sleep(1000L);

            customer.put("first_name", "Neha");
            customer.put("last_name", "Narkhede");
            customer.put("full_name", "Neha Narkhede");
            producer.send(record);
            System.out.println("Sent message");
            Thread.sleep(1000L);

            producer.flush();
            System.out.println("Successfully produced 5 messages to a topic called " + topic);
        } catch (final InterruptedException | SerializationException e) {
            e.printStackTrace();
        }
    }
}

Use the following Apache Kafka consumer code to look up the schema information while consuming from a topic to learn the schema details:

package com.amazon.gsrkafka;

import com.amazonaws.services.schemaregistry.deserializers.avro.AWSKafkaAvroDeserializer;
import com.amazonaws.services.schemaregistry.utils.AWSSchemaRegistryConstants;
import com.amazonaws.services.schemaregistry.utils.AvroRecordType;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.errors.SerializationException;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class gsrkafka {
    private static final Properties properties = new Properties();
    private static final String topic = "test";

    public static void main(final String[] args) {
        // Set the default synchronous HTTP client to UrlConnectionHttpClient
        System.setProperty("software.amazon.awssdk.http.service.impl", "software.amazon.awssdk.http.urlconnection.UrlConnectionSdkHttpService");
        properties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        properties.put(ConsumerConfig.GROUP_ID_CONFIG, "gsr-client");
        properties.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        // Keys are plain strings; only the value is deserialized as Avro through the registry
        properties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        properties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, AWSKafkaAvroDeserializer.class.getName());
        properties.put(AWSSchemaRegistryConstants.AWS_REGION, "us-east-2");
        properties.put(AWSSchemaRegistryConstants.REGISTRY_NAME, "liko-schema-registry");
        properties.put(AWSSchemaRegistryConstants.AVRO_RECORD_TYPE, AvroRecordType.GENERIC_RECORD.getName());

        try (final KafkaConsumer<String, GenericRecord> consumer = new KafkaConsumer<String, GenericRecord>(properties)) {
            consumer.subscribe(Collections.singletonList(topic));
            // Poll until the process is interrupted
            while (true) {
                // poll(long) is deprecated; pass a Duration instead
                final ConsumerRecords<String, GenericRecord> records = consumer.poll(Duration.ofMillis(1000));
                for (final ConsumerRecord<String, GenericRecord> record : records) {
                    final GenericRecord value = record.value();
                    System.out.println("Received message: value = " + value);
                }
            }
        } catch (final SerializationException e) {
            e.printStackTrace();
        }
    }
}

Using AWS Glue Schema Registry with Kinesis Data Streams

You can use the following Kinesis Producer Library (KPL) code to publish messages in Apache Avro format to a Kinesis data stream with the preceding schema:

private static final String SCHEMA_DEFINITION = "{\"namespace\": \"Customer.avro\",\n"
    + " \"type\": \"record\",\n"
    + " \"name\": \"Customer\",\n"
    + " \"fields\": [\n"
    + " {\"name\": \"first_name\", \"type\": \"string\"},\n"
    + " {\"name\": \"last_name\", \"type\": \"string\"}\n"
    + " ]\n"
    + "}";

KinesisProducerConfiguration config = new KinesisProducerConfiguration();
config.setRegion("us-west-1");

// [Optional] Configuration for the Schema Registry.
GlueSchemaRegistryConfiguration schemaRegistryConfig =
    new GlueSchemaRegistryConfiguration("us-west-1");
schemaRegistryConfig.setCompression(true);
config.setGlueSchemaRegistryConfiguration(schemaRegistryConfig);
// Optional configuration ends.

final KinesisProducer producer = new KinesisProducer(config);

final ByteBuffer data = getDataToSend();

com.amazonaws.services.schemaregistry.common.Schema gsrSchema =
    new com.amazonaws.services.schemaregistry.common.Schema(SCHEMA_DEFINITION, DataFormat.AVRO.toString(), "demoSchema");

// The stream name, the TIMESTAMP partition key, and the Utils helper are
// assumed to be defined elsewhere in your application.
ListenableFuture<UserRecordResult> f = producer.addUserRecord(
    config.getStreamName(), TIMESTAMP, Utils.randomExplicitHashKey(), data, gsrSchema);

private static ByteBuffer getDataToSend() throws IOException {
    org.apache.avro.Schema avroSchema =
        new org.apache.avro.Schema.Parser().parse(SCHEMA_DEFINITION);

    // Field names must match the Customer schema defined above
    GenericRecord customer = new GenericData.Record(avroSchema);
    customer.put("first_name", "Ada");
    customer.put("last_name", "Lovelace");

    // Encode the record as Avro binary
    ByteArrayOutputStream outBytes = new ByteArrayOutputStream();
    Encoder encoder = EncoderFactory.get().directBinaryEncoder(outBytes, null);
    new GenericDatumWriter<>(avroSchema).write(customer, encoder);
    encoder.flush();
    return ByteBuffer.wrap(outBytes.toByteArray());
}

On the consumer side, you can use the Kinesis Client Library (KCL) (v2.3 or later) to look up schema information while retrieving messages from a Kinesis data stream:

GlueSchemaRegistryConfiguration schemaRegistryConfig =
    new GlueSchemaRegistryConfiguration(this.region.toString());

GlueSchemaRegistryDeserializer glueSchemaRegistryDeserializer =
    new GlueSchemaRegistryDeserializerImpl(DefaultCredentialsProvider.builder().build(), schemaRegistryConfig);

RetrievalConfig retrievalConfig = configsBuilder.retrievalConfig().retrievalSpecificConfig(new PollingConfig(streamName, kinesisClient));
retrievalConfig.glueSchemaRegistryDeserializer(glueSchemaRegistryDeserializer);

Scheduler scheduler = new Scheduler(
    configsBuilder.checkpointConfig(),
    configsBuilder.coordinatorConfig(),
    configsBuilder.leaseManagementConfig(),
    configsBuilder.lifecycleConfig(),
    configsBuilder.metricsConfig(),
    configsBuilder.processorConfig(),
    retrievalConfig
);

public void processRecords(ProcessRecordsInput processRecordsInput) {
    MDC.put(SHARD_ID_MDC_KEY, shardId);
    try {
        log.info("Processing {} record(s)",
            processRecordsInput.records().size());
        processRecordsInput.records()
            .forEach(
                r ->
                    log.info("Processed record pk: {} -- Seq: {} : data {} with schema: {}",
                        r.partitionKey(), r.sequenceNumber(), recordToAvroObj(r).toString(), r.schema()));
    } catch (Throwable t) {
        // Log the cause before halting so the failure can be diagnosed
        log.error("Caught throwable while processing records. Aborting.", t);
        Runtime.getRuntime().halt(1);
    } finally {
        MDC.remove(SHARD_ID_MDC_KEY);
    }
}

private GenericRecord recordToAvroObj(KinesisClientRecord r) {
    // Copy the record payload out of the buffer
    byte[] data = new byte[r.data().remaining()];
    r.data().get(data, 0, data.length);
    // Parse the schema definition that the deserializer attached to the record
    org.apache.avro.Schema schema = new org.apache.avro.Schema.Parser().parse(r.schema().getSchemaDefinition());
    DatumReader<GenericRecord> datumReader = new GenericDatumReader<>(schema);

    BinaryDecoder binaryDecoder = DecoderFactory.get().binaryDecoder(data, 0, data.length, null);
    try {
        return datumReader.read(null, binaryDecoder);
    } catch (IOException e) {
        // read() declares IOException, which cannot propagate out of the forEach lambda
        throw new UncheckedIOException(e);
    }
}

Example of schema evolution

As a producer, let’s say you want to add an additional field to our schema:

{
 "namespace": "Customer.avro",
 "type": "record",
 "name": "Customer",
 "fields": [
 {"name": "first_name", "type": "string"},
 {"name": "last_name", "type": "string"},
 {"name": "full_name", "type": ["null", "string"], "default": null}
]
}

Regardless of whether you’re following the Apache Kafka or Kinesis Data Streams example, you can use the previously provided producer code to publish new messages using this new schema version with the full_name field. This is simply a concatenation of first_name and last_name.

This schema change added an optional field (full_name), which is indicated by the type field having an option of null in addition to string with a default of null. In adding this optional field, we’ve created a schema evolution. This qualifies as a FORWARD compatible change because the producer has modified the schema and the consumer can read without updating its version of the schema. It’s a good practice to provide a default for a given field. This allows for its eventual removal if necessary. If it’s removed by the producer, the consumer uses the default that it knew for that field from before the removal.

This change is also a BACKWARD compatible change, because if the consumer changes the schema it expects to receive, it can use that default to fill in the value for the field it isn’t receiving. By being both FORWARD and BACKWARD compatible, it is therefore a FULL compatible change. The Glue Schema Registry serializers default to BACKWARD compatible, so we have to add a line declaring it as FULL.
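
The two directions described above can be sketched in a few lines. The resolve helper below is hypothetical and only mimics Avro-style schema resolution for flat records: it keeps the fields the reader knows, drops the rest, and fills missing fields from defaults. It is not the logic used by the Glue Schema Registry libraries.

```javascript
// Two versions of the Customer schema, reduced to the parts that matter here.
const v1 = {
  fields: [
    { name: 'first_name', type: 'string' },
    { name: 'last_name', type: 'string' },
  ],
};
const v2 = {
  fields: [
    ...v1.fields,
    { name: 'full_name', type: ['null', 'string'], default: null },
  ],
};

// Hypothetical helper: read a written record through the reader's schema.
function resolve(readerSchema, writtenRecord) {
  const result = {};
  for (const field of readerSchema.fields) {
    if (field.name in writtenRecord) {
      result[field.name] = writtenRecord[field.name];
    } else if ('default' in field) {
      result[field.name] = field.default;
    } else {
      throw new Error(`no value or default for field ${field.name}`);
    }
  }
  return result;
}

const writtenWithV1 = { first_name: 'Ada', last_name: 'Lovelace' };
const writtenWithV2 = { ...writtenWithV1, full_name: 'Ada Lovelace' };

// FORWARD: a consumer still on v1 reads data produced with v2.
const forwardRead = resolve(v1, writtenWithV2);
// BACKWARD: a consumer upgraded to v2 reads data produced with v1.
const backwardRead = resolve(v2, writtenWithV1);

console.log(forwardRead);
console.log(backwardRead);
```

In the forward case the extra full_name value is ignored, and in the backward case full_name is filled with its default of null. Because both reads succeed, the change is FULL compatible.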

In looking at the full option set, you may also find FORWARD_ALL, BACKWARD_ALL, and FULL_ALL. These variants check a new schema version against all previously registered versions rather than only the most recent one. The most commonly observed compatibility mode is BACKWARD, which is why it’s the default.
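
The _ALL variants differ from their base modes in how many earlier versions a new schema is checked against: the most recent one, or all of them. The sketch below shows why that matters, using a hypothetical canRead check that is much simpler than the registry's real compatibility rules.

```javascript
// Hypothetical check for flat record schemas: a reader can decode a writer's
// data if every reader field is present in the writer schema or has a default.
function canRead(reader, writer) {
  const written = new Set(writer.fields.map((f) => f.name));
  return reader.fields.every((f) => written.has(f.name) || 'default' in f);
}

// BACKWARD compares the candidate with the latest registered version only;
// BACKWARD_ALL compares it with every registered version.
function isBackward(candidate, versions) {
  return canRead(candidate, versions[versions.length - 1]);
}
function isBackwardAll(candidate, versions) {
  return versions.every((v) => canRead(candidate, v));
}

const v1 = { fields: [{ name: 'first_name' }] };
const v2 = { fields: [{ name: 'first_name' }, { name: 'last_name', default: '' }] };
// Candidate v3 drops the default from last_name.
const v3 = { fields: [{ name: 'first_name' }, { name: 'last_name' }] };

console.log(isBackward(v3, [v1, v2]));    // compares against v2 only
console.log(isBackwardAll(v3, [v1, v2])); // also compares against v1
```

Here the candidate passes the BACKWARD check, because v2 data always carries last_name, but fails BACKWARD_ALL, because v1 data has no last_name and the candidate no longer supplies a default.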

As a consumer application, however, you don’t want to have to recompile your application to handle the addition of a new field. If you want to reference the customer by full name, that’s your choice in your app instead of being forced to consume the new field and use it. When you consume the new messages you’ve just produced, your application doesn’t crash or have problems, because it’s still using the prior version of the schema, and that schema change is compatible with your application. To experience this in action, run the consumer code in one window and don’t interrupt it. As you run the producer application again, this time with messages following the new schema, you can still see output without issue, thanks to the Glue Schema Registry.

Conclusion

In this post, we discussed the benefits of using the Glue Schema Registry to register, validate, and evolve schemas for data streams as business needs change. We also provided examples of how to use Glue Schema Registry with Apache Kafka and Kinesis Data Streams.

For more information and to get started, see AWS Glue Schema Registry.

About the Authors

Brian Likosar is a Senior Streaming Specialist Solutions Architect at Amazon Web Services. Brian loves helping customers capture value from real-time streaming architectures, because he knows life doesn’t happen in batch. He’s a big fan of open-source collaboration, theme parks, and live music.

Larry Heathcote is a Senior Product Marketing Manager at Amazon Web Services for data streaming and analytics. Larry is passionate about seeing the results of data-driven insights on business outcomes. He enjoys walking his Samoyed Sasha in the mornings so she can look for squirrels to bark at.

Using self-hosted Apache Kafka as an event source for AWS Lambda

2020-12-16 James Beswick

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/using-self-hosted-apache-kafka-as-an-event-source-for-aws-lambda/

Apache Kafka is an open source event streaming platform used to support workloads such as data pipelines and streaming analytics. It is a distributed streaming platform that is conceptually similar to Amazon Kinesis.

With the launch of Kafka as an event source for Lambda, you can now consume messages from a topic in a Lambda function. This makes it easier to integrate your self-hosted Kafka clusters with downstream serverless workflows.

In this blog post, I explain how to set up an Apache Kafka cluster on Amazon EC2 and configure key elements in the networking configuration. I also show how to create a Lambda function to consume messages from a Kafka topic. Although the process is similar to using Amazon Managed Streaming for Apache Kafka (Amazon MSK) as an event source, there are also some important differences.

Overview

Using Kafka as an event source operates in a similar way to using Amazon SQS or Amazon Kinesis. In all cases, the Lambda service internally polls for new records or messages from the event source, and then synchronously invokes the target Lambda function. Lambda reads the messages in batches and provides the message batches to your function in the event payload.

Lambda is a consumer application for your Kafka topic. It processes records from one or more partitions and sends the payload to the target function. Lambda continues to process batches until there are no more messages in the topic.
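
The poll, batch, and invoke flow described above can be modeled roughly as follows. This is a simplified stand-in and not how the Lambda service is implemented: toBatches and drain are hypothetical names, and the event object passed to the handler is a placeholder rather than the real Kafka payload.

```javascript
// Hypothetical stand-in for the poller: split pending messages into batches
// of at most batchSize.
function toBatches(messages, batchSize) {
  const batches = [];
  for (let i = 0; i < messages.length; i += batchSize) {
    batches.push(messages.slice(i, i + batchSize));
  }
  return batches;
}

// Invoke the handler once per batch, waiting for each invocation to finish
// before sending the next one, until no messages remain.
async function drain(messages, batchSize, handler) {
  let invocations = 0;
  for (const batch of toBatches(messages, batchSize)) {
    await handler({ messages: batch });
    invocations++;
  }
  return invocations;
}

const pending = ['m1', 'm2', 'm3', 'm4', 'm5'];
drain(pending, 2, async (event) => console.log('batch:', event.messages))
  .then((count) => console.log('invocations:', count));
```

With five pending messages and a batch size of two, the handler is invoked three times, and the last batch holds a single message.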

Configuring networking for self-hosted Kafka

It’s best practice to deploy the Amazon EC2 instances running Kafka in private subnets. For the Lambda function to poll the Kafka instances, you must ensure that there is a NAT Gateway running in a public subnet of the VPC.

It’s possible to route the traffic to a single NAT Gateway in one AZ for test and development workloads. For redundancy in production workloads, it’s recommended that there is one NAT Gateway available in each Availability Zone. This walkthrough creates the following architecture:

Deploy a VPC with public and private subnets and a NAT Gateway that enables internet access. To configure this infrastructure with AWS CloudFormation, deploy this template.
From the VPC console, edit the default security group created by this template to provide inbound access to the following ports:
- Custom TCP: ports 2888–3888 from all sources.
- SSH (port 22), restricted to your own IP address.
- Custom TCP: port 2181 from all sources.
- Custom TCP: port 9092 from all sources.
- All traffic from the same security group identifier.

Deploying the EC2 instances and installing Kafka

Next, you deploy the EC2 instances using this network configuration and install the Kafka application:

From the EC2 console, deploy an instance running Ubuntu Server 18.04 LTS. Ensure that there is one instance in each private subnet, in different Availability Zones. Assign the default security group configured by the template.
Next, deploy another EC2 instance in either of the public subnets. This is a bastion host used to access the private instances. Assign the default security group configured by the template.
Connect to the bastion host, then SSH to the first private EC2 instance using the method for your preferred operating system. This post explains different methods. Repeat the process in another terminal for the second private instance.

On each instance, install Java:

sudo apt update
sudo apt install openjdk-8-jdk
java -version

On each instance, install Kafka:

wget https://archive.apache.org/dist/kafka/2.3.1/kafka_2.12-2.3.1.tgz
tar xzf kafka_2.12-2.3.1.tgz
ln -s kafka_2.12-2.3.1 kafka

Configure and start Zookeeper

Configure and start the Zookeeper service that manages the Kafka brokers:

On the first instance, configure the Zookeeper ID:

cd kafka
mkdir /tmp/zookeeper
touch /tmp/zookeeper/myid
echo "1" >> /tmp/zookeeper/myid

Repeat the process on the second instance, using a different ID value:

cd kafka
mkdir /tmp/zookeeper
touch /tmp/zookeeper/myid
echo "2" >> /tmp/zookeeper/myid

On the first instance, edit the config/zookeeper.properties file, adding the private IP address of the second instance:

initLimit=5
syncLimit=2
tickTime=2000
# list of servers: <ip>:2888:3888
server.1=0.0.0.0:2888:3888 
server.2=<<IP address of second instance>>:2888:3888

On the second instance, edit the config/zookeeper.properties file, adding the private IP address of the first instance:

initLimit=5
syncLimit=2
tickTime=2000
# list of servers: <ip>:2888:3888
server.1=<<IP address of first instance>>:2888:3888 
server.2=0.0.0.0:2888:3888

On each instance, start Zookeeper:

bin/zookeeper-server-start.sh config/zookeeper.properties
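
The server lists in the two zookeeper.properties files follow one pattern: each instance lists itself as 0.0.0.0 and lists its peer by private IP address. As a sketch of that pattern, the hypothetical helper below generates the lines for either instance. The IP addresses are placeholders, not values from this walkthrough.

```javascript
// Hypothetical helper: build the server list for config/zookeeper.properties.
// ips[i] is the private IP of the instance with Zookeeper ID i + 1; the local
// instance is written as 0.0.0.0 and the peers by their private IP address.
function zookeeperServers(ips, myId) {
  return ips
    .map((ip, i) => {
      const id = i + 1;
      const host = id === myId ? '0.0.0.0' : ip;
      return `server.${id}=${host}:2888:3888`;
    })
    .join('\n');
}

// Placeholder addresses for illustration only.
const ips = ['10.0.1.10', '10.0.2.10'];
console.log(zookeeperServers(ips, 1));
console.log(zookeeperServers(ips, 2));
```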

Configure and start Kafka

Configure and start the Kafka broker:

On the first instance, edit the config/server.properties file:

broker.id=1
zookeeper.connect=0.0.0.0:2181,<<IP address of second instance>>:2181

On the second instance, edit the config/server.properties file:

broker.id=2
zookeeper.connect=0.0.0.0:2181,<<IP address of first instance>>:2181

Start Kafka on each instance:

bin/kafka-server-start.sh config/server.properties
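
The broker settings follow the same pattern as the Zookeeper ones: a unique broker.id per instance and a comma-separated zookeeper.connect list with no spaces. The hypothetical helper below sketches how those two lines are put together. The peer address is a placeholder.

```javascript
// Hypothetical helper: build the two config/server.properties lines edited in
// this walkthrough. The local Zookeeper is listed as 0.0.0.0, followed by the
// private IP addresses of the other instances, all on client port 2181.
function brokerProperties(brokerId, otherZookeeperIps) {
  const connect = ['0.0.0.0', ...otherZookeeperIps]
    .map((ip) => `${ip}:2181`)
    .join(',');
  return [`broker.id=${brokerId}`, `zookeeper.connect=${connect}`].join('\n');
}

// Placeholder peer address for illustration only.
console.log(brokerProperties(1, ['10.0.2.10']));
```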

At the end of this process, Zookeeper and Kafka are running on both instances. If you use separate terminals, it looks like this:

Configuring and publishing to a topic

Kafka organizes channels of messages around topics, which are virtual groups of one or many partitions across Kafka brokers in a cluster. Multiple producers can send messages to Kafka topics, which can then be routed to and processed by multiple consumers. Producers publish to the tail of a topic and consumers read the topic at their own pace.
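
To make the idea of publishing to the tail and reading at your own pace more concrete, here is a minimal in-memory model of a single partition. It is a teaching sketch only: the Partition class and its methods are hypothetical and leave out everything that makes Kafka distributed and durable.

```javascript
// Minimal in-memory model of one partition: an append-only log plus a
// separate read offset per consumer. Names here are illustrative only.
class Partition {
  constructor() {
    this.log = [];
    this.offsets = new Map();
  }

  // Producers always append to the tail and get back the new offset.
  publish(message) {
    this.log.push(message);
    return this.log.length - 1;
  }

  // Each consumer reads from its own offset, independent of the others.
  poll(consumerId, maxMessages) {
    const start = this.offsets.get(consumerId) || 0;
    const batch = this.log.slice(start, start + maxMessages);
    this.offsets.set(consumerId, start + batch.length);
    return batch;
  }
}

const partition = new Partition();
['a', 'b', 'c'].forEach((m) => partition.publish(m));
console.log(partition.poll('fast', 10)); // reads everything available
console.log(partition.poll('slow', 1));  // reads one message at a time
console.log(partition.poll('slow', 1));
```

The fast consumer catching up does not affect the slow one, because each consumer tracks its own position in the log.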

From either of the two instances:

Create a new topic called test:

bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 2 --partitions 2 --topic test

Start a producer:

bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test

Enter test messages to check for successful publication:

At this point, you can successfully publish messages to your self-hosted Kafka cluster. Next, you configure a Lambda function as a consumer for the test topic on this cluster.

Configuring the Lambda function and event source mapping

You can create the Lambda event source mapping using the AWS CLI or AWS SDK, which provide the CreateEventSourceMapping API. In this walkthrough, you use the AWS Management Console to create the event source mapping.

Create a Lambda function that uses the self-hosted cluster and topic as an event source:

From the Lambda console, select Create function.
Enter a function name, and select Node.js 12.x as the runtime.
Select the Permissions tab, and select the role name in the Execution role panel to open the IAM console.

Choose Add inline policy and create a new policy called SelfHostedKafkaPolicy with the following permissions. Replace the resource example with the ARNs of your instances:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:CreateNetworkInterface",
                "ec2:DescribeNetworkInterfaces",
                "ec2:DescribeVpcs",
                "ec2:DeleteNetworkInterface",
                "ec2:DescribeSubnets",
                "ec2:DescribeSecurityGroups",
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "arn:aws:ec2:<REGION>:<ACCOUNT_ID>:instance/<instance-id>"
        }
    ]
}

Choose Create policy and ensure that the policy appears in Permissions policies.
Back in the Lambda function, select the Configuration tab. In the Designer panel, choose Add trigger.
In the dropdown, select Apache Kafka:
- For Bootstrap servers, add the private IPv4 DNS name of each of the two instances, with port 9092 appended.
- For Topic name, enter ‘test’.
- Enter your preferred batch size and starting position values (see this documentation for more information).
- For VPC, select the VPC created by the template.
- For VPC subnets, select the two private subnets.
- For VPC security groups, select the default security group.
- Choose Add.
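
If you prefer the CreateEventSourceMapping API over the console, the same settings map onto a request roughly like the one assembled below. This sketch only builds and prints the parameters and does not call AWS. The helper name and every ID are placeholders, and the field names are intended to match the CreateEventSourceMapping request, so verify them against the API reference before use.

```javascript
// Hypothetical helper that assembles a CreateEventSourceMapping request body
// for a self-hosted Kafka cluster. Batch size and starting position are
// example values.
function buildMappingParams({ functionName, brokers, topic, subnetIds, securityGroupId }) {
  return {
    FunctionName: functionName,
    Topics: [topic],
    BatchSize: 100,
    StartingPosition: 'TRIM_HORIZON',
    SelfManagedEventSource: {
      Endpoints: {
        // Each bootstrap server is a private DNS name with port 9092 appended.
        KAFKA_BOOTSTRAP_SERVERS: brokers.map((host) => `${host}:9092`),
      },
    },
    SourceAccessConfigurations: [
      ...subnetIds.map((id) => ({ Type: 'VPC_SUBNET', URI: `subnet:${id}` })),
      { Type: 'VPC_SECURITY_GROUP', URI: `security_group:${securityGroupId}` },
    ],
  };
}

// All names and IDs below are placeholders for illustration only.
const params = buildMappingParams({
  functionName: 'my-kafka-consumer',
  brokers: ['ip-10-0-1-10.ec2.internal', 'ip-10-0-2-10.ec2.internal'],
  topic: 'test',
  subnetIds: ['subnet-aaaa1111', 'subnet-bbbb2222'],
  securityGroupId: 'sg-cccc3333',
});
console.log(JSON.stringify(params, null, 2));
```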

The trigger’s status changes to Enabled in the Lambda console after a few seconds. It then takes several minutes for the trigger to receive messages from the Kafka cluster.

Testing the Lambda function

At this point, you have created a VPC with two private and two public subnets and a NAT Gateway. You have created a Kafka cluster on two EC2 instances in private subnets. You set up a target Lambda function with the necessary IAM permissions. Next, you publish messages to the test topic in Kafka and see the resulting invocation in the logs for the Lambda function.

In the Function code panel, replace the contents of index.js with the following code and choose Deploy:

exports.handler = async (event) => {
    // Iterate through keys (one per topic partition)
    for (const key in event.records) {
      console.log('Key: ', key)
      // Iterate through the records for this key
      event.records[key].forEach((record) => {
        console.log('Record: ', record)
        // Decode base64
        const msg = Buffer.from(record.value, 'base64').toString()
        console.log('Message:', msg)
      })
    }
}

Back in the terminal with the producer script running, enter a test message:
In the Lambda function console, select the Monitoring tab then choose View logs in CloudWatch. In the latest log stream, you see the original event and the decoded message:

Using Kafka as an event source for Lambda

The Lambda function target in the event source mapping does not need to be connected to a VPC to receive messages from the private instance hosting Kafka. However, you must provide details of the VPC, subnets, and security groups in the event source mapping for the Kafka cluster.

The Lambda function must have permission to describe VPCs and security groups, and manage elastic network interfaces. The required execution role permissions are:

ec2:CreateNetworkInterface
ec2:DescribeNetworkInterfaces
ec2:DescribeVpcs
ec2:DeleteNetworkInterface
ec2:DescribeSubnets
ec2:DescribeSecurityGroups

The event payload for the Lambda function contains an array of records. Each array item contains details of the topic and Kafka partition identifier, together with a timestamp and base64 encoded message:
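
As a sketch of that structure, the example below builds a small event and decodes it. The records object is keyed by topic and partition, matching what the handler earlier in this post iterates over. The offset, timestamp, and message text are made-up placeholder values, and other top-level fields of the real payload are omitted.

```javascript
// Illustrative event shaped like the payload described above.
const sampleEvent = {
  records: {
    // Keys combine the topic name and the partition number.
    'test-0': [
      {
        topic: 'test',
        partition: 0,
        offset: 0,
        timestamp: 1608076800000,
        timestampType: 'CREATE_TIME',
        value: Buffer.from('Hello from Kafka').toString('base64'),
      },
    ],
  },
};

// Flatten the per-partition arrays and decode each base64 value.
function decodeMessages(event) {
  return Object.values(event.records)
    .flat()
    .map((record) => ({
      topic: record.topic,
      partition: record.partition,
      message: Buffer.from(record.value, 'base64').toString(),
    }));
}

console.log(decodeMessages(sampleEvent));
```

You can pass an object like sampleEvent to the handler locally to check the decoding logic before publishing real messages.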

There is an important difference in the way the Lambda service connects to the self-hosted Kafka cluster compared with Amazon MSK. MSK encrypts data in transit by default so the broker connection defaults to using TLS. With a self-hosted cluster, TLS authentication is not supported when using the Apache Kafka event source. Instead, if accessing brokers over the internet, the event source uses SASL/SCRAM authentication, which can be configured in the event source mapping:
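
When the mapping is created through the API instead, the authentication setting is one more entry in the SourceAccessConfigurations list. The sketch below assumes the Kafka credentials are stored in AWS Secrets Manager and that the type names match the values accepted by the Lambda API. The helper name and the ARN are placeholders.

```javascript
// Hypothetical helper that builds the authentication entry for the
// SourceAccessConfigurations list of an event source mapping.
const SUPPORTED_MECHANISMS = ['SASL_SCRAM_256_AUTH', 'SASL_SCRAM_512_AUTH'];

function saslScramAccess(mechanism, secretArn) {
  if (!SUPPORTED_MECHANISMS.includes(mechanism)) {
    throw new Error(`unsupported mechanism: ${mechanism}`);
  }
  // The URI points at the secret that stores the Kafka user name and password.
  return { Type: mechanism, URI: secretArn };
}

// Placeholder ARN for illustration only.
const access = saslScramAccess(
  'SASL_SCRAM_512_AUTH',
  'arn:aws:secretsmanager:us-east-1:123456789012:secret:kafka-credentials'
);
console.log(access);
```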

To learn how to configure SASL/SCRAM authentication for your self-hosted Kafka cluster, see this documentation.

Conclusion

Lambda now supports self-hosted Kafka as an event source so you can invoke Lambda functions from messages in Kafka topics to integrate into other downstream serverless workflows.

This post shows how to configure a self-hosted Kafka cluster on EC2 and set up the network configuration. I also cover how to set up the event source mapping in Lambda and test a function to decode the messages sent from Kafka.

To learn more about how to use this feature, read the documentation. For more serverless learning resources, visit Serverless Land.