Tag Archives: Technical How-to

Resolve private DNS hostnames for Amazon MSK Connect

Post Syndicated from Amar Surjit original https://aws.amazon.com/blogs/big-data/resolve-private-dns-hostnames-for-amazon-msk-connect/

Amazon MSK Connect is a feature of Amazon Managed Streaming for Apache Kafka (Amazon MSK) that offers a fully managed Apache Kafka Connect environment on AWS. With MSK Connect, you can deploy fully managed connectors built for Kafka Connect that move data into or pull data from popular data stores like Amazon S3 and Amazon OpenSearch Service. With the introduction of private DNS support in MSK Connect, connectors can resolve private customer domain names using the DNS servers configured in the customer VPC DHCP option set. This post demonstrates a solution for resolving private DNS hostnames defined in a customer VPC for MSK Connect.

You may want to use private DNS hostname support for MSK Connect for multiple reasons. Before the private DNS resolution capability was added, MSK Connect used the service VPC DNS resolver for DNS resolution and didn't use the private DNS servers defined in the customer VPC DHCP option sets. Connectors could only reference hostnames in the connector configuration or plugin that were publicly resolvable; they couldn't resolve private hostnames defined in a private hosted zone or use DNS servers in another customer network.

Many customers ensure that their internal application DNS names are not publicly resolvable. For example, you might have a MySQL or PostgreSQL database and not want its DNS name to be publicly resolvable or accessible. Amazon Relational Database Service (Amazon RDS) and Amazon Aurora servers have DNS names that are publicly resolvable but not accessible. You can have multiple internal applications, such as databases, data warehouses, or other systems, whose DNS names are not publicly resolvable.

With the recent launch of MSK Connect private DNS support, you can configure connectors to reference public or private domain names. Connectors use the DNS servers configured in your VPC’s DHCP option set to resolve domain names. You can now use MSK Connect to privately connect with databases, data warehouses, and other resources in your VPC to comply with your security needs.

If you have a MySQL or PostgreSQL database with a private DNS name, you can host that name on a custom DNS server and configure the VPC-specific DHCP option set so that DNS resolution uses the custom DNS server local to the VPC instead of the service DNS resolution.

Solution overview

You can choose from several architecture options when setting up MSK Connect. For example, the source system, Amazon MSK, and MSK Connect can all be in the same VPC; the source system can be in VPC1 while Amazon MSK and MSK Connect are in VPC2; or the source system, Amazon MSK, and MSK Connect can each be in different VPCs.

The following setup uses two different VPCs, where the MySQL VPC hosts the MySQL database and the MSK VPC hosts Amazon MSK, MSK Connect, the DNS server, and various other components. You can extend this architecture to support other deployment topologies using appropriate AWS Identity and Access Management (IAM) permissions and connectivity options.

This post provides step-by-step instructions to set up MSK Connect so that it receives data from a source MySQL database with a private DNS hostname in the MySQL VPC and sends the data to Amazon MSK in another VPC. The following diagram illustrates the high-level architecture.

The setup instructions include the following key steps:

  1. Set up the VPCs, subnets, and other core infrastructure components.
  2. Install and configure the DNS server.
  3. Upload the data to the MySQL database.
  4. Deploy Amazon MSK and MSK Connect and consume the change data capture (CDC) records.

Prerequisites

To follow the tutorial in this post, you need the following:

Create the required infrastructure using AWS CloudFormation

Before configuring MSK Connect, we need to set up the VPCs, subnets, and other core infrastructure components. To set up resources in your AWS account, complete the following steps:

  1. Choose Launch Stack to launch the stack in a Region that supports Amazon MSK and MSK Connect.
  2. Specify the private key that you use to connect to the EC2 instances.
  3. Update the SSH location with your local IP address and keep the other values as default.
  4. Choose Next.
  5. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  6. Choose Create stack and wait for the required resources to get created.

The CloudFormation template creates the following key resources in your account:

  • VPCs:
    • MSK VPC
    • MySQL VPC
  • Subnets in the MSK VPC:
    • Three private subnets for Amazon MSK
    • Private subnet for DNS server
    • Private subnet for MSKClient
    • Public subnet for bastion host
  • Subnets in the MySQL VPC:
    • Private subnet for MySQL database
    • Public subnet for bastion host
  • Internet gateway attached to the MySQL VPC and MSK VPC
  • NAT gateways attached to MySQL public subnet and MSK public subnet
  • Route tables to support the traffic flow between different subnets in a VPC and across VPCs
  • Peering connection between the MySQL VPC and MSK VPC
  • MySQL database and configurations
  • DNS server
  • MSK client with respective libraries

Please note that if you're using VPC peering or AWS Transit Gateway with MSK Connect, don't configure your connector to reach the peered VPC resources using IP addresses in the restricted CIDR ranges. For more information, refer to Connecting from connectors.

Configure the DNS server

Complete the following steps to configure the DNS server:

  1. Connect to the DNS server. There are three configuration files available on the DNS server under the /home/ec2-user folder:
    • named.conf
    • mysql.internal.zone
    • kafka.us-east-1.amazonaws.com.zone
  2. Run the following commands to install and configure your DNS server:
    sudo yum install bind bind-utils -y
    cp /home/ec2-user/named.conf /etc/named.conf
    chmod 644 /etc/named.conf
    cp mysql.internal.zone /var/named/mysql.internal.zone
    cp kafka.region.amazonaws.com.zone /var/named/kafka.region.amazonaws.com.zone
    

  3. Update /etc/named.conf. For the allow-transfer attribute, add the DNS server internal IP address so that the line reads allow-transfer { localhost; <DNS Server internal IP address>; };.

You can find the DNS server IP address on the CloudFormation template Outputs tab.

Note that the MSK cluster is not set up yet at this stage. We need to update the Kafka broker DNS names and their respective internal IP addresses in the /var/named/kafka.region.amazonaws.com configuration file after setting up the MSK cluster later in this post. For instructions, refer to the Update the /var/named/kafka.region.amazonaws.com zone file section later in this post.

Also note that these settings configure the DNS server for this post. In your own environment, you can configure the DNS server as per your needs.
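For reference, the allow-transfer change typically sits inside the options block of named.conf. The following is only a minimal sketch of standard BIND syntax; the actual file provided by the CloudFormation template may differ, and the IP address shown is a placeholder:

options {
    directory "/var/named";
    allow-query { any; };
    allow-transfer { localhost; <DNS Server internal IP address>; };
};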

  4. Restart the DNS service:
    sudo su
    service named restart
    

You should see the following message:

Redirecting to /bin/systemctl restart named.service

Your custom DNS server is up and running now.
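To confirm that the zone is being served before moving on, you can query the DNS server directly. This is a quick sanity check, assuming the zone file defines the local.mysql.internal record used later in this post:

nslookup local.mysql.internal localhost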

Upload the data to the MySQL database

Typically, you would use an Amazon RDS for MySQL database, but for this post, we use a custom MySQL database server. The Amazon RDS DNS name is publicly resolvable and MSK Connect has always supported it, but in the past MSK Connect couldn't support databases or applications with private DNS names. With the private DNS hostnames feature, it can now support applications' private DNS as well, so we use a MySQL database on an EC2 instance.

This section describes setting up the MySQL database on a single-node EC2 instance. This setup should not be used for production. You should follow appropriate guidance for setting up and configuring MySQL in your account.

The MySQL database is already set up using the CloudFormation template and is ready to use now. To upload the data, complete the following steps:

  1. SSH to the MySQL EC2 instance. For instructions, refer to Connect to your Linux instance. The data file salesdb.sql is already downloaded and available under the /home/ec2-user directory.
  2. Log in to mysqldb with the user name master.
  3. To access the password, navigate to AWS Systems Manager and choose Parameter Store in the navigation pane. Select /Database/Credentials/master, choose View Details, and copy the value for the key (alternatively, you can retrieve it with the AWS CLI; see the sketch at the end of this section).
  4. Log in to MySQL using the following command:
    mysql -umaster -p<MySQLMasterUserPassword>
    

  5. Run the following commands to create the salesdb database and load the data to the table:
    use salesdb;
    source /home/ec2-user/salesdb.sql;

This will insert the records into various tables in the salesdb database.

  6. Run show tables to see the following tables in the salesdb database:
    mysql> show tables;
    +-----------------------+
    | Tables_in_salesdb |
    +-----------------------+
    | CUSTOMER |
    | CUSTOMER_SITE |
    | PRODUCT |
    | PRODUCT_CATEGORY |
    | SALES_ORDER |
    | SALES_ORDER_ALL |
    | SALES_ORDER_DETAIL |
    | SALES_ORDER_DETAIL_DS |
    | SALES_ORDER_V |
    | SUPPLIER |
    +-----------------------+
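If you prefer the AWS CLI over the console for retrieving the database password, a command like the following works as a sketch (it assumes the stack stores the value at the parameter path shown earlier; --with-decryption covers the case where it is a SecureString):

aws ssm get-parameter \
    --name /Database/Credentials/master \
    --with-decryption \
    --query Parameter.Value \
    --output text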

Create a DHCP option set

DHCP option sets give you control over the following aspects of routing in your virtual network:

  • You can control the DNS servers, domain names, or Network Time Protocol (NTP) servers used by the devices in your VPC.
  • You can disable DNS resolution completely in your VPC.

To support private DNS, you can use an Amazon Route 53 private zone or your own custom DNS server. If you use a Route 53 private zone, the setup will work automatically and there is no need to make any changes to the default DHCP option set for the MSK VPC. For a custom DNS server, complete the following steps to set up a custom DHCP configuration using Amazon Virtual Private Cloud (Amazon VPC) and attach it to the MSK VPC.

There will be a default DHCP option set in your VPC attached to the Amazon provided DNS server. At this stage, the requests will go to Amazon’s provided DNS server for resolution. However, we create a new DHCP option set because we’re using a custom DNS server.

  1. On the Amazon VPC console, choose DHCP option set in the navigation pane.
  2. Choose Create DHCP option set.
  3. For DHCP option set name, enter MSKConnect_Private_DHCP_OptionSet.
  4. For Domain name, enter mysql.internal.
  5. For Domain name server, enter the DNS server IP address.
  6. Choose Create DHCP option set.
  7. Navigate to the MSK VPC and on the Actions menu, choose Edit VPC settings.
  8. Select the newly created DHCP option set and save it.
    The following screenshot shows the example configurations.
  9. On the Amazon EC2 console, navigate to privateDNS_bastion_host.
  10. Choose Instance state and Reboot instance.
  11. Wait a few minutes, then run nslookup from the bastion host; it should resolve the name using your local DNS server instead of Route 53:
nslookup local.mysql.internal
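If you prefer to script the DHCP option set change, the console steps above map roughly to the following AWS CLI calls. This is a sketch; the DNS server IP address, VPC ID, and bastion instance ID are placeholders you must replace:

aws ec2 create-dhcp-options \
    --dhcp-configurations \
        "Key=domain-name,Values=mysql.internal" \
        "Key=domain-name-servers,Values=<DNS server IP address>" \
    --tag-specifications 'ResourceType=dhcp-options,Tags=[{Key=Name,Value=MSKConnect_Private_DHCP_OptionSet}]'

aws ec2 associate-dhcp-options \
    --dhcp-options-id <DHCP options ID returned by the previous command> \
    --vpc-id <MSK VPC ID>

aws ec2 reboot-instances --instance-ids <privateDNS_bastion_host instance ID>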

Now our base infrastructure setup is ready to move to the next stage. As part of our base infrastructure, we have set up the following key components successfully:

  • MSK and MySQL VPCs
  • Subnets
  • EC2 instances
  • VPC peering
  • Route tables
  • NAT gateways and internet gateways
  • DNS server and configuration
  • Appropriate security groups and NACLs
  • MySQL database with the required data

At this stage, the MySQL DB DNS name is resolvable using a custom DNS server instead of Route 53.

Set up the MSK cluster and MSK Connect

The next step is to deploy the MSK cluster and MSK Connect, which will fetch records from the salesdb and send it to an Amazon Simple Storage Service (Amazon S3) bucket. In this section, we provide a walkthrough of replicating the MySQL database (salesdb) to Amazon MSK using Debezium, an open-source connector. The connector will monitor for any changes to the database and capture any changes to the tables.

With MSK Connect, you can run fully managed Apache Kafka Connect workloads on AWS. MSK Connect provisions the required resources and sets up the cluster. It continuously monitors the health and delivery state of connectors, patches and manages the underlying hardware, and auto scales connectors to match changes in throughput. As a result, you can focus your resources on building applications rather than managing infrastructure.

MSK Connect will make use of the custom DNS server in the VPC and it won’t be dependent on Route 53.

Create an MSK cluster configuration

Complete the following steps to create an MSK cluster:

  1. On the Amazon MSK console, choose Cluster configurations under MSK clusters in the navigation pane.
  2. Choose Create configuration.
  3. Name the configuration mskc-tutorial-cluster-configuration.
  4. Under Configuration properties, remove everything and add the line auto.create.topics.enable=true.
  5. Choose Create.
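As an alternative to the console, the same cluster configuration can be created from the AWS CLI. The following is a sketch, assuming the single property is written to a local file first:

echo "auto.create.topics.enable=true" > cluster-config.txt

aws kafka create-configuration \
    --name mskc-tutorial-cluster-configuration \
    --server-properties fileb://cluster-config.txt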

Create an MSK cluster and attach the configuration

In the next step, we attach this configuration to a cluster. Complete the following steps:

  1. On the Amazon MSK console, choose Clusters under MSK clusters in the navigation pane.
  2. Choose Create cluster, then choose Custom create.
  3. For the cluster name, enter mkc-tutorial-cluster.
  4. Under General cluster properties, choose Provisioned for the cluster type and use the Apache Kafka default version 2.8.1.
  5. Use all the default options for the Brokers and Storage sections.
  6. Under Configurations, choose Custom configuration.
  7. Select mskc-tutorial-cluster-configuration with the appropriate revision and choose Next.
  8. Under Networking, choose the MSK VPC.
  9. Select the Availability Zones depending on your Region, such as us-east-1a, us-east-1b, and us-east-1c, and the respective private subnets MSK-Private-1, MSK-Private-2, and MSK-Private-3 if you are in the us-east-1 Region. Public access to these brokers should be off.
  10. Copy the security group ID from Chosen security groups.
  11. Choose Next.
  12. Under Access control methods, select IAM role-based authentication.
  13. In the Encryption section, under Between clients and brokers, TLS encryption will be selected by default.
  14. For Encrypt data at rest, select Use AWS managed key.
  15. Use the default options for Monitoring and select Basic monitoring.
  16. Select Deliver to Amazon CloudWatch Logs.
  17. Under Log group, choose visit Amazon CloudWatch Logs console.
  18. Choose Create log group.
  19. Enter a log group name and choose Create.
  20. Return to the Monitoring and tags page and under Log groups, choose Choose log group.
  21. Choose Next.
  22. Review the configurations and choose Create cluster. You’re redirected to the details page of the cluster.
  23. Under Security groups applied, note the security group ID to use in a later step.

Cluster creation can typically take 25–30 minutes. Its status changes to Active when it’s created successfully.
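You can check the cluster state and, once it is Active, retrieve the IAM bootstrap broker string from the AWS CLI as well. This sketch assumes you substitute your cluster ARN:

aws kafka describe-cluster --cluster-arn <MSK cluster ARN> --query 'ClusterInfo.State'

aws kafka get-bootstrap-brokers --cluster-arn <MSK cluster ARN>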

Update the /var/named/kafka.region.amazonaws.com zone file

Before you create the MSK connector, update the DNS server configurations with the MSK cluster details.

  1. To get the list of bootstrap server DNS and respective IP addresses, navigate to the cluster and choose View client information.
  2. Copy the bootstrap server information with IAM authentication type.
  3. You can identify the broker IP addresses by running nslookup from your local machine; it returns each broker's internal (local) IP address. At this stage, your VPC points to the new DHCP option set, and your custom DNS server will not yet be able to resolve these broker DNS names from within your VPC.
    nslookup <broker 1 DNS name>

Now you can log in to the DNS server and update the records for different brokers and respective IP addresses in the /var/named/kafka.region.amazonaws.com file.

  4. Upload the msk-access.pem file to BastionHostInstance from your local machine:
    scp -i "<your pem file>" msk-access.pem ec2-user@<BastionHostInstance IP address>:/home/ec2-user/

  5. Log in to the DNS server, open the /var/named/kafka.region.amazonaws.com file, and update the following lines with the correct MSK broker DNS names and their respective IP addresses:
    <b-1.<clustername>.******.c6> IN A <Internal IP Address - broker 1>
    <b-2.<clustername>.******.c6> IN A <Internal IP Address - broker 2>
    <b-3.<clustername>.******.c6> IN A <Internal IP Address - broker 3>
    

Note that you need to provide the broker DNS as mentioned earlier. Remove .kafka.<region id>.amazonaws.com from the broker DNS name.

  6. Restart the DNS service:
    sudo su
    service named restart

You should see the following message:

Redirecting to /bin/systemctl restart named.service

Your custom DNS server is up and running, and you should now be able to resolve the broker DNS names using the internal DNS server.

Update the security group for connectivity between the MySQL database and MSK Connect

It’s important to have the appropriate connectivity in place between MSK Connect and the MySQL database. Complete the following steps:

  1. On the Amazon MSK console, navigate to the MSK cluster and under Network settings, copy the security group.
  2. On the Amazon EC2 console, choose Security groups in the navigation pane.
  3. Edit the security group MySQL_SG and choose Add rule.
  4. Add a rule with MySQL/Aurora as the type and the MSK security group as the inbound resource for its source.
  5. Choose Save rules.
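If you prefer the AWS CLI, the same inbound rule can be added with a single call. This sketch assumes you substitute the two security group IDs captured earlier:

aws ec2 authorize-security-group-ingress \
    --group-id <MySQL_SG security group ID> \
    --protocol tcp \
    --port 3306 \
    --source-group <MSK cluster security group ID>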

Create the MSK connector

To create your MSK connector, complete the following steps:

  1. On the Amazon MSK console, choose Connectors under MSK Connect in the navigation pane.
  2. Choose Create connector.
  3. Select Create custom plugin.
  4. Download the MySQL connector plugin for the latest stable release from the Debezium site or download Debezium.zip.
  5. Upload the MySQL connector zip file to the S3 bucket.
  6. Copy the URL for the file, such as s3://<bucket name>/Debezium.zip.
  7. Return to the Choose custom plugin page and enter the S3 file path for S3 URI.
  8. For Custom plugin name, enter mysql-plugin.
  9. Choose Next.
  10. For Name, enter mysql-connector.
  11. For Description, enter a description of the connector.
  12. For Cluster type, choose MSK Cluster.
  13. Select the existing cluster from the list (for this post, mkc-tutorial-cluster).
  14. Specify the authentication type as IAM.
  15. Use the following values for Connector configuration:
    connector.class=io.debezium.connector.mysql.MySqlConnector
    database.history.producer.sasl.mechanism=AWS_MSK_IAM
    database.history.producer.sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required;
    database.allowPublicKeyRetrieval=true
    database.user=master
    database.server.id=123456
    tasks.max=1
    database.history.consumer.sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required;
    database.history.producer.security.protocol=SASL_SSL
    database.history.kafka.topic=dbhistory.salesdb
    database.history.kafka.bootstrap.servers=b-3.xxxxx.yyyy.zz.kafka.us-east-2.amazonaws.com:9098,b-1.xxxxx.yyyy.zz.kafka.us-east-2.amazonaws.com:9098,b-2.xxxxx.yyyy.zz.kafka.us-east-2.amazonaws.com:9098
    database.server.name=salesdb-server
    database.history.producer.sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler
    database.history.consumer.sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler
    database.history.consumer.security.protocol=SASL_SSL
    database.port=3306
    include.schema.changes=true
    database.hostname=local.mysql.internal
    database.password=xxxxxx
    table.include.list=salesdb.SALES_ORDER,salesdb.SALES_ORDER_ALL,salesdb.CUSTOMER
    database.history.consumer.sasl.mechanism=AWS_MSK_IAM
    database.include.list=salesdb
    

  16. Update the following connector configuration:
    database.user=master
    database.hostname=local.mysql.internal
    database.password=<MySQLMasterUserPassword>
    

  17. For Capacity type, choose Provisioned.
  18. For MCU count per worker, enter 1.
  19. For Number of workers, enter 1.
  20. Select Use the MSK default configuration.
  21. In the Access Permissions section, on the Choose service role menu, choose MSK-Connect-PrivateDNS-MySQLConnector*, then choose Next.
  22. In the Security section, keep the default settings.
  23. In the Logs section, select Deliver to Amazon CloudWatch logs.
  24. Choose visit Amazon CloudWatch Logs console.
  25. Under Logs in the navigation pane, choose Log group.
  26. Choose Create log group.
  27. Enter the log group name, retention settings, and tags, then choose Create.
  28. Return to the connector creation page and choose Browse log group.
  29. Choose the AmazonMSKConnect log group, then choose Next.
  30. Review the configurations and choose Create connector.

Wait for the connector creation process to complete (about 10–15 minutes).

The MSK Connect connector is now up and running. You can log in to the MySQL database using your user ID and make a couple of changes to records in the CUSTOMER table. MSK Connect will receive the CDC records, and the updates will be available in the MSK CUSTOMER topic (salesdb-server.salesdb.CUSTOMER).
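For example, an update like the following (a hypothetical change consistent with the CDC output shown later in this post) generates an update event on that topic:

mysql -umaster -p<MySQLMasterUserPassword> salesdb \
    -e "UPDATE CUSTOMER SET MKTSEGMENT = 'Market Segment 10' WHERE CUST_ID = 2000;"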

Consume messages from the MSK topic

To consume messages from the MSK topic, run the Kafka consumer on the MSK_Client EC2 instance available in the MSK VPC.

  1. SSH to the MSK_Client EC2 instance. The MSK_Client instance has the required Kafka client libraries, Amazon MSK IAM JAR file, client.properties file, and an instance profile attached to it, along with the appropriate IAM role using the CloudFormation template.
  2. Add the MSKClientSG security group as the source for the MSK security group with the following properties:
    • For Type, choose All Traffic.
    • For Source, choose Custom and MSK Security Group.

    Now you’re ready to consume data.

  3. To list the topics, run the following command:
    ./kafka-topics.sh --bootstrap-server <BootstrapServerString> --command-config client.properties --list

  4. To consume data from the salesdb-server.salesdb.CUSTOMER topic, use the following command:
    ./kafka-console-consumer.sh --bootstrap-server <BootstrapServerString> --consumer.config client.properties --topic salesdb-server.salesdb.CUSTOMER --from-beginning

Run the Kafka consumer on your EC2 machine and you will see messages similar to the following:

Struct{after=Struct{CUST_ID=1998.0,NAME=Customer Name 1998,MKTSEGMENT=Market Segment 3},source=Struct{version=1.9.5.Final,connector=mysql,name=salesdb-server,ts_ms=1678099992174,snapshot=true,db=salesdb,table=CUSTOMER,server_id=0,file=binlog.000001,pos=43298383,row=0},op=r,ts_ms=1678099992174}
Struct{after=Struct{CUST_ID=1999.0,NAME=Customer Name 1999,MKTSEGMENT=Market Segment 7},source=Struct{version=1.9.5.Final,connector=mysql,name=salesdb-server,ts_ms=1678099992174,snapshot=true,db=salesdb,table=CUSTOMER,server_id=0,file=binlog.000001,pos=43298383,row=0},op=r,ts_ms=1678099992174}
Struct{after=Struct{CUST_ID=2000.0,NAME=Customer Name 2000,MKTSEGMENT=Market Segment 9},source=Struct{version=1.9.5.Final,connector=mysql,name=salesdb-server,ts_ms=1678099992174,snapshot=last,db=salesdb,table=CUSTOMER,server_id=0,file=binlog.000001,pos=43298383,row=0},op=r,ts_ms=1678099992174}
Struct{before=Struct{CUST_ID=2000.0,NAME=Customer Name 2000,MKTSEGMENT=Market Segment 9},after=Struct{CUST_ID=2000.0,NAME=Customer Name 2000,MKTSEGMENT=Market Segment10},source=Struct{version=1.9.5.Final,connector=mysql,name=salesdb-server,ts_ms=1678100372000,db=salesdb,table=CUSTOMER,server_id=1,file=binlog.000001,pos=43298616,row=0,thread=67},op=u,ts_ms=1678100372612}

While testing the application, records with CUST_ID 1998, 1999, and 2000 were updated, and these records are available in the logs.

Clean up

It's always a good practice to clean up all the resources created as part of this post to avoid additional cost. To clean up your resources, delete the MSK cluster, MSK Connect connector, EC2 instances, DNS server, bastion host, S3 bucket, VPCs, subnets, and CloudWatch logs.

Additionally, clean up all other AWS resources that you created using AWS CloudFormation. You can delete these resources on the AWS CloudFormation console by deleting the stack.

Conclusion

In this post, we discussed the process of setting up MSK Connect using a private DNS. This feature allows you to configure connectors to reference public or private domain names.

We were able to receive the initial load and CDC records from a MySQL database hosted in a separate VPC whose DNS name is not accessible or resolvable externally. MSK Connect was able to connect to the MySQL database and consume the records using the MSK Connect private DNS feature. The custom DHCP option set was attached to the VPC, which ensured DNS resolution was performed using the local DNS server instead of Route 53.

With the MSK Connect private DNS support feature, you can keep your databases, data warehouses, and systems such as secret managers that work within your VPC inaccessible to the internet, while still connecting to them from MSK Connect and complying with your corporate security posture.

To learn more and get started, refer to Private DNS for MSK Connect.


About the author

Amar is a Senior Solutions Architect at AWS based in the UK. He works with power, utilities, manufacturing, and automotive customers on strategic implementations, specializing in AWS streaming and advanced data analytics solutions to drive optimal business outcomes.

SmugMug’s durable search pipelines for Amazon OpenSearch Service

Post Syndicated from Lee Shepherd original https://aws.amazon.com/blogs/big-data/smugmugs-durable-search-pipelines-for-amazon-opensearch-service/

SmugMug operates two very large online photo platforms, SmugMug and Flickr, enabling more than 100 million customers to safely store, search, share, and sell tens of billions of photos. Customers uploading and searching through decades of photos helped turn search into critical infrastructure, growing steadily since SmugMug first used Amazon CloudSearch in 2012, followed by Amazon OpenSearch Service since 2018, after reaching billions of documents and terabytes of search storage.

Here, Lee Shepherd, SmugMug Staff Engineer, shares the search architecture SmugMug uses to publish, backfill, and mirror live traffic to multiple clusters. SmugMug uses these pipelines to benchmark, validate, and migrate to new configurations, including Graviton-based r6gd.2xlarge instances (from i3.2xlarge), along with testing Amazon OpenSearch Serverless. We cover three pipelines used for publishing, backfilling, and querying without introducing spiky, unrealistic traffic patterns, and without any impact on production services.

There are two main architectural pieces critical to the process:

  • A durable source of truth for index data. It’s best practice and part of our backup strategy to have a durable store beyond the OpenSearch index, and Amazon DynamoDB provides scalability and integration with AWS Lambda that simplifies a lot of the process. We use DynamoDB for other non-search services, so this was a natural fit.
  • A Lambda function for publishing data from the source of truth into OpenSearch. Using function aliases helps run multiple configurations of the same Lambda function at the same time and is key to keeping data in sync.

Publishing

The publishing pipeline is driven from events like a user entering keywords or captions, new uploads, or label detection through Amazon Rekognition. These events are processed, combining data from a few other asset stores like Amazon Aurora MySQL Compatible Edition and Amazon Simple Storage Service (Amazon S3), before writing a single item into DynamoDB.

Writing to DynamoDB invokes a Lambda publishing function, through the DynamoDB Streams Kinesis Adapter, that takes a batch of updated items from DynamoDB and indexes them into OpenSearch. There are other benefits to using the DynamoDB Streams Kinesis Adapter such as reducing the number of concurrent Lambdas required.

The publishing Lambda function uses environment variables to determine which OpenSearch domain and index to publish to. A production alias is configured to write to the production OpenSearch domain, triggered off of the DynamoDB table or Kinesis stream.

When testing new configurations or migrating, a migration alias is configured to write to the new OpenSearch domain but use the same trigger as the production alias. This enables dual indexing of data to both OpenSearch Service domains simultaneously.
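The post doesn't include the exact tooling SmugMug uses to wire this up, but as a sketch of the idea, a new alias can point at a published version whose environment variables reference the new domain, and that alias can then be subscribed to the same stream. The function name, environment variable names, and the use of a plain DynamoDB Streams event source mapping below are assumptions for illustration:

# Point $LATEST at the new (migration) domain, snapshot it as a version, and alias it.
aws lambda update-function-configuration \
    --function-name publish-to-search \
    --environment "Variables={OPENSEARCH_ENDPOINT=<new domain endpoint>,OPENSEARCH_INDEX=<new index>}"

aws lambda publish-version --function-name publish-to-search

aws lambda create-alias \
    --function-name publish-to-search \
    --name migration \
    --function-version <version number returned above>

# Subscribe the migration alias to the same DynamoDB stream that drives production.
aws lambda create-event-source-mapping \
    --function-name publish-to-search:migration \
    --event-source-arn <DynamoDB stream ARN> \
    --starting-position LATEST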

Here’s an example of the DynamoDB table schema:

 "Id": 123456,  // partition key
 "Fields": {
  "format": "JPG",
  "height": 1024,
  "width": 1536,
  ...
 },
 "LastUpdated": 1600107934,

The ‘LastUpdated’ value is used as the document version when indexing, allowing OpenSearch to reject any out-of-order updates.
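In OpenSearch terms, this maps to external versioning on the index request. A minimal sketch, assuming a hypothetical index name of photos and the document shown above (authentication omitted for brevity):

curl -XPUT "https://<domain endpoint>/photos/_doc/123456?version=1600107934&version_type=external" \
    -H 'Content-Type: application/json' \
    -d '{"format":"JPG","height":1024,"width":1536}'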

Backfilling

Now that changes are being published to both domains, the new domain (index) needs to be backfilled with historical data. To backfill a newly created index, a combination of Amazon Simple Queue Service (Amazon SQS) and DynamoDB is used. A script populates an SQS queue with messages that contain instructions for parallel scanning a segment of the DynamoDB table.

The SQS queue launches a Lambda function that reads the message instructions, fetches a batch of items from the corresponding segment of the DynamoDB table, and writes them into an OpenSearch index. New messages are written to the SQS queue to keep track of progress through the segment. After the segment completes, no more messages are written to the SQS queue and the process stops itself.

Concurrency is determined by the number of segments, with additional controls provided by Lambda concurrency scaling. SmugMug is able to index more than 1 billion documents per hour on their OpenSearch configuration while incurring zero impact to the production domain.
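Each worker effectively performs a segmented DynamoDB Scan. The equivalent call, expressed with the AWS CLI as a sketch (the segment numbers and page size are illustrative), looks like this:

# Each SQS message maps to a call like this; the Lambda continues from the
# lastEvaluatedKey of the previous page by passing it as the ExclusiveStartKey.
aws dynamodb scan \
    --table-name <table name> \
    --segment 3 \
    --total-segments 16 \
    --max-items 500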

A Node.js AWS SDK based script is used to seed the SQS queue. Here's a snippet of the SQS configuration script's options:

Usage: queue_segments [options]

Options:
--search-endpoint <url>  OpenSearch endpoint url
--sqs-url <url>          SQS queue url
--index <string>         OpenSearch index name
--table <string>         DynamoDB table name
--key-name <string>      DynamoDB table partition key name
--segments <int>         Number of parallel segments

Along with the format of the resulting SQS message:

{
  searchEndpoint: opts.searchEndpoint,
  sqsUrl: opts.sqsUrl,
  table: opts.table,
  keyName: opts.keyName,
  index: opts.index,
  segment: i,
  totalSegments: opts.segments,
  exclusiveStartKey: <lastEvaluatedKey from previous iteration>
}

As each segment is processed, the ‘lastEvaluatedKey’ from the previous iteration is added to the message as the ‘exclusiveStartKey’ for the next iteration.

Mirroring

Last, the mirroring pipeline works by sending each OpenSearch query to an SQS queue in addition to our production domain. The SQS queue launches a Lambda function that replays the query to the replica domain. The search results from these requests are not sent to any user, but they allow us to replicate production load on the OpenSearch domain under test without impact to production systems or customers.

Conclusion

When evaluating a new OpenSearch domain or configuration, the main metrics we are interested in are query latency performance, namely the took latencies (the query execution time reported by OpenSearch) and, most importantly, the latencies for searching. In our move to Graviton-based R6gd instances, we saw about 40 percent lower P50-P99 latencies, along with similar gains in CPU usage compared to i3 instances (ignoring Graviton's lower costs). Another welcome benefit was the more predictable and monitorable JVM memory pressure with the garbage collection changes from the addition of G1GC on R6gd and other new instances.

Using this pipeline, we’re also testing OpenSearch Serverless and finding its best use-cases. We’re excited about that service and fully intend to have an entirely serverless architecture in time. Stay tuned for results.


About the Authors

Lee Shepherd is a SmugMug Staff Software Engineer

Aydn Bekirov is an Amazon Web Services Principal Technical Account Manager

Maintaining a local copy of your data in AWS Local Zones

Post Syndicated from Macey Neff original https://aws.amazon.com/blogs/compute/maintaining-a-local-copy-of-your-data-in-aws-local-zones/

This post is written by Leonardo Solano, Senior Hybrid Cloud SA and Obed Gutierrez, Solutions Architect, Enterprise.

This post covers data replication strategies to back up your data into AWS Local Zones. These strategies include database replication, file-based and object storage replication, and partner solutions for Amazon Elastic Compute Cloud (Amazon EC2).

Customers running workloads in AWS Regions are likely to require a copy of their data in their operational location for either their backup strategy or data residency requirements. To help with these requirements, you can use Local Zones.

Local Zones is an AWS infrastructure deployment that places compute, storage, database, and other select AWS services close to large population and industry centers. With Local Zones, customers can build and deploy workloads to comply with state and local data residency requirements in sectors such as healthcare, financial services, gaming, and government.

Solution overview

This post assumes the database source is Amazon Relational Database Service (Amazon RDS). To back up an Amazon RDS database to Local Zones, there are three options:

  1. AWS Database Migration Service (AWS DMS)
  2. AWS DataSync
  3. Backup to Amazon Simple Storage Service (Amazon S3)


Figure 1. Amazon RDS replication to Local Zones with AWS DMS

To replicate data, AWS DMS needs a source and a target database. The source database should be your existing Amazon RDS database. The target database is placed in an EC2 instance in the Local Zone. A replication job is created in AWS DMS, which maintains the source and target databases in sync. The replicated database in the Local Zone can be accessed through a VPN. Your database administrator can directly connect to the database engine with your preferred tool.

With this architecture, you can maintain a locally accessible copy of your databases, allowing you to comply with regulatory requirements.

Prerequisites

The following prerequisites are required before continuing:

  • An AWS Account with Administrator permissions;
  • Installation of the latest version of AWS Command Line Interface (AWS CLI v2);
  • An Amazon RDS database.

Walkthrough

1. Enabling Local Zones

First, you must enable Local Zones. Make sure that the intended Local Zone is parented to the AWS Region where the environment is running. Edit the commands to match your parameters: group-name refers to your Local Zone group, and region is the Region identifier to use.

aws ec2 modify-availability-zone-group \
  --region us-east-1 \
  --group-name us-east-1-qro-1 \
  --opt-in-status opted-in

If you have an error when calling the ModifyAvailabilityZoneGroup operation, you must sign up for the Local Zone.

After enabling the Local Zone, you must extend the VPC to the Local Zone by creating a subnet in the Local Zone:

aws ec2 create-subnet \
  --region us-east-1 \
  --availability-zone us-east-1-qro-1a \
  --vpc-id vpc-02a3eb6585example \
  --cidr-block my-subnet-cidr

If you need a step-by-step guide, refer to Getting started with AWS Local Zones. Enabling Local Zones is free of charge. Only deployed services in the Local Zone incur billing.

2. Set up your target database

Now that you have the Local Zone enabled with a subnet, set up your target database instance in the Local Zone subnet that you just created.

You can use AWS CLI to launch it as an EC2 instance:

aws ec2 run-instances \
  --region us-east-1 \
  --subnet-id subnet-08fc749671example \
  --instance-type t3.medium \
  --image-id ami-0abcdef123example \
  --security-group-ids sg-0b0384b66dexample \
  --key-name my-key-pair

You can verify that your EC2 instance is running with the following command:

aws ec2 describe-instances --filters "Name=availability-zone,Values=us-east-1-qro-1a" --query "Reservations[].Instances[].InstanceId"

Output:

 $ ["i-0cda255374example"]

Note that not all instance types are available in Local Zones. You can verify it with the following AWS CLI command:

aws ec2 describe-instance-type-offerings --location-type "availability-zone" \
--filters Name=location,Values=us-east-1-qro-1a --region us-east-1

Once you have your instance running in the Local Zone, you can install the database engine matching your source database. Here is an example of how to install MariaDB:

  1. Update all packages to the latest OS version: sudo yum update -y
  2. Install the MariaDB server on your instance (this also creates a systemd service): sudo yum install -y mariadb-server
  3. Enable the service created in the previous step: sudo systemctl enable mariadb
  4. Start the MariaDB service on your Amazon Linux instance: sudo systemctl start mariadb
  5. Set the root user password and improve your DB security: sudo mysql_secure_installation

You can confirm successful installation with these commands:

mysql -h localhost -u root -p
SHOW DATABASES;

3. Configure databases for replication

In order for AWS DMS to replicate ongoing changes, you must use change data capture (CDC), as well as set up your source and target database accordingly before replication:

Source database:

  • Make sure that the binary logs are available to AWS DMS:

 call mysql.rds_set_configuration('binlog retention hours', 24);

  • Set the binlog_format parameter to ROW.
  • Set the binlog_row_image parameter to Full.
  • If you are using a read replica as the source, set the log_slave_updates parameter to TRUE.

For detailed information, refer to Using a MySQL-compatible database as a source for AWS DMS, or sources for your migration if your database engine is different.
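If your source is Amazon RDS for MySQL, the binlog parameters above are set in a custom DB parameter group. The following AWS CLI sketch assumes your instance already uses a custom parameter group whose name you substitute:

aws rds modify-db-parameter-group \
    --db-parameter-group-name <your custom parameter group> \
    --parameters "ParameterName=binlog_format,ParameterValue=ROW,ApplyMethod=immediate" \
                 "ParameterName=binlog_row_image,ParameterValue=full,ApplyMethod=immediate"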

Target database:

  • Create a user for AWS DMS that has read/write privileges to the MySQL-compatible database. To create the necessary privileges, run the following commands.
CREATE USER '<user>'@'%' IDENTIFIED BY '<password>';
GRANT ALTER, CREATE, DROP, INDEX, INSERT, UPDATE, DELETE, SELECT ON <database>.* TO '<user>'@'%';
GRANT ALL PRIVILEGES ON awsdms_control.* TO '<user>'@'%';
  • Disable foreign keys on target tables, by adding the next command in the Extra connection attributes section of the AWS DMS console for your target endpoint.

Initstmt=SET FOREIGN_KEY_CHECKS=0;

  • Set the database parameter local_infile = 1 to enable AWS DMS to load data into the target database.
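On the self-managed MariaDB target, the last setting above can be applied at runtime; a sketch, assuming you connect as root over the VPN to the instance's private IP:

mysql -h <target database private IP> -u root -p -e "SET GLOBAL local_infile = 1;"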

4. Set up AWS DMS

Now that you have the Local Zone enabled with the target database ready and the source database configured, you can set up the AWS DMS replication instance.

Go to AWS DMS in the AWS Management Console, and under Migrate data, select Replication instances, then choose the Create replication instance button:

This opens the Create replication instance page, where you fill in the required parameters:

Note that High Availability is set to Single-AZ because this is a test workload; Multi-AZ is recommended for production workloads.

Refer to the AWS DMS replication instance documentation for details about how to size your replication instance.

Important note

To allow replication, make sure that you set up the replication instance in the VPC that your environment is running, and configure security groups from and to the source and target database.
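If you prefer to create the replication instance from the AWS CLI instead of the console, a sketch looks like the following; the instance class, storage size, subnet group, and security group are placeholders or illustrative values that you should size for your workload:

aws dms create-replication-instance \
    --replication-instance-identifier localzone-replication-instance \
    --replication-instance-class dms.t3.medium \
    --allocated-storage 50 \
    --vpc-security-group-ids <security group ID> \
    --replication-subnet-group-identifier <replication subnet group> \
    --no-multi-az \
    --region us-east-1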

Now you can create the DMS Source and Target endpoints:

5. Set up endpoints

Source endpoint:

In the AWS DMS console, select Endpoints, select the Create endpoint button, and select Source endpoint option. Then, fill the details required:

Make sure you select your RDS instance as the source by selecting the check box, as shown in the preceding figure. Also, provide access to the endpoint database details, such as the user and password.

You can test your endpoint connectivity before creating it, as shown in the following figure:

If your test is successful, then you can select the Create endpoint button.

Target endpoint:

In the same way as the Source in the console, select Endpoints, select the Create endpoint button, and select Target endpoint option, then enter the details required, as shown in the following figure:

In the Access to endpoint database section, select the Provide access information manually option, then add your Local Zone target database connection details as shown below. Note that the Server name value should be the IP address of your target database.

Make sure you go to the bottom of the page and configure Extra connection attributes in the Endpoint settings, as described in the Configure databases for replication section of this post:

Like the source endpoint, you can test your endpoint connection before creating it.

6. Create the replication task

Once the endpoints are ready, you can create the migration task to start the replication. Under the Migrate data section, select Database migration tasks, choose the Create task button, and configure your task:

Select Migrate existing data and replicate ongoing changes in the Migration type parameter.

Enable Task logs under Task Settings. This is recommended because it can help with troubleshooting.

In Table mappings, include the schema you want to replicate to the Local Zone database:
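A minimal selection rule for the table mapping might look like the following JSON, assuming the sample schema is named example, as in the validation step later in this post:

{
  "rules": [
    {
      "rule-type": "selection",
      "rule-id": "1",
      "rule-name": "include-example-schema",
      "object-locator": {
        "schema-name": "example",
        "table-name": "%"
      },
      "rule-action": "include"
    }
  ]
}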

Once you have defined Task Configuration, Task Settings, and Table Mappings, you can proceed to create your database migration task.

This will trigger your migration task. Now wait until the migration task completes successfully.

7. Validate replicated database

After the replication job completes the Full Load, proceed to validate at your target database. Connect to your target database and run the following commands:

USE example;
SHOW TABLES;

As a result, you should see the same tables as in the source database.

MySQL [example]> SHOW TABLES;
+----------------------------+
| Tables_in_example          |
+----------------------------+
| actor                      |
| address                    |
| category                   |
| city                       |
| country                    |
| customer                   |
| customer_list              |
| film                       |
| film_actor                 |
| film_category              |
| film_list                  |
| film_text                  |
| inventory                  |
| language                   |
| nicer_but_slower_film_list |
| payment                    |
| rental                     |
| sales_by_film_category     |
| sales_by_store             |
| staff                      |
| staff_list                 |
| store                      |
+----------------------------+
22 rows in set (0.06 sec)

If you see the same tables as in your source database, then congratulations, you're set! Now you can maintain and navigate a live copy of the database in the Local Zone for data residency purposes.

Clean up

When you have finished this tutorial, you can delete all the resources that have been deployed. You can do this in the Console or by running the following commands in the AWS CLI:

  1. Delete target DB:
    aws ec2 terminate-instances --instance-ids i-abcd1234
  2. Decommission AWS DMS:
    • Replication Task:
      aws dms delete-replication-task --replication-task-arn arn:aws:dms:us-east-1:111111111111:task:K55IUCGBASJS5VHZJIIEXAMPLE
    • Endpoints:
      aws dms delete-endpoint --endpoint-arn arn:aws:dms:us-east-1:111111111111:endpoint:OUJJVXO4XZ4CYTSEG5XEXAMPLE
    • Replication instance:
      aws dms delete-replication-instance --replication-instance-arn arn:aws:dms:us-east-1:111111111111:rep:T3OM7OUB5NM2LCVZF7JEXAMPLE
  3. Delete Local Zone subnet
    aws ec2 delete-subnet --subnet-id subnet-9example

Conclusion

Local Zones is a useful tool for running applications with low latency requirements or data residency regulations. In this post, you learned how to use AWS DMS to seamlessly replicate your data to Local Zones. With this architecture, you can efficiently maintain a local copy of your data in Local Zones and access it securely.

If you are interested in how to automate your workload deployments in Local Zones, make sure you check out this workshop.

Enabling highly available connectivity from on premises to AWS Local Zones

Post Syndicated from Macey Neff original https://aws.amazon.com/blogs/compute/enabling-highly-available-connectivity-from-on-premises-to-aws-local-zones/

This post is written by Leonardo Solano, Senior Hybrid Cloud SA, and Robert Belson, Developer Advocate.

Planning your network topology is a foundational requirement of the reliability pillar of the AWS Well-Architected Framework. REL02-BP02 defines how to provide redundant connectivity between private networks in the cloud and on-premises environments, using AWS Direct Connect for resilient, redundant connections, AWS Site-to-Site VPN, or AWS Direct Connect failing over to AWS Site-to-Site VPN. As more customers use a combination of on-premises environments, Local Zones, and AWS Regions, they have asked for guidance on how to extend this pillar of the AWS Well-Architected Framework to include Local Zones. For example, if you are on an application modernization journey, you may have existing Amazon EKS clusters that have dependencies on persistent on-premises data.

AWS Local Zones enables single-digit millisecond latency to power applications such as real-time gaming, live streaming, augmented and virtual reality (AR/VR), virtual workstations, and more. Local Zones can also help you meet data sovereignty requirements in regulated industries  such as healthcare, financial services, and the public sector. Additionally, enterprises can leverage a hybrid architecture and seamlessly extend their on-premises environment to the cloud using Local Zones. In the example above, you could extend Amazon EKS clusters to include node groups in a Local Zone (or multiple Local Zones) or on premises using AWS Outpost rack.

To provide connectivity between private networks in Local Zones and on-premises environments, customers typically consider Direct Connect or software VPNs available in the AWS Marketplace. This post provides a reference implementation to eliminate single points of failure in connectivity while offering automatic network impairment detection and intelligent failover using both Direct Connect and software VPNs from the AWS Marketplace. Moreover, this solution minimizes latency by ensuring traffic does not hairpin through the parent AWS Region to the Local Zone.

Solution overview

In Local Zones, all architectural patterns based on AWS Direct Connect follow the same architecture as in AWS Regions and can be deployed using the AWS Direct Connect Resiliency Toolkit. As of the date of publication, Local Zones do not support AWS managed Site-to-Site VPN (view latest Local Zones features). Thus, for customers that have access to only a single Direct Connect location or require resiliency beyond a single connection, this post will demonstrate a solution using an AWS Direct Connect failover strategy with a software VPN appliance. You can find a range of third-party software VPN appliances as well as the throughput per VPN tunnel that each offering provides in the AWS Marketplace.

Prerequisites:

To get started, make sure that your account is opted in for Local Zones and configure the following:

  1. Extend a Virtual Private Cloud (VPC) from the Region to the Local Zone, with at least 3 subnets. Use Getting Started with AWS Local Zones as a reference.
    1. Public subnet in Local Zone (public-subnet-1)
    2. Private subnets in Local Zone (private-subnet-1 and private-subnet-2)
    3. Private subnet in the Region (private-subnet-3)
    4. Modify DNS attributes in your VPC, including both “enableDnsSupport” and “enableDnsHostnames”;
  2. Attach an Internet Gateway (IGW) to the VPC;
  3. Attach a Virtual Private Gateway (VGW) to the VPC;
  4. Create an interface VPC endpoint for the EC2 API, attached to private-subnet-3;
  5. Define the following routing tables (RTB):
    1. Private-subnet-1 RTB: enabling propagation for VGW;
    2. Private-subnet-2 RTB: enabling propagation for VGW;
    3. Public-subnet-1 RTB: with a default route with IGW-ID as the next hop;
  6. Configure a Direct Connect Private Virtual Interface (VIF) from your on-premises environment to Local Zones Virtual Gateway’s VPC. For more details see this post: AWS Direct Connect and AWS Local Zones interoperability patterns;
  7. Launch any software VPN appliance from AWS Marketplace on Public-subnet-1. In this blog post on simulating Site-to-Site VPN customer gateways using strongSwan, you can find an example that provides the steps to deploy a third-party software VPN in AWS Region;
  8. Capture the following parameters from your environment:
    1. Software VPN Elastic Network Interface (ENI) ID
    2. Private-subnet-1 RTB ID
    3. Probe IP, which must be an on-premises resource that can respond to Internet Control Message Protocol (ICMP) requests.

High level architecture

This architecture requires a utility Amazon Elastic Compute Cloud (Amazon EC2) instance in a private subnet (private-subnet-2) that sends ICMP probes over the Direct Connect connection. Once the utility instance detects lost packets to the on-premises network from the Local Zone, it initiates a failover by adding a static route with the on-premises CIDR range as the destination and the VPN appliance ENI ID as the next hop in the production private subnet (private-subnet-1), taking priority over the Direct Connect propagated route. Once the Direct Connect path is healthy again, this utility reverts to the default route over the original Direct Connect connection.

On-premises considerations

To add redundancy in the on-premises environment, you can use two routers running a First Hop Redundancy Protocol (FHRP), such as Hot Standby Router Protocol (HSRP) or Virtual Router Redundancy Protocol (VRRP). The router connected to the Direct Connect link has the highest priority, taking the primary role in the FHRP process, while the VPN router remains the secondary router. The FHRP failover relies on interface or protocol state, such as BGP, to trigger the switchover.


Figure 1. High level HA architecture for Software VPN


Figure 2. Failover by modifying the production subnet RTB

Step-by-step deployment

Create an IAM role with permissions to create and delete routes in your private-subnet-1 route table:

  1. Create ec2-role-trust-policy.json file on your local machine:
cat > ec2-role-trust-policy.json <<EOF
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "ec2.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
EOF
  2. Create your EC2 IAM role, such as my_ec2_role:
aws iam create-role --role-name my_ec2_role --assume-role-policy-document file://ec2-role-trust-policy.json
  3. Create a file with the necessary permissions to attach to the EC2 IAM role, and name it ec2-role-iam-policy.json (an example of what this policy file might contain appears after this list).
  4. Create the IAM policy and attach the policy to the IAM role my_ec2_role that you previously created:
aws iam create-policy --policy-name my-ec2-policy --policy-document file://ec2-role-iam-policy.json

aws iam attach-role-policy --policy-arn arn:aws:iam::<account_id>:policy/my-ec2-policy --role-name my_ec2_role
  5. Create an instance profile and attach the IAM role to it:
aws iam create-instance-profile --instance-profile-name my_ec2_instance_profile
aws iam add-role-to-instance-profile --instance-profile-name my_ec2_instance_profile --role-name my_ec2_role
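The contents of ec2-role-iam-policy.json are not shown above. A minimal sketch, assuming the utility instance only needs to create and delete routes as the monitoring script does, might look like the following; for least privilege, scope Resource down to the specific route table ARN:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:CreateRoute",
                "ec2:DeleteRoute",
                "ec2:DescribeRouteTables"
            ],
            "Resource": "*"
        }
    ]
}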

Launch and configure your utility instance

  1. Capture the Amazon Linux 2 AMI ID through CLI:
aws ec2 describe-images --filters "Name=name,Values=amzn2-ami-kernel-5.10-hvm-2.0.20230404.1-x86_64-gp2" | grep ImageId 

Sample output:

            "ImageId": "ami-069aabeee6f53e7bf",

  2. Create an EC2 key pair for the utility instance:
aws ec2 create-key-pair --key-name MyKeyPair --query 'KeyMaterial' --output text > MyKeyPair.pem
  3. Launch the utility instance in the Local Zone (replace the variables with your account and environment parameters):
aws ec2 run-instances --image-id ami-069aabeee6f53e7bf --key-name MyKeyPair --count 1 --instance-type t3.medium --subnet-id <private-subnet-2-id> --iam-instance-profile Name=my_ec2_instance_profile

Deploy failover automation shell script on the utility instance

  1. Create the following shell script in your utility instance (replace the health check variables with your environment values):
cat > vpn_monitoring.sh <<'EOF'
#!/bin/bash
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
# Health Check variables
Wait_Between_Pings=2
RTB_ID=<private-subnet-1-rtb-id>
PROBE_IP=<probe-ip>
remote_cidr=<remote-cidr>
GW_ENI_ID=<software-vpn-eni_id>
Active_path=DX

echo `date` "-- Starting VPN monitor"

while true; do
  # Check health of main path to the remote probe IP
  pingresult=`ping -c 3 -W 1 $PROBE_IP | grep time= | wc -l`
  # Check to see if any of the health checks succeeded
  if [ "$pingresult" == "0" ]; then
    if [ "$Active_path" == "DX" ]; then
      echo `date` "-- Direct Connect failed. Failing over vpn"
      aws ec2 create-route --route-table-id $RTB_ID --destination-cidr-block $remote_cidr --network-interface-id $GW_ENI_ID --region us-east-1
      Active_path=VPN
      DX_tries=10
      echo "probe_ip: unreachable - active_path: vpn"
    else
      echo "probe_ip: unreachable - active_path: vpn"
    fi
  else
    if [ "$Active_path" == "VPN" ]; then
      let DX_tries=DX_tries-1
      if [ "$DX_tries" == "0" ]; then
        echo `date` "-- failing back to Direct Connect"
        aws ec2 delete-route --route-table-id $RTB_ID --destination-cidr-block $remote_cidr --region us-east-1
        Active_path=DX
        echo "probe_ip: reachable - active_path: Direct Connect"
      else
        echo "probe_ip: reachable - active_path: vpn"
      fi
    else
      echo "probe_ip: reachable - active_path: Direct Connect"
    fi
  fi
  # Wait before the next probe cycle
  sleep $Wait_Between_Pings
done
EOF
  2. Modify permissions to your shell script file:
chmod +x vpn_monitoring.sh
  3. Start the shell script:
./vpn_monitoring.sh

Test the environment


Figure 3. Failover process between Direct Connect and software VPN

Simulate failure of the Direct Connect link, breaking the available path from the Local Zone to the on-premises environment. You can simulate the failure using the failure test feature in the Direct Connect console.
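The same BGP failover test can be started from the AWS CLI; a sketch, assuming you substitute your private virtual interface ID:

aws directconnect start-bgp-failover-test \
    --virtual-interface-id <private VIF ID> \
    --test-duration-in-minutes 10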


Figure 4. Bringing BGP session down


Figure 5. Setting the failure time

In the utility instance you will see the following logs:

Thu Sep 21 14:39:34 UTC 2023 -- Direct Connect failed. Failing over vpn

The shell script detects packet loss through ICMP probes against a probe IP destination on premises, triggering the failover process. As a result, it makes an API call (aws ec2 create-route) to AWS through the EC2 interface endpoint.

The script creates a static route in the private-subnet-1-RTB toward the on-premises CIDR, with the VPN appliance's elastic network interface (ENI) ID as the next hop.


Figure 6. private-subnet-1-RTB during the test

The FHRP mechanism detects the failure in the Direct Connect link and reduces the FHRP priority on this path, which triggers the failover to the secondary link through the VPN path.

Once you cancel the test or the test finishes, the failback procedure will revert the private-subnet-1 route table to its initial state, resulting in the following logs to be emitted by the utility instance:

Thu Sep 21 14:42:34 UTC 2023 -- failing back to Direct Connect


Figure 7. private-subnet-1 route table initial state

Cleaning up

To clean up your AWS based resources, run following AWS CLI commands:

aws ec2 terminate-instances --instance-ids <your-utility-instance-id>
aws iam remove-role-from-instance-profile --instance-profile-name my_ec2_instance_profile --role-name my_ec2_role
aws iam delete-instance-profile --instance-profile-name my_ec2_instance_profile
aws iam detach-role-policy --role-name my_ec2_role --policy-arn arn:aws:iam::<account_id>:policy/my-ec2-policy
aws iam delete-role --role-name my_ec2_role

Conclusion

This post demonstrates how to create a failover strategy for Local Zones using the same resilience mechanisms already established in the AWS Regions. By leveraging Direct Connect and software VPNs, you can achieve high availability in scenarios where you are constrained to a single Direct Connect location due to geographical limitations. In the architectural pattern illustrated in this post, the failover strategy relies on a utility instance with least-privileged permissions. The utility instance identifies network impairment and dynamically modifies your production route tables to keep connectivity established from a Local Zone to your on-premises location. The same mechanism automatically fails back from the software VPN to Direct Connect once the utility instance validates that the Direct Connect path is sufficiently reliable to avoid network flapping. To learn more about Local Zones, visit the AWS Local Zones user guide.

Migrate Microsoft Azure Synapse Analytics to Amazon Redshift using AWS SCT

Post Syndicated from Ahmed Shehata original https://aws.amazon.com/blogs/big-data/migrate-microsoft-azure-synapse-analytics-to-amazon-redshift-using-aws-sct/

Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse that provides the flexibility to use provisioned or serverless compute for your analytical workloads. With Amazon Redshift Serverless and Query Editor v2, you can load and query large datasets in just a few clicks and pay only for what you use. The decoupled compute and storage architecture of Amazon Redshift enables you to build highly scalable, resilient, and cost-effective workloads. Many customers migrate their data warehousing workloads to Amazon Redshift and benefit from the rich capabilities it offers, such as the following:

  • Amazon Redshift seamlessly integrates with broader data, analytics, and AI or machine learning (ML) services on AWS, enabling you to choose the right tool for the right job. Modern analytics is much wider than SQL-based data warehousing. With Amazon Redshift, you can build lake house architectures and perform any kind of analytics, such as interactive analytics, operational analytics, big data processing, visual data preparation, predictive analytics, machine learning, and more.
  • You don’t need to worry about workloads such as ETL (extract, transform, and load), dashboards, ad-hoc queries, and so on interfering with each other. You can isolate workloads using data sharing, while using the same underlying datasets.
  • When users run many queries at peak times, compute seamlessly scales within seconds to provide consistent performance at high concurrency. You get 1 hour of free concurrency scaling capacity for 24 hours of usage. This free credit meets the concurrency demand of 97% of the Amazon Redshift customer base.
  • Amazon Redshift is straightforward to use with self-tuning and self-optimizing capabilities. You can get faster insights without spending valuable time managing your data warehouse.
  • Fault tolerance is built in. All data written to Amazon Redshift is automatically and continuously replicated to Amazon Simple Storage Service (Amazon S3). Any hardware failures are automatically replaced.
  • Amazon Redshift is simple to interact with. You can access data with traditional, cloud-native, containerized, serverless web services or event-driven applications. You can also use your favorite business intelligence (BI) and SQL tools to access, analyze, and visualize data in Amazon Redshift.
  • Amazon Redshift ML makes it straightforward for data scientists to create, train, and deploy ML models using familiar SQL. You can also run predictions using SQL.
  • Amazon Redshift provides comprehensive data security at no extra cost. You can set up end-to-end data encryption, configure firewall rules, define granular row-level and column-level security controls on sensitive data, and more.

In this post, we show how to migrate a data warehouse from Microsoft Azure Synapse to Redshift Serverless using AWS Schema Conversion Tool (AWS SCT) and AWS SCT data extraction agents. AWS SCT makes heterogeneous database migrations predictable by automatically converting the source database code and storage objects to a format compatible with the target database. Any objects that can’t be automatically converted are clearly marked so that they can be manually converted to complete the migration. AWS SCT can also scan your application code for embedded SQL statements and convert them.

Solution overview

AWS SCT uses a service account to connect to your Azure Synapse Analytics. First, we create a Redshift database into which Azure Synapse data will be migrated. Next, we create an S3 bucket. Then, we use AWS SCT to convert Azure Synapse schemas and apply them to Amazon Redshift. Finally, to migrate data, we use AWS SCT data extraction agents, which extract data from Azure Synapse, upload it into an S3 bucket, and copy it to Amazon Redshift.

The following diagram illustrates our solution architecture.

This walkthrough covers the following steps:

  1. Create a Redshift Serverless data warehouse.
  2. Create the S3 bucket and folder.
  3. Convert and apply the Azure Synapse schema to Amazon Redshift using AWS SCT:
    1. Connect to the Azure Synapse source.
    2. Connect to the Amazon Redshift target.
    3. Convert the Azure Synapse schema to a Redshift database.
    4. Analyze the assessment report and address the action items.
    5. Apply the converted schema to the target Redshift database.
  4. Migrate data from Azure Synapse to Amazon Redshift using AWS SCT data extraction agents:
    1. Generate trust and key stores (this step is optional).
    2. Install and configure the data extraction agent.
    3. Start the data extraction agent.
    4. Register the data extraction agent.
    5. Add virtual partitions for large tables (this step is optional).
    6. Create a local data migration task.
    7. Start the local data migration task.
  5. View data in Amazon Redshift.

Prerequisites

Before starting this walkthrough, you must have the following prerequisites:

Create a Redshift Serverless data warehouse

In this step, we create a Redshift Serverless data warehouse with a workgroup and namespace. A workgroup is a collection of compute resources and a namespace is a collection of database objects and users. To isolate workloads and manage different resources in Redshift Serverless, you can create namespaces and workgroups and manage storage and compute resources separately.

Follow these steps to create a Redshift Serverless data warehouse with a workgroup and namespace:

  1. On the Amazon Redshift console, choose the AWS Region that you want to use.
  2. In the navigation pane, choose Redshift Serverless.
  3. Choose Create workgroup.

  1. For Workgroup name, enter a name that describes the compute resources.

  1. Verify that the VPC is the same VPC as the EC2 instance that runs AWS SCT.
  2. Choose Next.
  3. For Namespace, enter a name that describes your dataset.
  4. In the Database name and password section, select Customize admin user credentials.
  5. For Admin user name, enter a user name of your choice (for example, awsuser).
  6. For Admin user password, enter a password of your choice (for example, MyRedShiftPW2022).

  1. Choose Next.

Note that data in the Redshift Serverless namespace is encrypted by default.

  1. In the Review and Create section, choose Create.

Now you create an AWS Identity and Access Management (IAM) role and set it as the default on your namespace. Note that there can only be one default IAM role.

  1. On the Redshift Serverless Dashboard, in the Namespaces / Workgroups section, choose the namespace you just created.
  2. On the Security and encryption tab, in the Permissions section, choose Manage IAM roles.
  3. Choose Manage IAM roles and choose Create IAM role.
  4. In the Specify an Amazon S3 bucket for the IAM role to access section, choose one of the following methods:
    1. Choose No additional Amazon S3 bucket to allow the created IAM role to access only the S3 buckets with names containing the word redshift.
    2. Choose Any Amazon S3 bucket to allow the created IAM role to access all S3 buckets.
    3. Choose Specific Amazon S3 buckets to specify one or more S3 buckets for the created IAM role to access. Then choose one or more S3 buckets from the table.
  5. Choose Create IAM role as default.
  6. Capture the endpoint for the Redshift Serverless workgroup you just created.
  7. On the Redshift Serverless Dashboard, in the Namespaces / Workgroups section, choose the workgroup you just created.
  8. In the General information section, copy the endpoint.
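
You can also retrieve the endpoint with the AWS CLI; the workgroup name below is a placeholder for the one you created:

aws redshift-serverless get-workgroup \
  --workgroup-name my-workgroup \
  --query 'workgroup.endpoint.address' \
  --output text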

Create the S3 bucket and folder

During the data migration process, AWS SCT uses Amazon S3 as a staging area for the extracted data. Follow these steps to create an S3 bucket:

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Choose Create bucket.
  3. For Bucket name, enter a unique DNS-compliant name for your bucket (for example, uniquename-as-rs).

For more information about bucket names, refer to Bucket naming rules.

  1. For AWS Region, choose the Region in which you created the Redshift Serverless workgroup.
  2. Choose Create bucket.

  1. Choose Buckets in the navigation pane and navigate to the S3 bucket you just created (uniquename-as-rs).
  2. Choose Create folder.
  3. For Folder name, enter incoming.
  4. Choose Create folder.
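
The same bucket and folder can also be created from the AWS CLI; the bucket name and Region below are placeholders matching the earlier example:

# Create the staging bucket in the same Region as the Redshift Serverless workgroup
aws s3 mb s3://uniquename-as-rs --region us-east-1
# Create the incoming/ folder (in Amazon S3, a folder is a zero-byte object whose key ends with /)
aws s3api put-object --bucket uniquename-as-rs --key incoming/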

Convert and apply the Azure Synapse schema to Amazon Redshift using AWS SCT

To convert the Azure Synapse schema to Amazon Redshift format, we use AWS SCT. Start by logging in to the EC2 instance that you created previously and launch AWS SCT.

Connect to the Azure Synapse source

Complete the following steps to connect to the Azure Synapse source:

  1. On the File menu, choose Create New Project.
  2. Choose a location to store your project files and data.
  3. Provide a meaningful but memorable name for your project (for example, Azure Synapse to Amazon Redshift).
  4. To connect to the Azure Synapse source data warehouse, choose Add source.
  5. Choose Azure Synapse and choose Next.
  6. For Connection name, enter a name (for example, olap-azure-synapse).

AWS SCT displays this name in the object tree in the left pane.

  1. For Server name, enter your Azure Synapse server name.
  2. For SQL pool, enter your Azure Synapse pool name.
  3. Enter a user name and password.
  4. Choose Test connection to verify that AWS SCT can connect to your source Azure Synapse project.
  5. When the connection is successfully validated, choose OK and Connect.

Connect to the Amazon Redshift target

Follow these steps to connect to Amazon Redshift:

  1. In AWS SCT, choose Add target.
  2. Choose Amazon Redshift, then choose Next.
  3. For Connection name, enter a name to describe the Amazon Redshift connection.

AWS SCT displays this name in the object tree in the right pane.

  1. For Server name, enter the Redshift Serverless workgroup endpoint you captured earlier.
  2. For Server port, enter 5439.
  3. For Database, enter dev.
  4. For User name, enter the user name you chose when creating the Redshift Serverless workgroup.
  5. For Password, enter the password you chose when creating the Redshift Serverless workgroup.
  6. Deselect Use AWS Glue.
  7. Choose Test connection to verify that AWS SCT can connect to your target Redshift workgroup.
  8. When the test is successful, choose OK.
  9. Choose Connect to connect to the Amazon Redshift target.

Alternatively, you can use connection values that are stored in AWS Secrets Manager.

Convert the Azure Synapse schema to a Redshift data warehouse

After you create the source and target connections, you will see the source Azure Synapse object tree in the left pane and the target Amazon Redshift object tree in the right pane. We then create mapping rules to describe the source-target pair for the Azure Synapse to Amazon Redshift migration.

Follow these steps to convert the Azure Synapse dataset to Amazon Redshift format:

  1. In the left pane, choose (right-click) the schema you want to convert.
  2. Choose Convert schema.
  3. In the dialog box, choose Yes.

When the conversion is complete, you will see a new schema created in the Amazon Redshift pane (right pane) with the same name as your Azure Synapse schema.

The sample schema we used has three tables; you can see these objects in Amazon Redshift format in the right pane. AWS SCT converts all the Azure Synapse code and data objects to Amazon Redshift format. You can also use AWS SCT to convert external SQL scripts, application code, or additional files with embedded SQL.

Analyze the assessment report and address the action items

AWS SCT creates an assessment report to assess the migration complexity. AWS SCT can convert the majority of code and database objects, but some objects may require manual conversion. AWS SCT highlights these objects in blue in the conversion statistics diagram and creates action items with a complexity attached to them.

To view the assessment report, switch from Main view to Assessment Report view as shown in the following screenshot.

The Summary tab shows objects that were converted automatically and objects that were not converted automatically. Green represents automatically converted objects or objects with simple action items. Blue represents medium and complex action items that require manual intervention.

The Action items tab shows the recommended actions for each conversion issue. If you choose an action item from the list, AWS SCT highlights the object that the action item applies to.

The report also contains recommendations for how to manually convert the schema item. For example, after the assessment runs, detailed reports for the database and schema show you the effort required to design and implement the recommendations for converting action items. For more information about deciding how to handle manual conversions, see Handling manual conversions in AWS SCT. AWS SCT completes some actions automatically while converting the schema to Amazon Redshift; objects with such actions are marked with a red warning sign.

You can evaluate and inspect the individual object DDL by selecting it in the right pane, and you can also edit it as needed. In the following example, AWS SCT modifies the ID column data type from decimal(3,0) in Azure Synapse to the smallint data type in Amazon Redshift.

Apply the converted schema to the target Redshift data warehouse

To apply the converted schema to Amazon Redshift, select the converted schema in the right pane, right-click, and choose Apply to database.

Migrate data from Azure Synapse to Amazon Redshift using AWS SCT data extraction agents

AWS SCT extraction agents extract data from your source database and migrate it to the AWS Cloud. In this section, we configure AWS SCT extraction agents to extract data from Azure Synapse and migrate to Amazon Redshift. For this post, we install the AWS SCT extraction agent on the same Windows instance that has AWS SCT installed. For better performance, we recommend that you use a separate Linux instance to install extraction agents if possible. For very large datasets, AWS SCT supports the use of multiple data extraction agents running on several instances to maximize throughput and increase the speed of data migration.

Generate trust and key stores (optional)

You can use Secure Socket Layer (SSL) encrypted communication with AWS SCT data extractors. When you use SSL, all data passed between the applications remains private and integral. To use SSL communication, you need to generate trust and key stores using AWS SCT. You can skip this step if you don’t want to use SSL. We recommend using SSL for production workloads.

Follow these steps to generate trust and key stores:

  1. In AWS SCT, choose Settings, Global settings, and Security.
  2. Choose Generate trust and key store.

  1. Enter a name and password for the trust and key stores.
  2. Enter a location to store them.
  3. Choose Generate, then choose OK.

Install and configure the data extraction agent

In the installation package for AWS SCT, you can find a subfolder called agents (\aws-schema-conversion-tool-1.0.latest.zip\agents). Locate and install the executable file with a name like aws-schema-conversion-tool-extractor-xxxxxxxx.msi.

In the installation process, follow these steps to configure AWS SCT Data Extractor:

  1. For Service port, enter the port number the agent listens on. It is 8192 by default.
  2. For Working folder, enter the path where the AWS SCT data extraction agent will store the extracted data.

The working folder can be on a different computer from the agent, and a single working folder can be shared by multiple agents on different computers.

  1. For Enter Redshift JDBC driver file or files, enter the location where you downloaded the Redshift JDBC drivers.
  2. For Add the Amazon Redshift driver, enter YES.
  3. For Enable SSL communication, enter yes. Enter No here if you don’t want to use SSL.
  4. Choose Next.

  1. For Trust store path, enter the storage location you specified when creating the trust and key store.
  2. For Trust store password, enter the password for the trust store.
  3. For Enable client SSL authentication, enter yes.
  4. For Key store path, enter the storage location you specified when creating the trust and key store.
  5. For Key store password, enter the password for the key store.
  6. Choose Next.

Start the data extraction agent

Use the following procedure to start extraction agents. Repeat this procedure on each computer that has an extraction agent installed.

Extraction agents act as listeners. When you start an agent with this procedure, the agent starts listening for instructions. You send the agents instructions to extract data from your data warehouse in a later section.

To start the extraction agent, navigate to the AWS SCT Data Extractor Agent directory. For example, in Microsoft Windows, use C:\Program Files\AWS SCT Data Extractor Agent\StartAgent.bat.

On the computer that has the extraction agent installed, run the command for your operating system from a command prompt or terminal window. To stop an agent, run the same command but replace StartAgent with StopAgent. To restart an agent, run the RestartAgent.bat file.

Note that you should have administrator access to run those commands.

Register the data extraction agent

Follow these steps to register the data extraction agent:

  1. In AWS SCT, change the view to Data Migration view, then choose Register.
  2. Select Redshift data agent, then choose OK.

  1. For Description, enter a name to identify the agent.
  2. For Host name, if you installed the extraction agent on the same workstation as AWS SCT, enter 0.0.0.0 to indicate local host. Otherwise, enter the host name of the machine on which the AWS SCT extraction agent is installed. It is recommended to install extraction agents on Linux for better performance.
  3. For Port, enter the number you used for the listening port (default 8192) when installing the AWS SCT extraction agent.
  4. Select Use SSL to encrypt AWS SCT connection to Data Extraction Agent.

  1. If you’re using SSL, navigate to the SSL tab.
  2. For Trust store, choose the trust store you created earlier.
  3. For Key store, choose the key store you created earlier.
  4. Choose Test connection.
  5. After the connection is validated successfully, choose OK and Register.

Create a local data migration task

To migrate data from Azure Synapse Analytics to Amazon Redshift, you create, run, and monitor the local migration task from AWS SCT. This step uses the data extraction agent to migrate data by creating a task.

Follow these steps to create a local data migration task:

  1. In AWS SCT, under the schema name in the left pane, choose (right-click) the table you want to migrate (for this post, we use the table tbl_currency).
  2. Choose Create Local task.

  1. Choose from the following migration modes:
    1. Extract the source data and store it on a local PC or virtual machine where the agent runs.
    2. Extract the data and upload it to an S3 bucket.
    3. Extract the data, upload it to Amazon S3, and copy it into Amazon Redshift. (We choose this option for this post.)

  1. On the Advanced tab, provide the extraction and copy settings.

  1. On the Source server tab, make sure you are using the current connection properties.

  1. On the Amazon S3 settings tab, for Amazon S3 bucket folder, provide the bucket and folder names of the S3 bucket you created earlier.

The AWS SCT data extraction agent uploads the data in those S3 buckets and folders before copying it to Amazon Redshift.

  1. Choose Test Task.

  1. When the task is successfully validated, choose OK, then choose Create.

Start the local data migration task

To start the task, choose Start or Restart on the Tasks tab.

First, the data extraction agent extracts data from Azure Synapse. Then the agent uploads data to Amazon S3 and launches a copy command to move the data to Amazon Redshift.

At this point, AWS SCT has successfully migrated data from the source Azure Synapse table to the Redshift table.

View data in Amazon Redshift

After the data migration task is complete, you can connect to Amazon Redshift and validate the data. Complete the following steps:

  1. On the Amazon Redshift console, navigate to the Query Editor v2.
  2. Open the Redshift Serverless workgroup you created.
  3. Choose Query data.

  1. For Database, enter a name for your database.
  2. For Authentication, select Federated user.
  3. Choose Create connection.

  1. Open a new editor by choosing the plus sign.
  2. In the editor, write a query to select from the schema name and table or view name you want to verify.
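
For example, assuming the sample tbl_currency table was migrated into a schema named after your Azure Synapse schema, a query like the following verifies the data. You can paste the SELECT statement into the editor, or run it through the Amazon Redshift Data API from the AWS CLI (the workgroup name, schema name, and statement ID are placeholders):

aws redshift-data execute-statement \
  --workgroup-name my-workgroup \
  --database dev \
  --sql "SELECT * FROM my_synapse_schema.tbl_currency LIMIT 10;"

# Fetch the results using the Id returned by the previous command
aws redshift-data get-statement-result --id <statement-id>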

You can explore the data, run ad-hoc queries, and make visualizations, charts, and views.

The following screenshot is the view of the source Azure Synapse dataset we used in this post.

Clean up

Follow the steps in this section to clean up any AWS resources you created as part of this post.

Stop the EC2 instance

Follow these steps to stop the EC2 instance:

  1. On the Amazon EC2 console, in the navigation pane, choose Instances.
  2. Select the instance you created.
  3. Choose Instance state, then choose Terminate instance.
  4. Choose Terminate when prompted for confirmation.

Delete the Redshift Serverless workgroup and namespace

Follow these steps to delete the Redshift Serverless workgroup and namespace:

  1. On the Redshift Serverless Dashboard, in the Namespaces / Workgroups section, choose the workgroup you created.
  2. On the Actions menu, choose Delete workgroup.
  3. Select Delete the associated namespace.
  4. Deselect Create final snapshot.
  5. Enter delete in the confirmation text box and choose Delete.

Delete the S3 bucket

Follow these steps to delete the S3 bucket:

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Choose the bucket you created.
  3. Choose Delete.
  4. To confirm deletion, enter the name of the bucket.
  5. Choose Delete bucket.
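
If the bucket still contains migration staging data, you can also remove it from the AWS CLI; the bucket name is the placeholder used earlier, and --force deletes all objects along with the bucket:

aws s3 rb s3://uniquename-as-rs --force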

Conclusion

Migrating a data warehouse can be a challenging, complex, and yet rewarding project. AWS SCT reduces the complexity of data warehouse migrations. This post discussed how a data migration task extracts, downloads, and migrates data from Azure Synapse to Amazon Redshift. The solution we presented performs a one-time migration of database objects and data. Data changes made in Azure Synapse when the migration is in progress won’t be reflected in Amazon Redshift. When data migration is in progress, put your ETL jobs to Azure Synapse on hold or rerun the ETL jobs by pointing to Amazon Redshift after the migration. Consider using the best practices for AWS SCT.

To get started, download and install AWS SCT, sign in to the AWS Management Console, check out Redshift Serverless, and start migrating!


About the Authors

Ahmed Shehata is a Senior Analytics Specialist Solutions Architect at AWS based in Toronto. He has more than two decades of experience helping customers modernize their data platforms. Ahmed is passionate about helping customers build efficient, performant, and scalable analytic solutions.

Jagadish Kumar is a Senior Analytics Specialist Solutions Architect at AWS focused on Amazon Redshift. He is deeply passionate about Data Architecture and helps customers build analytics solutions at scale on AWS.

Anusha Challa is a Senior Analytics Specialist Solutions Architect at AWS focused on Amazon Redshift. She has helped many customers build large-scale data warehouse solutions in the cloud and on premises. Anusha is passionate about data analytics and data science and enabling customers to achieve success with their large-scale data projects.

Run Apache Hive workloads using Spark SQL with Amazon EMR on EKS

Post Syndicated from Amit Maindola original https://aws.amazon.com/blogs/big-data/run-apache-hive-workloads-using-spark-sql-with-amazon-emr-on-eks/

Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics at a massive scale. Using Spark SQL to run Hive workloads provides not only the simplicity of SQL-like queries but also taps into the exceptional speed and performance provided by Spark. Spark SQL is an Apache Spark module for structured data processing. One of its most popular use cases is to read and write Hive tables with connectivity to a persistent Hive metastore, supporting Hive SerDes and user-defined functions.

Starting from version 1.2.0, Apache Spark has supported queries written in HiveQL. HiveQL is a SQL-like language that produces data queries containing business logic, which can be converted to Spark jobs. However, this feature is only supported by YARN or standalone Spark mode. To run HiveQL-based data workloads with Spark on Kubernetes mode, engineers must embed their SQL queries into programmatic code such as PySpark, which requires additional effort to manually change code.

Amazon EMR on Amazon EKS provides a deployment option for Amazon EMR that you can use to run open-source big data frameworks on Amazon Elastic Kubernetes Service (Amazon EKS).

Amazon EMR on EKS releases 6.7.0 and later include the ability to run SparkSQL through the StartJobRun API. As a result of this enhancement, customers can now supply SQL entry-point files and run HiveQL queries directly as Spark jobs on EMR on EKS. The feature is available in all AWS Regions where EMR on EKS is available.

Use case

FINRA is one of the largest Amazon EMR customers that is running SQL-based workloads using the Hive on Spark approach. FINRA, Financial Industry Regulatory Authority, is a private sector regulator responsible for analyzing equities and option trading activity in the US. To look for fraud, market manipulation, insider trading, and abuse, FINRA’s technology group has developed a robust set of big data tools in the AWS Cloud to support these activities.

FINRA centralizes all its data in Amazon Simple Storage Service (Amazon S3) with a remote Hive metastore on Amazon Relational Database Service (Amazon RDS) to manage their metadata information. They use various AWS analytics services, such as Amazon EMR, to enable their analysts and data scientists to apply advanced analytics techniques to interactively develop and test new surveillance patterns and improve investor protection. To make these interactions more efficient and productive, FINRA modernized their hive workloads in Amazon EMR from its legacy Hive on MapReduce to Hive on Spark, which resulted in query performance gains between 50 and 80 percent.

Going forward, FINRA wants to further innovate the interactive big data platform by moving from a monolithic design pattern to a job-centric paradigm, so that it can fulfill future capacity requirements as its business grows. The capability of running Hive workloads using SparkSQL directly with EMR on EKS is one of the key enablers that helps FINRA continuously pursue that goal.

Additionally, EMR on EKS offers the following benefits to accelerate adoption:

  • Fine-grained access controls (IRSA) that are job-centric to harden customers’ security posture
  • Minimized adoption effort as it enables direct Hive query submission as a Spark job without code changes
  • Reduced run costs by consolidating multiple software versions for Hive or Spark, unifying artificial intelligence and machine learning (AI/ML) and extract, transform, and load (ETL) pipelines into a single environment
  • Simplified cluster management through multi-Availability Zone support and highly responsive autoscaling and provisioning
  • Reduced operational overhead by hosting multiple compute and storage types or CPU architectures (x86 & Arm64) in a single configuration
  • Increased application reusability and portability supported by custom docker images, which allows them to encapsulate all necessary dependencies

Running Hive SQL queries on EMR on EKS

Prerequisites

Make sure that you have AWS Command Line Interface (AWS CLI) version 1.25.70 or later installed. If you’re running AWS CLI version 2, you need version 2.7.31 or later. Use the following command to check your AWS CLI version:

aws --version

If necessary, install or update the latest version of the AWS CLI.

Solution overview

To get started, let’s look at the following diagram. It illustrates a high-level architectural design and different services that can be used in the Hive workload. To match FINRA’s use case, we chose an Amazon RDS database as the remote Hive metastore. Alternatively, you can use AWS Glue Data Catalog as the metastore for Hive if needed. For more details, see the aws-samples GitHub project.

The minimum required infrastructure is:

  • An S3 bucket to store a Hive SQL script file
  • An Amazon EKS cluster with EMR on EKS enabled
  • An Amazon RDS for MySQL database in the same virtual private cloud (VPC) as the Amazon EKS cluster
  • A standalone Hive metastore service (HMS) running on the EKS cluster or a small Amazon EMR on EC2 cluster with the Hive application installed

To have a quick start, run the sample CloudFormation deployment. The infrastructure deployment includes the following resources:

Create a Hive script file

Store a few lines of Hive queries in a single file, then upload the file to your S3 bucket, which can be found in your AWS Management Console on the AWS CloudFormation Outputs tab. Search for the key value of CODEBUCKET as shown in the preceding screenshot. For a quick start, you can skip this step and use the sample file stored in s3://<YOUR_S3BUCKET>/app_code/job/set-of-hive-queries.sql. The following is a code snippet from the sample file:

-- drop database in case switch between different hive metastore

DROP DATABASE IF EXISTS hiveonspark CASCADE;
CREATE DATABASE hiveonspark;
USE hiveonspark;

--create hive managed table
DROP TABLE IF EXISTS testtable purge;
CREATE TABLE IF NOT EXISTS testtable (`key` INT, `value` STRING) using hive;
LOAD DATA LOCAL INPATH '/usr/lib/spark/examples/src/main/resources/kv1.txt' INTO TABLE testtable;
SELECT * FROM testtable WHERE key=238;

-- test1: add column
ALTER TABLE testtable ADD COLUMNS (`arrayCol` Array<int>);
-- test2: insert
INSERT INTO testtable VALUES 
(238,'val_238',array(1,3)),
(238,'val_238',array(2,3));
SELECT * FROM testtable WHERE key=238;
-- test3: UDF
CREATE TEMPORARY FUNCTION hiveUDF AS 'org.apache.hadoop.hive.ql.udf.generic.GenericUDTFExplode';
SELECT `key`,`value`,hiveUDF(arrayCol) FROM testtable WHERE key=238;
-- test4: CTAS table with parameter
DROP TABLE IF EXISTS ctas_testtable purge;
CREATE TABLE ctas_testtable 
STORED AS ORC
AS
SELECT * FROM testtable;
SELECT * FROM ctas_testtable WHERE key=${key_ID};
-- test5: External table mapped to S3
CREATE EXTERNAL TABLE IF NOT EXISTS amazonreview
( 
  marketplace string, 
  customer_id string, 
  review_id  string, 
  product_id  string, 
  product_parent  string, 
  product_title  string, 
  star_rating  integer, 
  helpful_votes  integer, 
  total_votes  integer, 
  vine  string, 
  verified_purchase  string, 
  review_headline  string, 
  review_body  string, 
  review_date  date, 
  year  integer
) 
STORED AS PARQUET 
LOCATION 's3://${S3Bucket}/app_code/data/toy/';
SELECT count(*) FROM amazonreview;

Submit the Hive script to EMR on EKS

First, set up the required environment variables. See the shell script post-deployment.sh:

stack_name='HiveEMRonEKS'
export VIRTUAL_CLUSTER_ID=$(aws cloudformation describe-stacks --stack-name $stack_name --query "Stacks[0].Outputs[?OutputKey=='VirtualClusterId'].OutputValue" --output text)
export EMR_ROLE_ARN=$(aws cloudformation describe-stacks --stack-name $stack_name --query "Stacks[0].Outputs[?OutputKey=='EMRExecRoleARN'].OutputValue" --output text)
export S3BUCKET=$(aws cloudformation describe-stacks --stack-name $stack_name --query "Stacks[0].Outputs[?OutputKey=='CODEBUCKET'].OutputValue" --output text)

Connect to the demo EKS cluster:

echo `aws cloudformation describe-stacks --stack-name $stack_name --query "Stacks[0].Outputs[?starts_with(OutputKey,'eksclusterEKSConfig')].OutputValue" --output text` | bash
kubectl get svc

Ensure the entryPoint path is correct, then submit the set-of-hive-queries.sql to EMR on EKS.

aws emr-containers start-job-run \
--virtual-cluster-id $VIRTUAL_CLUSTER_ID \
--name sparksql-test \
--execution-role-arn $EMR_ROLE_ARN \
--release-label emr-6.8.0-latest \
--job-driver '{
  "sparkSqlJobDriver": {
      "entryPoint": "s3://'$S3BUCKET'/app_code/job/set-of-hive-queries.sql",
      "sparkSqlParameters": "-hivevar S3Bucket='$S3BUCKET' -hivevar Key_ID=238"}}' \
--configuration-overrides '{
    "applicationConfiguration": [
      {
        "classification": "spark-defaults", 
        "properties": {
          "spark.sql.warehouse.dir": "s3://'$S3BUCKET'/warehouse/",
          "spark.hive.metastore.uris": "thrift://hive-metastore:9083"
        }
      }
    ], 
    "monitoringConfiguration": {
      "s3MonitoringConfiguration": {"logUri": "s3://'$S3BUCKET'/elasticmapreduce/emr-containers"}}}'

Note that the shell script referenced the set-of-hive-queries.sql Hive script file as an entry point script. It uses the sparkSqlJobDriver attribute, not the usual sparkSubmitJobDriver designed for Spark applications. In the sparkSqlParameters section, we pass in two variables, S3Bucket and key_ID, to the Hive script.

The property "spark.hive.metastore.uris": "thrift://hive-metastore:9083" sets a connection to a Hive Metastore Service (HMS) called hive-metastore, which is running as a Kubernetes service on the demo EKS cluster as shown in the follow screenshot. If you’re running the thrift service on Amazon EMR on EC2, the URI should be thrift://<YOUR_EMR_MASTER_NODE_DNS_NAME>:9083. If you chose AWS Glue Data Catalog as your Hive metastore, replace the entire property with "spark.hadoop.hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory".

Finally, check the job status using the kubectl command line tool: kubectl get po -n emr --watch
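
Alternatively, you can poll the job state with the AWS CLI; the job name matches the one used in the start-job-run command above:

aws emr-containers list-job-runs \
  --virtual-cluster-id $VIRTUAL_CLUSTER_ID \
  --query 'jobRuns[?name==`sparksql-test`].[id,state]' \
  --output table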

Expected output

  1. Go to the Amazon EMR console.
  2. Navigate to Virtual clusters in the side menu, then select the HiveDemo cluster. You can see an entry for the SparkSQL test job.
  3. Choose the Spark UI hyperlink to monitor each query’s duration and status on a web interface.
  4. To query the Amazon RDS based Hive metastore, you need a MySQL client tool installed. To make it easier, the sample CloudFormation template has installed the query tool on the master node of a small Amazon EMR on EC2 cluster.
  5. Find the EMR master node by running the following command:
aws ec2 describe-instances --filter Name=tag:project,Values=$stack_name Name=tag:aws:elasticmapreduce:instance-group-role,Values=MASTER --query 'Reservations[].Instances[].InstanceId[]'

  1. Go to the Amazon EC2 console and connect to the master node through the Session Manager.
  2. Before querying the MySQL RDS database (the Hive metastore), run the following commands on your local machine to get the credentials:
    stack_name='HiveEMRonEKS' 
    export secret_name=$(aws cloudformation describe-stacks --stack-name $stack_name --query "Stacks[0].Outputs[?OutputKey=='HiveSecretName'].OutputValue" --output text) 
    export HOST_NAME=$(aws secretsmanager get-secret-value --secret-id $secret_name --query SecretString --output text | jq -r '.host')
    export PASSWORD=$(aws secretsmanager get-secret-value --secret-id $secret_name --query SecretString --output text | jq -r '.password')
    export DB_NAME=$(aws secretsmanager get-secret-value --secret-id $secret_name --query SecretString --output text | jq -r '.dbname')
    export USER_NAME=$(aws secretsmanager get-secret-value --secret-id $secret_name --query SecretString --output text | jq -r '.username')
    echo -e "\n host: $HOST_NAME\n DB: $DB_NAME\n password: $PASSWORD\n username: $USER_NAME\n"
    

  3. After connecting through Session Manager, query the Hive metastore from your Amazon EMR master node.
    mysql -u admin -P 3306 -p -h <YOUR_HOST_NAME>
    Enter password:<YOUR_PASSWORD>
    
    # Query the metastore
    MySQL[(none)]> Use HiveEMRonEKS;
    MySQL[HiveEMRonEKS]> select * from DBS;
    MySQL[HiveEMRonEKS]> select * from TBLS;
    MySQL[HiveEMRonEKS]> exit;

  4. Validate the Hive tables (created by set-of-hive-queries.sql) through the interactive Hive CLI tool.

Note: Your query environment must have the Hive client tool installed and a connection to your Hive metastore or AWS Glue Data Catalog. For testing purposes, you can connect to the same Amazon EMR on EC2 master node and query your Hive tables. The EMR cluster has been preconfigured with the required setup.

sudo su
hive
hive> show databases;
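
Assuming the sample set-of-hive-queries.sql script ran unmodified, you can continue in the same Hive session with a few quick checks; the database and table names below come from that script:

hive> use hiveonspark;
hive> show tables;
hive> select count(*) from testtable;
hive> quit;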

Clean up

To avoid incurring future charges, delete the resources generated if you don’t need the solution anymore. Run the following cleanup script.

curl https://raw.githubusercontent.com/aws-samples/hive-emr-on-eks/main/deployment/app_code/delete_all.sh | bash

Go to the CloudFormation console and manually delete the remaining resources if needed.

Conclusion

Amazon EMR on EKS releases 6.7.0 and higher include a Spark SQL job driver so that you can directly run Spark SQL scripts via the StartJobRun API. Without any modifications to your existing Hive scripts, you can directly execute them as a SparkSQL job on Amazon EMR on EKS.

FINRA is one of the largest Amazon EMR customers. It runs over 400 Hive clusters for its analysts who need to interactively query multi-petabyte data sets. Modernizing its Hive workloads with SparkSQL gives FINRA a 50 to 80 percent query performance improvement. The support to run Spark SQL through the StartJobRun API in EMR on EKS has further enabled FINRA’s innovation in data analytics.

In this post, we demonstrated how to submit a Hive script to Amazon EMR on EKS and run it as a SparkSQL job. We encourage you to give it a try and are keen to hear your feedback and suggestions.


About the authors

Amit Maindola is a Senior Data Architect focused on big data and analytics at Amazon Web Services. He helps customers in their digital transformation journey and enables them to build highly scalable, robust, and secure cloud-based analytical solutions on AWS to gain timely insights and make critical business decisions.

Melody Yang is a Senior Big Data Solutions Architect for Amazon EMR at AWS. She is an experienced analytics leader working with AWS customers to provide best practice guidance and technical advice in order to assist their success in data transformation. Her areas of interests are open-source frameworks and automation, data engineering, and DataOps.

Unleash the power of Snapshot Management to take automated snapshots using Amazon OpenSearch Service

Post Syndicated from Prashant Agrawal original https://aws.amazon.com/blogs/big-data/unleash-the-power-of-snapshot-management-to-take-automated-snapshots-using-amazon-opensearch-service/

Data is the lifeblood of any organization, and the importance of protecting it cannot be overstated. Starting with OpenSearch v2.5 in Amazon OpenSearch Service, we introduced Snapshot Management, which automates the process of taking snapshots of your domain. Snapshot Management helps you create point-in-time backups of your domain using OpenSearch Dashboards, including both data and configuration settings (for visualizations and dashboards). You can use these snapshots to restore your cluster to a specific state, recover from potential failures, and even clone environments for testing or development purposes.

Before this release, to automate the process of taking snapshots, you needed to use the snapshot action of OpenSearch’s Index State Management (ISM) feature. With ISM, you could only back up a particular index. Automating backup for multiple indexes required you to write custom scripts or use external management tools. With Snapshot Management, you can automate snapshotting across multiple indexes to safeguard your data and ensure its durability and recoverability.

In this post, we share how to use Snapshot Management to take automated snapshots using OpenSearch Service.

Solution overview

We demonstrate the following high-level steps:

  1. Register a snapshot repository in OpenSearch Service (a one-time process).
  2. Configure a sample ISM policy to migrate the indexes from hot storage to the UltraWarm storage tier after the indexes meet a specific condition.
  3. Create a Snapshot Management policy to take an automated snapshot for all indexes present across different storage tiers within a domain.

As of this writing, Snapshot Management doesn’t support single snapshot creation for all indexes present across different storage tiers within OpenSearch Service. For example, if you try to create a snapshot on multiple indexes with * and some indexes are in the warm tier, the snapshot creation will fail.

To overcome this limitation, you can use index aliases, with one index alias for each type of storage tier. For example, every new index created in the cluster will belong to the hot alias. When the index is moved to the UltraWarm tier via ISM, the alias for the index will be modified to warm, and the index will be removed from the hot alias.

Register a manual snapshot repository

To register a manual snapshot repository, you must create and configure an Amazon Simple Storage Service (Amazon S3) bucket and AWS Identity and Access Management (IAM) roles. For more information, refer to Prerequisites. Complete the following steps:

  1. Create an S3 bucket to store snapshots for your OpenSearch Service domain.
  2. Create an IAM role called SnapshotRole with the following IAM policy to delegate permissions to OpenSearch Service (provide the name of your S3 bucket):
{
    "Version": "2012-10-17",
    "Statement": [{
        "Action": ["s3:ListBucket"],
        "Effect": "Allow",
        "Resource": ["arn:aws:s3:::<s3-bucket-name>"]
    }, {
        "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
        "Effect": "Allow",
        "Resource": ["arn:aws:s3:::<s3-bucket-name>/*"]
    }]
}
  1. Set the trust relationship for SnapshotRole as follows:
{
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "",
        "Effect": "Allow",
        "Principal": {
            "Service": "es.amazonaws.com"
        },
        "Action": "sts:AssumeRole"
    }]
}
  1. Create a new IAM role called RegisterSnapshotRepo, which delegates iam:PassRole and es:ESHttpPut (provide your AWS account and domain name):
{
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "iam:PassRole",
        "Resource": "arn:aws:iam::<aws account id>:role/SnapshotRole"
    }, {
        "Effect": "Allow",
        "Action": "es:ESHttpPut",
        "Resource": "arn:aws:es:region:<aws account id>:domain/<domain-name>/*"
    }]
}
  1. If you have enabled fine-grained access control for your domain, map the snapshot role manage_snapshots to your RegisterSnapshotRepo IAM role in OpenSearch Service.
  2. Now you can use Python code like the following example to register the S3 bucket you created as a snapshot repository for your domain. Provide your host name, Region, snapshot repo name, and S3 bucket. Replace "arn:aws:iam::123456789012:role/SnapshotRole" with the ARN of your SnapshotRole. The Boto3 session should use the RegisterSnapshotRepo IAM role.
import boto3
import requests
from requests_aws4auth import AWS4Auth
 
host = '<host>' # domain endpoint with trailing /
region = '<region>' # e.g. us-west-1
service = 'es'
credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key, region, service, session_token=credentials.token)
 
# To Register the repository

path = '_snapshot/<snapshot-repo-name>'
url = host + path
 
payload = {
  "type": "s3",
  "settings": {
    "bucket": "<s3-bucket-name>",
    "region": "<region>",
    "role_arn": "arn:aws:iam::123456789012:role/SnapshotRole"
  }
}
 
headers = {"Content-Type": "application/json"}
 
r = requests.put(url, auth=awsauth, json=payload, headers=headers, timeout=300)
 
print(r.status_code)
print(r.text)

The S3 bucket used as a repository must be in the same Region as your domain. You can run the preceding code in any compute instance that has connectivity to your OpenSearch Service domain, such as Amazon Elastic Compute Cloud (Amazon EC2), AWS Cloud9, or AWS Lambda. If your domain is within a VPC, then the compute instance should be running inside the same VPC or have connectivity to the VPC.

If you have to register the snapshot repository from a local machine, replace the following lines in the preceding code:

credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key, region, service, session_token=credentials.token)

Replace this code with the following and assume the RegisterSnapshotRepo role (make sure the IAM entity that you are using has appropriate permission to assume the RegisterSnapshotRepo role). Modify "arn:aws:iam::123456789012:role/RegisterSnapshotRepo" with the ARN of your RegisterSnapshotRepo role.

sts = boto3.Session().client("sts")
response = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/RegisterSnapshotRepo",
    RoleSessionName="Snapshot-Session"
)

awsauth = AWS4Auth(response['Credentials']['AccessKeyId'], response['Credentials']['SecretAccessKey'], region, service, session_token=response['Credentials']['SessionToken'])

Upon successfully running the sample code, you should receive output such as the following:

200
{"acknowledged":true}

You can also verify that you have successfully registered the snapshot repository by accessing OpenSearch Dashboards in your domain and navigating to Snapshot Management, Repositories.
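
If your domain uses fine-grained access control with an internal user database, you can also confirm the repository from the command line; the endpoint, credentials, and repository name below are placeholders:

curl -s -u 'master-user:master-password' \
  "https://<your-domain-endpoint>/_snapshot/<snapshot-repo-name>?pretty"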

Create an index template

Index templates enable you to automatically assign configuration when new indexes are created that match a wildcard index pattern.

  1. Navigate to your domain’s OpenSearch Dashboards and choose the Dev Tools tab.
  2. Enter the following text in the left pane and choose the play icon to run it.

The index template applies to newly created indexes that match the pattern of "log*". These indexes are attached to the alias named hot and the replica count is set to 1.

PUT _index_template/snapshot_template
{
  "index_patterns" : ["log*"],
  "template": {
      "settings": {
      "number_of_replicas": 1
    },
    "aliases" : {
        "hot" : {}
    }
  }
}

Note that in the preceding example, you assign the template to indexes that match "log*". You can modify index_patterns for your use case.

Create an ISM policy

In this step, you create the ISM policy that updates the alias of an index before it is migrated to UltraWarm. The policy also performs hot to warm migration. This is to overcome the limitations where a snapshot can’t be taken across two storage tiers (hot and UltraWarm). Complete the following steps:

  1. Navigate to the Dev Tools page of OpenSearch Dashboards.
  2. Create the ISM policy using the following command (modify the values of index_patterns and min_index_age accordingly):
PUT /_plugins/_ism/policies/alias_policy
{
    "policy": {
        "policy_id": "alias_policy",
        "description": "Example Policy for changing the alias and performing the warm migration",
        "default_state": "hot_alias",
        "states": [{
            "name": "hot_alias",
            "actions": [],
            "transitions": [{
                "state_name": "warm",
                "conditions": {
                    "min_index_age": "30d"
                }
            }]
        }, {
            "name": "warm",
            "actions": [{
                "alias": {
                    "actions": [{
                        "remove": {
                            "aliases": ["hot"]
                        }
                    }, {
                        "add": {
                            "aliases": ["warm"]
                        }
                    }]
                }
            }, {
                "retry": {
                    "count": 5,
                    "backoff": "exponential",
                    "delay": "1h"
                },
                "warm_migration": {}
            }],
            "transitions": []
        }],
        "ism_template": [{
            "index_patterns": ["log*"],
            "priority": 100
        }]
    }
}

Create a snapshot policy

In this step, you create a snapshot policy, which takes a snapshot for all the indexes aliased as hot and stores them to your repository at the scheduled time in the cron expression (midnight UTC). Complete the following steps:

  1. Navigate to the Dev Tools page of OpenSearch Dashboards.
  2. Create the snapshot policy using the following command (modify the value of snapshot-repo-name to the name of the snapshot repository you registered previously):
POST _plugins/_sm/policies/daily-snapshot-for-manual-repo
{
    "description": "Policy for Daily Snapshot in the Manual repo",
    "creation": {
      "schedule": {
        "cron": {
          "expression": "0 0 * * *",
          "timezone": "UTC"
        }
      }
    },
    "deletion": {
      "schedule": {
        "cron": {
          "expression": "0 1 * * *",
          "timezone": "UTC"
        }
      },
      "condition": {
        "min_count": 1,
        "max_count": 40
      }
    },
    "snapshot_config": {
      "indices": "hot",
      "repository": "snapshot-repo-name"
    }
}
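
After the policy has run at least once, you can confirm it and list the snapshots it created; as before, the endpoint, credentials, and repository name are placeholders and assume fine-grained access control with an internal user database:

# Show the snapshot policy you just created
curl -s -u 'master-user:master-password' \
  "https://<your-domain-endpoint>/_plugins/_sm/policies/daily-snapshot-for-manual-repo?pretty"

# List the snapshots stored in the registered repository
curl -s -u 'master-user:master-password' \
  "https://<your-domain-endpoint>/_cat/snapshots/<snapshot-repo-name>?v"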

Clean up

Snapshots that you create incur cost in the S3 bucket used as the repository. With the new Snapshot Management feature, you can list and delete unwanted snapshots directly from OpenSearch Dashboards; you can also delete the snapshot policy and the ISM policy created in this post to stop further automated snapshots and migrations.

Conclusion

With the new Snapshot Management capabilities of OpenSearch Service, you can create regular backups and ensure the availability of your data even in the event of unexpected events or disasters. In this post, we discussed essential concepts such as snapshot repositories, automated snapshot lifecycle policies, and Snapshot Management options, enabling you to make informed decisions when it comes to managing your data backups effectively. As you continue to explore and harness the potential of OpenSearch Service, incorporating Snapshot Management into your data protection strategy will undoubtedly provide you with the resilience and reliability needed to ensure business continuity.

If you have feedback about this post, share it in the comments section. If you have questions about this post, start a new thread on the Amazon OpenSearch Service forum or contact AWS Support.


About the authors

Prashant Agrawal is a Sr. Search Specialist Solutions Architect with Amazon OpenSearch Service. He works closely with customers to help them migrate their workloads to the cloud and helps existing customers fine-tune their clusters to achieve better performance and save on cost. Before joining AWS, he helped various customers use OpenSearch and Elasticsearch for their search and log analytics use cases. When not working, you can find him traveling and exploring new places. In short, he likes doing Eat → Travel → Repeat.

Hendy Wijaya is a Senior OpenSearch Specialist Solutions Architect at Amazon Web Services. Hendy enables customers to leverage AWS services to achieve their business objectives and gain competitive advantages. He is passionate about collaborating with customers to get the best out of OpenSearch and Amazon OpenSearch Service.

Utkarsh Agarwal is a Cloud Support Engineer in the Support Engineering team at Amazon Web Services. He specializes in Amazon OpenSearch Service. He provides guidance and technical assistance to customers, enabling them to build scalable, highly available, and secure solutions in the AWS Cloud. In his free time, he enjoys watching movies, TV series, and of course cricket! Lately, he is also attempting to master the art of cooking – the taste buds are excited, but the kitchen might disagree.

Training machine learning models on premises for data residency with AWS Outposts rack

Post Syndicated from Macey Neff original https://aws.amazon.com/blogs/compute/training-machine-learning-models-on-premises-for-data-residency-with-aws-outposts-rack/

This post is written by Sumit Menaria, Senior Hybrid Solutions Architect, and Boris Alexandrov, Senior Product Manager-Tech. 

In this post, you will learn how to train machine learning (ML) models on premises using AWS Outposts rack and datasets stored locally in Amazon S3 on Outposts. With the rise in data sovereignty and privacy regulations, organizations are seeking flexible solutions that balance compliance with the agility of cloud services. Healthcare and financial sectors, for instance, harness machine learning for enhanced patient care and transaction safety, all while upholding strict confidentiality. Outposts rack provides a seamless hybrid solution by extending AWS capabilities to any on-premises or edge location, giving you the flexibility to store and process data wherever you choose. Data sovereignty regulations are highly nuanced and vary by country. This blog post addresses data sovereignty scenarios where training datasets need to be stored and processed in a geographic location without an AWS Region.

Amazon S3 on Outposts

As you prepare datasets for ML model training, a key component to consider is the storage and retrieval of your data, especially when adhering to data residency and regulatory requirements.

You can store training datasets as object data in local buckets with Amazon S3 on Outposts. In order to access S3 on Outposts buckets for data operations, you need to create access points and route the requests via an S3 on Outposts endpoint associated with your VPC. These endpoints are accessible both from within the VPC as well as on premises via the local gateway.

S3 on Outposts connectivity options

Solution overview

Using this sample architecture, you are going to train a YOLOv5 model on a subset of categories of the Common Objects in Context (COCO) dataset. The COCO dataset is a popular choice for object detection tasks offering a wide variety of image categories with rich annotations. It is also available under the AWS Open Data Sponsorship Program via fast.ai datasets.

Architecture for ML training on Outposts rack

This example is based on an architecture using an Amazon Elastic Compute Cloud (Amazon EC2) g4dn.8xlarge instance for model training on the Outposts rack. Depending on your Outposts rack compute configuration, you can use different instance sizes or types and make adjustments to training parameters, such as learning rate, augmentation, or model architecture accordingly. You will be using the AWS Deep Learning AMI to launch your EC2 instance, which comes with frameworks, dependencies, and tools to accelerate deep learning in the cloud.

For the training dataset storage, you are going to use an S3 on Outposts bucket and connect to it from your on-premises location via the Outposts local gateway. The local gateway routing mode can be direct VPC routing or Customer-owned IP (CoIP) depending on your workload’s requirements. Your local gateway routing mode will determine the S3 on Outposts endpoint configuration that you need to use.

1. Download and populate training dataset

You can download the training dataset to your local client machine using the following AWS CLI command:

aws s3 sync s3://fast-ai-coco/ .

After downloading, unzip annotations_trainval2017.zip, val2017.zip and train2017.zip files.

$ unzip annotations_trainval2017.zip
$ unzip val2017.zip
$ unzip train2017.zip

In the annotations folder, the files which you need to use are instances_train2017.json and instances_val2017.json, which contain the annotations corresponding to the images in the training and validation folders.

2. Filtering and preparing training dataset

You are going to use the training, validation, and annotation files from the COCO dataset. The dataset contains over 100K images across 80 categories, but to keep the training simple, you can focus on 10 specific categories of popular food items on supermarket shelves: banana, apple, sandwich, orange, broccoli, carrot, hot dog, pizza, donut, and cake. (Because who doesn’t like a bite after a model training.) Applications for training such models could be self-stock monitoring, automatic checkouts, or product placement optimization using computer vision in retail stores. Because YOLOv5 uses its own annotation (label) format, you need to convert the COCO annotations to that target format.

3. Load training dataset to S3 on Outposts bucket

To load the training data on S3 on Outposts, first create a new bucket using the AWS Management Console or the AWS CLI, along with an access point and an endpoint for the VPC. You can then use the bucket-style access point alias to load the data with the following CLI command:

$ cd /your/local/target/upload/path/
$ aws s3 sync . s3://trainingdata-o0a2b3c4d5e6d7f8g9h10f--op-s3

Replace the alias in the preceding CLI command with the corresponding bucket alias name for your environment. The s3 sync command uploads the folders while preserving the structure of the images and labels for the training and validation data, which you will later load onto the EC2 instance for model training.

4. Launch the EC2 instance

You can launch the EC2 instance with the Deep Learning AMI based on this getting started tutorial. For this exercise, the Deep Learning AMI GPU PyTorch 2.0.1 (Ubuntu 20.04) has been used.

5. Download YOLOv5 and install dependencies

Once you ssh into the EC2 instance, activate the pre-configured PyTorch environment and clone the YOLOv5 repository.

$ ssh -i /path/key-pair-name.pem ubuntu@instance-ip-address
$ conda activate pytorch
$ git clone https://github.com/ultralytics/yolov5.git
$ cd yolov5

Then, install its necessary dependencies.

$ pip install -U -r requirements.txt

To ensure compatibility between packages, you may need to modify existing packages on your instance running the AWS Deep Learning AMI.

6. Load the training dataset from S3 on Outposts to the EC2 instance

For copying the training dataset to the EC2 instance, use the s3 sync CLI command and point it to your local workspace.

aws s3 sync s3://trainingdata-o0a2b3c4d5e6d7f8g9h10f--op-s3 .

7. Prepare the configuration files

Create the data configuration files to reflect your dataset’s structure, categories, and other parameters.
data.yml

train: /your/ec2/path/to/data/images/train 
val: /your/ec2/path/to/data/images/val 
nc: 10 # Number of classes in your dataset 
names: ['banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake']

Create the model training parameter file using the sample configuration file from the YOLOv5 repository. You will need to update the number of classes to 10, but you can also change other parameters as you fine-tune the model for performance.

parameters.yml:

# Parameters
nc: 10 # number of classes in your dataset
depth_multiple: 0.33 # model depth multiple
width_multiple: 0.50 # layer channel multiple
anchors:
- [10,13, 16,30, 33,23] # P3/8
- [30,61, 62,45, 59,119] # P4/16
- [116,90, 156,198, 373,326] # P5/32

# Backbone
backbone:
[[-1, 1, Conv, [64, 6, 2, 2]], # 0-P1/2
[-1, 1, Conv, [128, 3, 2]], # 1-P2/4
[-1, 3, C3, [128]],
[-1, 1, Conv, [256, 3, 2]], # 3-P3/8
[-1, 6, C3, [256]],
[-1, 1, Conv, [512, 3, 2]], # 5-P4/16
[-1, 9, C3, [512]],
[-1, 1, Conv, [1024, 3, 2]], # 7-P5/32
[-1, 3, C3, [1024]],
[-1, 1, SPPF, [1024, 5]], # 9
]

# Head
head:
[[-1, 1, Conv, [512, 1, 1]],
[-1, 1, nn.Upsample, [None, 2, 'nearest']],
[[-1, 6], 1, Concat, [1]], # cat backbone P4
[-1, 3, C3, [512, False]], # 13

[-1, 1, Conv, [256, 1, 1]],
[-1, 1, nn.Upsample, [None, 2, 'nearest']],
[[-1, 4], 1, Concat, [1]], # cat backbone P3
[-1, 3, C3, [256, False]], # 17 (P3/8-small)

[-1, 1, Conv, [256, 3, 2]],
[[-1, 14], 1, Concat, [1]], # cat head P4
[-1, 3, C3, [512, False]], # 20 (P4/16-medium)

[-1, 1, Conv, [512, 3, 2]],
[[-1, 10], 1, Concat, [1]], # cat head P5
[-1, 3, C3, [1024, False]], # 23 (P5/32-large)

[[17, 20, 23], 1, Detect, [nc, anchors]], # Detect(P3, P4, P5)
]

At this stage, the directory structure should look like below:

Directory tree showing training dataset and model configuration structure

8. Train the model

You can run the following command to train the model. The batch size and number of epochs can vary depending on your vCPU and GPU configuration, and you can further modify these values or add weights as you run additional rounds of training.

$ python3 train.py --img-size 640 --batch-size 32 --epochs 50 --data /your/path/to/configuration_files/data.yml --cfg /your/path/to/configuration_files/parameters.yml

You can monitor the model performance as it iterates through each epoch.

Starting training for 50 epochs...

Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
0/49 6.7G 0.08403 0.05 0.04359 129 640: 100%|██████████| 455/455 [06:14<00:00,
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 9/9 [00:05<0
all 575 2114 0.216 0.155 0.0995 0.0338

Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
1/49 8.95G 0.07131 0.05091 0.02365 179 640: 100%|██████████| 455/455 [06:00<00:00,
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 9/9 [00:04<00:00, 1.97it/s]
all 575 2114 0.242 0.144 0.11 0.04

Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
2/49 8.96G 0.07068 0.05331 0.02712 154 640: 100%|██████████| 455/455 [06:01<00:00, 1.26it/s]
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 9/9 [00:04<00:00, 2.23it/s]
all 575 2114 0.185 0.124 0.0732 0.0273

Once the model training finishes, you can see the validation results against the validation dataset and evaluate the model’s performance using standard metrics.

Validating runs/train/exp/weights/best.pt...
Fusing layers... 
YOLOv5 summary: 157 layers, 7037095 parameters, 0 gradients, 15.8 GFLOPs
                 Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|██████████| 9/9 [00:06<00:00,  1.48it/s]
                   all        575       2114      0.282      0.222       0.16     0.0653
                banana        575        280      0.189      0.143     0.0759      0.024
                 apple        575        186      0.206      0.085     0.0418     0.0151
              sandwich        575        146      0.368      0.404      0.343      0.146
                orange        575        188      0.265      0.149     0.0863     0.0362
              broccoli        575        226      0.239      0.226      0.138     0.0417
                carrot        575        310      0.182      0.203     0.0971     0.0267
               hot dog        575        108      0.242      0.111     0.0929     0.0311
                 pizza        575        208      0.405      0.418      0.333       0.15
                 donut        575        228      0.352      0.241       0.19     0.0973
                  cake        575        234      0.369      0.235      0.203     0.0853
Results saved to runs/train/exp

Use the model for inference

To test the model’s performance, pass it a new image of a supermarket shelf that contains some of the objects you trained the model on.

Sample inference image with 1 cake, 6 oranges, and 4 apples

(pytorch) ubuntu@ip-172-31-48-165:~/workspace/source/yolov5$ python3 detect.py --weights /home/ubuntu/workspace/source/yolov5/runs/train/exp/weights/best.pt --source /home/ubuntu/workspace/inference/Inference-image.jpg
<<omitted output>>
Fusing layers...
YOLOv5 summary: 157 layers, 7037095 parameters, 0 gradients, 15.8 GFLOPs
image 1/1 /home/ubuntu/workspace/inference/Inference-image.jpg: 640x640 4 apples, 6 oranges, 1 cake, 5.3ms
Speed: 0.6ms pre-process, 5.3ms inference, 1.1ms NMS per image at shape (1, 3, 640, 640)
Results saved to runs/detect/exp7

The response from the preceding model inference indicates that it predicted 4 apples, 6 oranges, and 1 cake in the image. The prediction may differ based on the image type used, and while a single sample image can give you a sense of the model’s performance, it will not provide a comprehensive understanding. For a more complete evaluation, it’s always recommended to test the model on a larger and more diverse set of validation images. Additional training and tuning of your parameters or datasets may be required to achieve better prediction.

Clean Up

After you have successfully trained and tested the model, you can clean up the resources used in this tutorial, such as the EC2 instance and the S3 on Outposts bucket, access point, and endpoint.

Conclusion

The seamless integration of compute on AWS Outposts with S3 on Outposts, coupled with on-premises ML model training capabilities, offers organizations a robust solution for tackling data residency requirements. By setting up this environment, you can ensure that your datasets remain within the desired geographies while still utilizing advanced machine learning models and cloud infrastructure. In addition, it remains essential to diligently review and fine-tune your implementation strategies and the guardrails you have in place to ensure your data remains within the boundaries of your regulatory requirements. You can read more about architecting for data residency in this blog post.


Accelerate your data warehouse migration to Amazon Redshift – Part 7

Post Syndicated from Mykhailo Kondak original https://aws.amazon.com/blogs/big-data/accelerate-your-data-warehouse-migration-to-amazon-redshift-part-7/

Tens of thousands of customers use Amazon Redshift to gain business insights from their data. With Amazon Redshift, you can use standard SQL to query data across your data warehouse, operational data stores, and data lake. You can also integrate other AWS services such as Amazon EMR, Amazon Athena, Amazon SageMaker, AWS Glue, AWS Lake Formation, and Amazon Kinesis to use all the analytic capabilities in AWS.

Migrating a data warehouse can be complex. You have to migrate terabytes or petabytes of data from your legacy system while not disrupting your production workload. You also need to ensure that the new target data warehouse is consistent with upstream data changes so that business reporting can continue uninterrupted when you cut over to the new platform.

Previously, there were two main strategies to maintain data consistency after the initial bulk load during a migration to Amazon Redshift. You could identify the changed rows, perhaps using a filter on update timestamps, or you could modify your extract, transform, and load (ETL) process to write to both the source and target databases. Both of these options require manual effort to implement and increase the cost and risk of the migration project.

AWS Schema Conversion Tool (AWS SCT) can help you with the initial bulk load from Azure Synapse Analytics, BigQuery, Greenplum Database, IBM Netezza, Microsoft SQL Server, Oracle, Snowflake, Teradata, and Vertica. Now, we’re happy to share that AWS SCT has automated maintaining data consistency for you. If you’re migrating from an IBM Netezza data warehouse to Amazon Redshift, the AWS SCT data extractors will automatically capture changes from the source and apply them on the target. You configure a change data capture (CDC) migration task in AWS SCT, and it extracts the relevant data changes from IBM Netezza and applies them in a transactionally consistent order on Amazon Redshift. You only need to configure the required resources on IBM Netezza and start the data migration; the source database remains fully operational during the migration and replication.

In this post, we describe at a high-level how CDC tasks work in AWS SCT. Then we deep dive into an example of how to configure, start, and manage a CDC migration task. We look briefly at performance and how you can tune a CDC migration, and then conclude with some information about how you can get started on your own migration.


Solution overview

The following diagram shows the data migration and replication workflow with AWS SCT.

In the first step, your AWS SCT data extraction agent completes the full load of your source data to Amazon Redshift. Then the AWS SCT data extraction agent uses a history database in Netezza. The history database captures information about user activity such as queries, query plans, table access, column access, session creation, and failed authentication requests. The data extraction agent extracts information about transactions that you run in your source Netezza database and replicates them to your target Redshift database.

You can start ongoing replication automatically after you complete the full load. Alternatively, you can start CDC at a later time or on a schedule. For more information, refer to Configuring ongoing data replication.

At a high level, the ongoing replication flow is as follows.

At the start of the replication, the data extraction agent captures the last transaction identifier in the history table. The data extraction agent stores this value in the max_createxid variable. To capture the transaction ID, the agent runs the following query:

SELECT max(XID) AS max_createxid 
    FROM <DB_NAME>.<SCHEMA_NAME>."$hist_plan_prolog_n";

If this transaction ID value is different from the CDC start point, then the agent identifies the delta to replicate. This delta includes all transactions for the selected tables that happened after full load or after the previous replication. The data extraction agent selects the updated data from your source table.

From this updated data, AWS SCT creates two temporary tables. The first table includes all rows that you deleted from your source database and the old data of the rows that you updated. The second table includes all rows that you inserted and the new data of the rows that you updated. AWS SCT then uses these tables in JOIN clauses to replicate the changes to your target Redshift database.

Next, AWS SCT copies these tables to your Amazon Simple Storage Service (Amazon S3) bucket and uses this data to update your target Redshift cluster.

After updating your target database, AWS SCT deletes these temporary tables. Next, your data extraction agent sets the value of the CDC start point equal to the captured transaction ID (max_createxid). During the next data replication run, your agent will determine the delta to replicate using this updated CDC start point.

All changes that happen to your source database during the replication run will be captured in the next replication run. Make sure that you repeat the replication steps until the delta is equal to zero for each table that you included in the migration scope. At this point, you can cut over to your new databases.

Configure your source database

In your source Netezza database, create a history database and configure the history logging. Next, grant read permissions for all tables in the history database to the user that you use in the AWS SCT project. This user has the minimal permissions that are required to convert your source database schemas to Amazon Redshift.

Configure AWS SCT

Before you start data migration and replication with AWS SCT, make sure that you download the Netezza and Amazon Redshift drivers. Take note of the path to the folder where you saved these files. You will specify this path in the AWS SCT and data extraction agent settings.

To make sure the data extraction agents work properly, install the latest version of Amazon Corretto 11.

To configure AWS SCT, complete the following steps:

  1. After you create an AWS SCT project, connect to your source and target databases and set up the mapping rules. A mapping rule describes a source-target pair that defines the migration target for your source database schema.
  2. Convert your database schemas and apply them to Amazon Redshift if you haven’t done this yet. Make sure that the target tables exist in your Redshift database before you start data migration.
  3. Now, install the data extraction agent. The AWS SCT installer includes the installation files for data extraction agents in the agents folder. Configure your data extraction agents by adding the listening port number and the path to the source and target database drivers. For the listening port, you can proceed with the default value. For database drivers, enter the path that you noted before.

The following diagram shows how the AWS SCT data extraction agents work.

After you install the data extraction agent, register it in AWS SCT.

  1. Open the data migration view in AWS SCT and choose Register.
  2. Enter the name of your agent, the host name, and the port that you configured in the previous step. For the host name, you can use localhost (0.0.0.0) if you run AWS SCT on the same machine where you installed the data extraction agent.

Create and run a CDC task

Now you can create and manage your data migration and replication tasks. To do so, complete the following steps:

  1. Select the tables in your source database to migrate, open the context (right-click) menu, and choose Create local task.
  2. Choose your data migration mode (for this post, choose Extract, upload and copy to replicate data changes from your source database):
    1. Extract only – Extract your data and save it to your local working folders.
    2. Extract and upload – Extract your data and upload it to Amazon S3.
    3. Extract, upload and copy – Extract your data, upload it to Amazon S3, and copy it into your Redshift data warehouse.
  3. Choose your encryption type. Make sure that you configure encryption for safe and secure data migrations.
  4. Select Enable CDC.

  1. After this, you can switch to the CDC settings tab.
  2. For CDC mode, you can choose from the following options:
    1. Migrate existing data and replicate ongoing changes – Migrate all existing source data and then start the replication. This is the default option.
    2. Replicate data changes only – Start data replication immediately.

Sometimes you don’t need to migrate all existing source data. For example, if you have already migrated your data, you can start the data replication by choosing Replicate data changes only.

  1. If you choose Replicate data changes only, you can also set the Last CDC point to configure the replication to start from this point. If it is not set, AWS SCT data extraction agents replicate all changes that occur after your replication task is started.

If your replication task failed, you can restart the replication from the point of failure. You can find the identifier of the last migrated CDC point on the CDC processing details tab in AWS SCT, set Last CDC point and start the task again. This will allow AWS SCT data extraction agents to replicate all changes in your source tables to your target database without gaps.

  1. You can also configure when you want to schedule the CDC runs to begin.

If you select Immediately, the first replication runs immediately after your agent completes the full load. Alternatively, you can specify the time and date when you want to start the replication.

  1. Also, you can schedule when to run the replication again. You can enter the number of days, hours, or minutes when to repeat the replication runs. Set these values depending on the intensity of data changes in your source database.
  2. Finally, you can set the end date when AWS SCT will stop running the replication.

  1. On the Amazon S3 settings tab, you can connect your AWS SCT data extraction agent with your Amazon S3 bucket.

You don’t need to do this step if you have configured the AWS service profile in the global application settings.

  1. After you have configured all settings, choose Create to create a CDC task.

  1. Start this task in the AWS SCT user interface.

The following screenshots show examples of the AWS SCT user interface once you started tasks.

You can run multiple CDC tasks in parallel at the same time. For example, you can include different sets of source tables in each task to replicate the changes to different target Redshift clusters. AWS SCT handles these replication tasks and distributes resources correctly to minimize the replication time.

Data replication limitations

There are a few limitations in AWS SCT data replication:

  • Changes in your source database don’t trigger the replication run because AWS SCT isn’t able to automate these runs (as of this writing). You can instead run the data replication tasks on a predefined schedule.
  • AWS SCT doesn’t replicate TRUNCATE and DDL statements. If you change the structure of your source table or truncate it, then you must run the same statements in your target database. You should make these changes manually because AWS SCT isn’t aware of structure updates.

End-to-end example

Now that you know how to create a local replication task in AWS SCT, we deep dive and show how AWS SCT performs the extract and load processes.

  1. First, we run the following code to check that we correctly configured our source Netezza database. To use this code example, change the name of your history database.
SELECT COUNT(*) FROM HISTDB.DBE."$hist_column_access_1";

If you configured your database correctly, then the output of this command includes a value that is different from zero. In our case, the result is as follows:

 COUNT |
-------+
2106717|

  1. Now we create a table on Netezza to use in the example. The table has three columns and a primary key.
DROP TABLE NZ_DATA4EXTRACTOR.CDC_DEMO IF EXISTS; 
CREATE TABLE NZ_DATA4EXTRACTOR.CDC_DEMO 
(
    ID INTEGER NOT NULL, 
    TS TIMESTAMP DEFAULT CURRENT_TIMESTAMP, 
    CMNT CHARACTER(16)
)
DISTRIBUTE ON RANDOM; 

ALTER TABLE NZ_DATA4EXTRACTOR.CDC_DEMO 
    ADD CONSTRAINT CDC_DEMO_PK PRIMARY KEY (ID); 

SELECT * 
    FROM NZ_DATA4EXTRACTOR.CDC_DEMO 
    ORDER BY ID;
  1. The SELECT statement returns an empty table:
ID|TS|CMNT|
--+--+----+
  1. Before we start the replication, we run the following query on Netezza to get the latest transaction identifier in the history table:
SELECT MAX(XID) AS XID 
    FROM HISTDB.DBE."$hist_plan_prolog_1";

For our test table, the script prints the last transaction identifier, which is 2691798:

XID    |
-------+
2691798|
  1. To make sure that our table doesn’t include new transactions, AWS SCT runs the following script. If you want to run this script manually, replace 2691798 with the last transaction identifier in your history table.
SELECT MAX(CREATEXID) AS CREATEXID
    FROM NZ_DATA4EXTRACTOR.CDC_DEMO 
    WHERE CREATEXID > 2691798;

As expected, the script doesn’t return any values.

CREATEXID|
---------+
         |

CREATEXID and DELETEXID are hidden system columns that exist in every Netezza table. CREATEXID identifies the transaction ID that created the row, and DELETEXID identifies the transaction ID that deleted the row. AWS SCT uses them to find changes in the source data.

Now we’re ready to start the replication.

  1. We assume you’ve used AWS SCT to convert the example table and build it on the target Amazon Redshift. AWS SCT runs the following statement on Amazon Redshift:
DROP TABLE IF EXISTS nz_data4extractor_nz_data4extractor.cdc_demo;

CREATE TABLE nz_data4extractor_nz_data4extractor.cdc_demo
(
    id INTEGER ENCODE AZ64 NOT NULL,
    ts TIMESTAMP WITHOUT TIME ZONE ENCODE AZ64 DEFAULT SYSDATE::TIMESTAMP,
    cmnt CHARACTER VARYING(48) ENCODE LZO
)
DISTSTYLE AUTO;

ALTER TABLE nz_data4extractor_nz_data4extractor.cdc_demo
    ADD CONSTRAINT cdc_demo_pk PRIMARY KEY (id);
  1. AWS SCT also creates a staging table that holds replication changes until they can be applied on the actual target table:
CREATE TABLE IF NOT EXISTS "nz_data4extractor_nz_data4extractor"."_cdc_unit"
    (LIKE "nz_data4extractor_nz_data4extractor"."cdc_demo" INCLUDING  DEFAULTS);
ALTER TABLE "nz_data4extractor_nz_data4extractor"."_cdc_unit" 
    ADD COLUMN deletexid_ BIGINT;
  1. AWS SCT runs the following query to capture all changes that happened after the last transaction identifier:
SET show_deleted_records = true;

SELECT 
    ID, 
    TS, 
    CMNT
FROM NZ_DATA4EXTRACTOR.CDC_DEMO
WHERE CREATEXID <= 2691798
    AND (DELETEXID = 0 OR DELETEXID > 2691798)

This script returns an empty table:

ID|TS|CMNT|
--+--+----+
  1. Now, we change data on Netezza and see how it gets replicated to Amazon Redshift:
INSERT INTO NZ_DATA4EXTRACTOR.CDC_DEMO (ID, CMNT) VALUES (1, 'One');
INSERT INTO NZ_DATA4EXTRACTOR.CDC_DEMO (ID, CMNT) VALUES (2, 'Two');
INSERT INTO NZ_DATA4EXTRACTOR.CDC_DEMO (ID, CMNT) VALUES (3, 'Three');
INSERT INTO NZ_DATA4EXTRACTOR.CDC_DEMO (ID, CMNT) VALUES (4, 'Four');
INSERT INTO NZ_DATA4EXTRACTOR.CDC_DEMO (ID, CMNT) VALUES (5, 'Five');

SELECT * FROM NZ_DATA4EXTRACTOR.CDC_DEMO
    ORDER BY ID;

The preceding script returns the following result:

ID|TS                     |CMNT            |
--+-----------------------+----------------+
 1|2023-03-07 14:05:11.000|One             |
 2|2023-03-07 14:05:11.000|Two             |
 3|2023-03-07 14:05:11.000|Three           |
 4|2023-03-07 14:05:11.000|Four            |
 5|2023-03-07 14:05:11.000|Five            |
  1. AWS SCT checks for data changes starting from the last transaction ID using the following query:
SET show_deleted_records = true;

SELECT MAX(CREATEXID) AS CREATEXID
    FROM NZ_DATA4EXTRACTOR.CDC_DEMO
    WHERE createxid > 2691798

The script returns a result that is different from zero:

CREATEXID|
---------+
  2691824|

Because the new transaction ID is greater than the last transaction ID, the history table contains new data to be replicated.

  1. AWS SCT runs the following query to extract the changes. The application detects all rows that were inserted, deleted, or updated within the scope of transactions with IDs in range from 2691798 + 1 to 2691824.
SELECT
    ID,
    TS,
    CMNT,
    deletexid_
FROM (
    SELECT
        createxid,
        rowid,
        deletexid,
        2691798 AS min_CDC_trx,
        2691824 AS max_CDC_trx,
        CASE WHEN deletexid > max_CDC_trx
            THEN 0
            ELSE deletexid
            END AS deletexid_,
        MIN(createxid) OVER (PARTITION BY rowid) AS min_trx,
        COUNT(1) OVER (PARTITION BY rowid) AS rowid_cnt,
        ID,
        TS,
        CMNT
    FROM NZ_DATA4EXTRACTOR.CDC_DEMO AS t
    WHERE deletexid <> 1
        AND (CREATEXID > min_CDC_trx OR deletexid_ > min_CDC_trx) -- Prior run max trx
        AND CREATEXID <= max_CDC_trx -- Current run max trx
    ) AS r
WHERE (min_trx = createxid OR deletexid_ = 0)
    AND NOT (
        CREATEXID > min_CDC_trx 
        AND deletexid <= max_CDC_trx 
        AND rowid_cnt = 1 
        AND deletexid > 0
        )

The extracted data is as follows:

ID|TS                     |CMNT            |DELETEXID_|
--+-----------------------+----------------+----------+
 1|2023-03-07 14:05:11.000|One             |         0|
 2|2023-03-07 14:05:11.000|Two             |         0|
 3|2023-03-07 14:05:11.000|Three           |         0|
 4|2023-03-07 14:05:11.000|Four            |         0|
 5|2023-03-07 14:05:11.000|Five            |         0|
  1. Next, AWS SCT compresses the data and uploads it to Amazon S3. Then AWS SCT runs the following command to copy the data into the staging table on Amazon Redshift:
TRUNCATE TABLE "nz_data4extractor_nz_data4extractor"."_cdc_unit"; 

COPY "nz_data4extractor_nz_data4extractor"."_cdc_unit" 
    ("id", "ts", "cmnt", "deletexid_")
    FROM 's3://bucket/folder/unit_1.manifest' MANIFEST
    CREDENTIALS '...'
    REGION '...'
    REMOVEQUOTES
    IGNOREHEADER 1
    GZIP
    DELIMITER '|';
  1. From the staging table, AWS SCT applies the changes to the actual target table. For this iteration, we insert new rows into the Redshift table:
INSERT INTO "nz_data4extractor_nz_data4extractor"."cdc_demo"("id", "ts", "cmnt")
SELECT 
        "id", 
        "ts", 
        "cmnt" 
    FROM "nz_data4extractor_nz_data4extractor"."_cdc_unit" t2
    WHERE t2.deletexid_ = 0;
  1. Let’s run another script that not only inserts, but also deletes and updates data in the source table:
INSERT INTO NZ_DATA4EXTRACTOR.CDC_DEMO (ID, CMNT) VALUES (6, 'Six');
INSERT INTO NZ_DATA4EXTRACTOR.CDC_DEMO (ID, CMNT) VALUES (7, 'Seven');
INSERT INTO NZ_DATA4EXTRACTOR.CDC_DEMO (ID, CMNT) VALUES (8, 'Eight');

DELETE FROM NZ_DATA4EXTRACTOR.CDC_DEMO WHERE ID = 1;
DELETE FROM NZ_DATA4EXTRACTOR.CDC_DEMO WHERE ID = 7;

UPDATE NZ_DATA4EXTRACTOR.CDC_DEMO SET CMNT = 'Updated Two' WHERE ID = 2;
UPDATE NZ_DATA4EXTRACTOR.CDC_DEMO SET CMNT = 'Updated Four' WHERE ID = 4;
UPDATE NZ_DATA4EXTRACTOR.CDC_DEMO SET CMNT = 'Replaced Four' WHERE ID = 4;
UPDATE NZ_DATA4EXTRACTOR.CDC_DEMO SET CMNT = 'Yet Again Four' WHERE ID = 4;
UPDATE NZ_DATA4EXTRACTOR.CDC_DEMO SET CMNT = 'Updated Five' WHERE ID = 5;
UPDATE NZ_DATA4EXTRACTOR.CDC_DEMO SET CMNT = 'Updated Eight' WHERE ID = 8;

DELETE FROM NZ_DATA4EXTRACTOR.CDC_DEMO WHERE ID = 5;

SELECT * FROM NZ_DATA4EXTRACTOR.CDC_DEMO
    ORDER BY ID;

The Netezza table contains the following rows:

ID|TS                     |CMNT            |
--+-----------------------+----------------+
 2|2023-03-07 14:05:11.000|Updated Two     |
 3|2023-03-07 14:05:11.000|Three           |
 4|2023-03-07 14:05:11.000|Yet Again Four  |
 6|2023-03-07 14:07:09.000|Six             |
 8|2023-03-07 14:07:10.000|Updated Eight   |
  1. AWS SCT detects the changes as before using the new transaction ID:
SET show_deleted_records = true;

SELECT MAX(CREATEXID) AS CREATEXID
    FROM NZ_DATA4EXTRACTOR.CDC_DEMO
    WHERE createxid > 2691824
CREATEXID|
---------+
  2691872|
SELECT
    ID,
    TS,
    CMNT,
    deletexid_
FROM (
    SELECT
        createxid,
        rowid,
        deletexid,
        2691824 AS min_CDC_trx,
        2691872 AS max_CDC_trx,
        CASE WHEN deletexid > max_CDC_trx
            THEN 0
            ELSE deletexid
            END AS deletexid_,
        MIN(createxid) OVER (PARTITION BY rowid) AS min_trx,
        COUNT(1) OVER (PARTITION BY rowid) AS rowid_cnt,
        ID,
        TS,
        CMNT
    FROM NZ_DATA4EXTRACTOR.CDC_DEMO AS t
    WHERE deletexid <> 1
        AND (CREATEXID > min_CDC_trx OR deletexid_ > min_CDC_trx) -- Prior run max trx
        AND CREATEXID <= max_CDC_trx -- Current run max trx
    ) AS r
WHERE (min_trx = createxid OR deletexid_ = 0)
    AND NOT (
        CREATEXID > min_CDC_trx 
        AND deletexid <= max_CDC_trx 
        AND rowid_cnt = 1 
        AND deletexid > 0
        )

The extracted changes appear as follows:

ID|TS                     |CMNT          |DELETEXID_|
--+-----------------------+--------------+----------+
 1|2023-03-07 14:05:11.000|One           |   2691856|
 2|2023-03-07 14:05:11.000|Two           |   2691860|
 2|2023-03-07 14:05:11.000|Updated Two   |         0|
 4|2023-03-07 14:05:11.000|Four          |   2691862|
 4|2023-03-07 14:05:11.000|Yet Again Four|         0|
 5|2023-03-07 14:05:11.000|Five          |   2691868|
 6|2023-03-07 14:07:09.000|Six           |         0|
 8|2023-03-07 14:07:10.000|Eight         |   2691870|
 8|2023-03-07 14:07:10.000|Updated Eight |         0|

Notice that we inserted a new row with ID 7 and then deleted this row. Therefore, we can ignore the row with ID 7 in our delta.

Also, we made several updates of the row with ID 4. In our delta, we include the original and the most recent versions of the row. We ignore all intermediate versions in our delta.

We updated the row with ID 5 and then deleted this row. We don’t include the updated row in our delta.

This way, AWS SCT optimizes the migrated data, reducing the migration time and the network traffic.

  1. Now, as before, AWS SCT compresses, uploads to Amazon S3, and copies the data into the staging Redshift table:
TRUNCATE TABLE "nz_data4extractor_nz_data4extractor"."_cdc_unit";

COPY "nz_data4extractor_nz_data4extractor"."_cdc_unit" 
    ("id", "ts", "cmnt", "deletexid_")
    FROM 's3://bucket/folder/unit_2.manifest' MANIFEST
    CREDENTIALS '...'
    REGION '...'
    REMOVEQUOTES
    IGNOREHEADER 1
    GZIP
    DELIMITER '|';  
  1. Then, AWS SCT applies the changes to the target table. AWS SCT removes the deleted rows, removes the old version of updated rows, and then inserts new rows and the most recent version of any updated rows:
DELETE 
    FROM  "nz_data4extractor_nz_data4extractor"."cdc_demo"
    USING "nz_data4extractor_nz_data4extractor"."_cdc_unit" t2
    WHERE "nz_data4extractor_nz_data4extractor"."cdc_demo"."id" = t2."id"
        AND COALESCE(CAST("nz_data4extractor_nz_data4extractor"."cdc_demo"."ts" AS VARCHAR),'?#-*') = COALESCE(CAST(t2."ts" AS VARCHAR),'?#-*')
        AND COALESCE(CAST("nz_data4extractor_nz_data4extractor"."cdc_demo"."cmnt" AS VARCHAR),'?#-*') = COALESCE(CAST(t2."cmnt" AS VARCHAR),'?#-*')
        AND t2.deletexid_ > 0;
  
INSERT INTO "nz_data4extractor_nz_data4extractor"."cdc_demo"("id", "ts", "cmnt")
SELECT 
    "id", 
    "ts", 
    "cmnt"
FROM "nz_data4extractor_nz_data4extractor"."_cdc_unit" t2
WHERE t2.deletexid_ = 0;
  1. You can compare the data on the source and target to verify that AWS SCT captured all changes correctly:
SELECT * FROM NZ_DATA4EXTRACTOR.CDC_DEMO
    ORDER BY ID;
ID|TS                     |CMNT            |
--+-----------------------+----------------+
 2|2023-03-07 14:05:11.000|Updated Two     |
 3|2023-03-07 14:05:11.000|Three           |
 4|2023-03-07 14:05:11.000|Yet Again Four  |
 6|2023-03-07 14:07:09.000|Six             |
 8|2023-03-07 14:07:10.000|Updated Eight   |

The data on Amazon Redshift matches exactly:

id|ts                        |cmnt
 2|2023-03-07 14:05:11.000000|Updated Two
 3|2023-03-07 14:05:11.000000|Three
 4|2023-03-07 14:05:11.000000|Yet Again Four
 6|2023-03-07 14:07:09.000000|Six
 8|2023-03-07 14:07:10.000000|Updated Eight

In the previous examples, we showed how to run full load and CDC tasks. You can also create a CDC migration task without the full load. The process is the same—you provide AWS SCT with the transaction ID to start the replication from.

The CDC process does not have a significant impact on the source side. AWS SCT runs only SELECT statements there, using the transaction ID as boundaries for the WHERE clause. The performance impact of these statements is always smaller than the impact of DML statements generated by customer’s applications. For machines where AWS SCT data extraction agents are running, the CDC-related workload is always smaller than the full load workload because the volume of transferred data is smaller.

On the target side, the CDC process can generate considerable additional workload for Amazon Redshift. The reason is that this process issues INSERT and DELETE statements, and these can result in overhead for massively parallel processing (MPP) systems such as Amazon Redshift. Refer to Top 10 performance tuning techniques for Amazon Redshift for best practices and tips on how to boost the performance of your Redshift cluster.

Conclusion

In this post, we showed how to configure ongoing data replication for Netezza database migration to Amazon Redshift. You can use the described approach to automate data migration and replication from your IBM Netezza database to Amazon Redshift. Or, if you’re considering a migration of your existing Netezza workloads to Amazon Redshift, you can use AWS SCT to automatically convert your database schemas and migrate data. Download the latest version of AWS SCT and give it a try!

We’re happy to share these updates to help you in your data warehouse migration projects. In the meantime, you can learn more about Amazon Redshift and AWS SCT. Happy migrating!


About the Authors

Mykhailo Kondak is a Database Engineer in the AWS Database Migration Service team at AWS. He uses his experience with different database technologies to help Amazon customers to move their on-premises data warehouses and big data workloads to the AWS Cloud. In his spare time, he plays soccer.

Illia Kravtsov is a Database Engineer on the AWS Database Migration Service team. He has over 10 years of experience in data warehouse development with Teradata and other massively parallel processing (MPP) databases.

Michael Soo is a Principal Database Engineer in the AWS Database Migration Service. He builds products and services that help customers migrate their database workloads to the AWS Cloud.

Automate marketing campaigns with real-time customer data using Amazon Pinpoint

Post Syndicated from Rushabh Lokhande original https://aws.amazon.com/blogs/messaging-and-targeting/automate-marketing-campaigns-with-real-time-customer-data-using-amazon-pinpoint/

Amazon Pinpoint offers marketers and developers one customizable tool to deliver customer communications across channels, segments, and campaigns at scale. Amazon Pinpoint makes it easy to run targeted campaigns and drive customer communications across different channels: email, SMS, push notifications, in-app messaging, or custom channels. Amazon Pinpoint campaigns enable you to define which users to target, determine which messages to send, schedule the best time to deliver the messages, and then track the results of your campaign.

In many cases, the customer data resides in a third-party system such as a CRM, customer data platform, point of sale system, database, or data warehouse. This customer data represents a valuable asset for your organization, and your marketing team needs to leverage each piece of it to elevate the customer experience.

In this blog post, we demonstrate how you can leverage users’ clickstream data stored in a database to build user segments and launch campaigns using Amazon Pinpoint. We also showcase the full architecture of the data pipeline, which includes other AWS services such as Amazon RDS, AWS Database Migration Service, Amazon Kinesis, and AWS Lambda.

Let us understand the case study with an example: a customer has digital touchpoints such as a website and a mobile app that collect users’ clickstream and behavioral data, which is stored in a MySQL database. The marketing team wants to use the collected data to deliver a personalized experience by leveraging Amazon Pinpoint capabilities.

You can find below the detail of a specific use case covered by the proposed solution:

  • All the clickstream and customer data are stored in MySQL DB
  • Your marketing team wants to create personalized Amazon Pinpoint campaigns based on the user’s status and experience. For example:
    • Customers who are interested in a specific offering receive a campaign tailored to that interest
    • Communications are sent in each user’s preferred language

Note that this example is used to showcase the capabilities of the proposed solution. The solution is not limited to this specific use case, because you can leverage any collected customer dimension or attribute to create a campaign for your own marketing use case.

In this post, we provide a guided journey on how marketers can collect, segment, and activate audience segments in real-time to increase their agility in managing campaigns.

Overview of solution

The use case covered in this post focuses on demonstrating the flexibility offered by Amazon Pinpoint in both the inbound (ingestion) and outbound (activation) streams of customer data. For the inbound stream, Amazon Pinpoint gives you a variety of ways to import your customer data, including:

  1. CSV/JSON import from the AWS console
  2. API operation to create a single or multiple endpoints
  3. Programmatically create and execute import jobs (see the sketch after this list)
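As an illustration of the third option, the following is a minimal boto3 sketch, assuming a placeholder Pinpoint project ID, S3 URL, and IAM role ARN that allows Amazon Pinpoint to read the file.

import boto3

pinpoint = boto3.client("pinpoint", region_name="us-east-1")

# Placeholder identifiers; replace with your project ID, S3 location, and role
response = pinpoint.create_import_job(
    ApplicationId="your-pinpoint-project-id",
    ImportJobRequest={
        "Format": "CSV",                                   # or JSON
        "S3Url": "s3://your-bucket/endpoints/import.csv",  # file with endpoint definitions
        "RoleArn": "arn:aws:iam::111122223333:role/PinpointS3ImportRole",
        "RegisterEndpoints": True,                         # create or update endpoints in the project
    },
)
print("Import job ID:", response["ImportJobResponse"]["Id"])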

We will focus on building a real-time inbound stream of customer data available within an Amazon RDS MySQL database. It is important to mention that a similar approach can be implemented to ingest data from third-party systems, if needed.

For the outbound stream, activating customer data using Amazon Pinpoint can be achieved using the following two methods:

  1. Campaign: a campaign is a messaging initiative that engages a specific audience segment.
  2. Journey: a journey is a customized, multi-step engagement experience.

Customer data activation cannot be completed without specifying the target channel. A channel represents the platform through which you engage your audience segment with messages. For example, Amazon Pinpoint customers can optimize how they target notifications to prospective customers through LINE messages and email, delivering notifications about relevant product information such as sales and new products to the appropriate audience.

Amazon Pinpoint supports the following channels:

  • Push notifications
  • Email
  • SMS
  • Voice
  • In-app messages

In addition to these channels, you can extend the capabilities to meet your specific use case by creating custom channels. You can use custom channels to send messages to your customers through any service that has an API, including third-party services such as WhatsApp or Facebook Messenger. We will focus on developing an Amazon Pinpoint connector that uses a custom channel to target your customers on third-party services through an API, as sketched below.
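The following is a minimal sketch of what such a custom channel Lambda handler could look like, assuming that Amazon Pinpoint passes an Endpoints map in the invocation payload and that the third-party API is represented by a placeholder webhook URL; it is an illustration rather than the exact function deployed by the CDK stack in this post.

import json
import urllib.request

# Placeholder third-party endpoint (for example, a Webhook.site URL used for testing)
THIRD_PARTY_URL = "https://webhook.site/your-unique-id"

def handler(event, context):
    # Amazon Pinpoint invokes the custom channel function with a payload that
    # includes an "Endpoints" map of endpoint ID to endpoint definition.
    endpoints = event.get("Endpoints", {})
    for endpoint_id, endpoint in endpoints.items():
        payload = {
            "endpointId": endpoint_id,
            "address": endpoint.get("Address"),
            "attributes": endpoint.get("Attributes", {}),
            "user": endpoint.get("User", {}),
        }
        request = urllib.request.Request(
            THIRD_PARTY_URL,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        # Deliver the message to the third-party service
        with urllib.request.urlopen(request) as response:
            print(f"Sent endpoint {endpoint_id}, HTTP status {response.status}")
    return {"endpointsProcessed": len(endpoints)}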

Solution Architecture

The below diagram illustrates the proposed architecture to address the use case. Moving from left to right:

Fig 1: Architecture Diagram for the Solution


  1. Amazon RDS: This hosts the customer database, which can have one or many tables containing customer data.
  2. AWS Database Migration Service (AWS DMS): This acts as the glue between Amazon RDS for MySQL and the downstream services by replicating any record-level change that happens in the configured customer tables.
  3. Amazon Kinesis Data Streams: This is the target endpoint for AWS DMS. It carries all the changed records to the next stage of the pipeline.
  4. AWS Lambda (inbound): The inbound AWS Lambda function is triggered by the Kinesis data stream, processes the mutated records, and ingests them into Amazon Pinpoint (see the sketch after this list).
  5. Amazon Pinpoint: This acts as the centralized place to define customer segments and launch campaigns.
  6. AWS Lambda (outbound): This acts as the custom channel destination for the campaigns activated from Amazon Pinpoint.
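As an illustration of the inbound function, the following is a minimal sketch, assuming the default AWS DMS JSON envelope on the Kinesis record (a data section with the row values and a metadata section) and a placeholder Pinpoint project ID supplied through an environment variable; the actual function deployed by the CDK stack may differ.

import base64
import json
import os

import boto3

pinpoint = boto3.client("pinpoint")
APPLICATION_ID = os.environ["PINPOINT_PROJECT_ID"]  # placeholder environment variable

def handler(event, context):
    for record in event["Records"]:
        # Kinesis payloads are base64 encoded; DMS writes a JSON envelope
        # with "data" (row values) and "metadata" (operation details) sections.
        message = json.loads(base64.b64decode(record["kinesis"]["data"]))
        row = message.get("data", {})
        if not row.get("userid"):
            continue  # skip control records
        pinpoint.update_endpoint(
            ApplicationId=APPLICATION_ID,
            EndpointId=str(row["userid"]),
            EndpointRequest={
                "ChannelType": "CUSTOM",
                "Address": row.get("email", ""),
                "Attributes": {
                    "Language": [row.get("language") or ""],
                    "Favourites": [row.get("favourites") or ""],
                },
                "User": {"UserId": str(row["userid"])},
            },
        )
    return {"recordsProcessed": len(event["Records"])}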

To illustrate how to set up this architecture, we’ll walk you through the following steps:

  1. Deploy an AWS CDK stack to provision the required AWS resources.
  2. Validate the deployment.
  3. Test the solution with a sample end-to-end data flow.
  4. Clean up your resources.

Prerequisites

Make sure that you complete the following steps as prerequisites:

The Solution

Launching your AWS CDK Stack

Step 1a: Open your device’s command line or Terminal.

Step 1b: Check out the Git repository to a local directory on your device:

git clone https://github.com/aws-samples/amazon-pinpoint-realtime-campaign-optimization-example.git

Step 2: Change directories to the new directory code location:

cd amazon-pinpoint-realtime-campaign-optimization-example

Step 3: Update your AWS account number and region:

  1. Edit config.py with your tool of choice or from the command line
  2. Look for the section “Account Setup” and update your account number and Region

    Fig 2: Configuring config.py for account-id and region


  3. Look for the section “VPC Parameters” and update your VPC and subnet information

    Fig 3: Configuring config.py for VPC and subnet information


Step 4: Verify if you are in the directory where app.py file is located:

ls -ltr app.py

Step 5: Create a virtual environment:

macOS/Linux:

python3 -m venv .env

Windows:

python -m venv .env

Step 6: Activate the virtual environment after the init process completes and the virtual environment is created:

macOS/Linux:

source .env/bin/activate

Windows:

.env\Scripts\activate.bat

Step 7: Install the required dependencies:

pip3 install -r requirements.txt

Step 8: Bootstrap the cdk app using the following command:

cdk bootstrap aws://<AWS_ACCOUNTID>/<AWS_REGION>

Replace the placeholders AWS_ACCOUNTID and AWS_REGION with your AWS account ID and the Region to be deployed.
This step provisions the initial resources, including an Amazon S3 bucket for storing files and IAM roles that grant permissions needed to perform deployments.

Fig 4: Bootstrapping CDK environment


Note that if you have already bootstrapped the same account and Region previously, you don’t need to bootstrap it again; in that case, skip this step or use a new AWS account.

Step 9: Make sure that your AWS profile is setup along with the region that you want to deploy as mentioned in the prerequisite. Synthesize the templates. AWS CDK apps use code to define the infrastructure, and when run they produce or “synthesize” a CloudFormation template for each stack defined in the application:

cdk synthesize

Step 10: Deploy the solution. By default, some actions that could potentially make security changes require approval. In this deployment, you’re creating an IAM role. The following command overrides the approval prompts, but if you would like to manually accept the prompts, then omit the --require-approval never flag:

cdk deploy "*" --require-approval never

While the AWS CDK deploys the CloudFormation stacks, you can follow the deployment progress in your terminal.

Fig 5: AWS CDK Deployment progress in terminal


Once the deployment is successful, you’ll see the successful status as follows:

Fig 6: AWS CDK Deployment completion success


Step 11: Log in to the AWS Console, go to CloudFormation, and see the output of the ApplicationStack:

Fig 7: AWS CloudFormation stack output


Note the values of the PinpointProjectId, PinpointProjectName, and RDSSecretName variables. We’ll use them in the following steps.

Testing The Solution

In this section we will create a full data flow using the below steps:

  1. Ingest data into the customer_tb table within the Amazon RDS MySQL DB instance
  2. Validate that the AWS DMS replication task is replicating the changes to the Amazon Kinesis data stream
  3. Validate that endpoints are created within Amazon Pinpoint
  4. Create an Amazon Pinpoint segment and campaign and activate data to the Webhook.site endpoint URL

Step 1: Connect to MySQL DB instance and create customer database

    1. Sign in to the AWS Management Console and open the AWS Cloud9 console at https://console.aws.amazon.com/cloud9 
    2. Click Create environment
      • Name: mysql-cloud9-01 (for example)
      • Click Next
      • Environment type: Create a new EC2 instance for environment (direct access)
      • Instance type: t2.micro
      • Timeout: 30 minutes
      • Platform: Amazon Linux 2
      • Under Network settings (VPC settings), select the same VPC where the MySQL DB instance was created (this is the same VPC and subnet from step 3.3)
      • Click Next
      • Review and click Create environment
    3. Select the created AWS Cloud9 from the AWS Cloud9 console at https://console.aws.amazon.com/cloud9  and click Open in Cloud9. You will have access to AWS Cloud9 Linux shell.
    4. From the Linux shell, update the operating system:
      sudo yum update -y
    5. From the Linux shell, install the MySQL client:
      sudo yum install -y mysql
    6. To connect to the created MySQL RDS DB instance, use the below command in the AWS Cloud9 Linux shell:
      mysql -h <<host>> -P 3308 --user=<<username>> --password=<<password>>
      • To get the values for the host (dbInstanceIdentifier), username, and password:
        • Navigate to the AWS Secrets Manager service
        • Open the secret with the name created by the CDK application
        • Select ‘Reveal secret value’ and copy the respective values and replace in your command
      • After you enter the password for the user, you should see output similar to the following.
        Welcome to the MariaDB monitor.  Commands end with ; or \g.
        Your MySQL connection id is 27
        Server version: 8.0.32 Source distribution
        Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
        Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
        MySQL [(none)]>

Step 2: Ingest data into the customer_tb table within the Amazon RDS MySQL DB instance

Once the connection to the MySQL DB instance is established, execute the following commands using the same AWS Cloud9 Linux shell connected to the MySQL RDS DB instance.

  • Create database pinpoint-test-db:
    CREATE DATABASE `pinpoint-test-db`;
  • Create the customer_tb table:
    Use `pinpoint-test-db`;
    CREATE TABLE `customer_tb` (`userid` int NOT NULL,
                                `email` varchar(150) DEFAULT NULL,
                                `language` varchar(45) DEFAULT NULL,
                                `favourites` varchar(250) DEFAULT NULL,
                                PRIMARY KEY (`userid`));
  • You can verify the schema using the below SQL command:
    DESCRIBE `pinpoint-test-db`.customer_tb;
    Fig 8: Verify schema for customer_db table


    • Insert records in customer_tb table:
    Use `pinpoint-test-db`;
    insert into customer_tb values (1,'[email protected]','english','football');
    insert into customer_tb values (2,'[email protected]','english','basketball');
    insert into customer_tb values (3,'[email protected]','french','football');
    insert into customer_tb values (4,'[email protected]','french','football');
    insert into customer_tb values (5,'[email protected]','french','basketball');
    insert into customer_tb values (6,'[email protected]','french','football');
    insert into customer_tb values (7,'[email protected]','french',null);
    insert into customer_tb values (8,'[email protected]','english','football');
    insert into customer_tb values (9,'[email protected]','english','football');
    insert into customer_tb values (10,'[email protected]','english',null);
    • Verify records in customer_tb table:
    select * from `pinpoint-test-db`.`customer_tb`;
    Fig 9: Verify data for customer_db table


    Step 3: Validate that the AWS DMS replication task is replicating the changes to the Amazon Kinesis data stream

      1. Sign in to the AWS Management Console and open the AWS DMS console at https://console.aws.amazon.com/dms/v2 
      2. From the navigation panel, choose Database migration tasks.
      3. Click on the replication task created by the CDK code, ‘dmsreplicationtask-*’
      4. Start the replication task

        Fig 10: Starting AWS DMS Replication Task


      5. Make sure that Status is Replication ongoing

        Fig 11: AWS DMS Replication statistics


      6. Navigate to Table statistics and make sure that the number of Inserts is equal to 10 and the Load state is Table completed.

        Fig 12: AWS DMS Replication statistics


    Step 4: Validate that endpoints are created within Amazon Pinpoint

    1. Sign in to the AWS Management Console and open the Amazon Pinpoint console at https://console.aws.amazon.com/pinpoint/ 
    2. Click on Amazon Pinpoint Project Demo created by CDK stack “dev-pinpoint-project”
    3. From the left menu, click on Analytics and validate that the Active targetable endpoints are equal to 10 as shown below:
    Fig 13: Amazon Pinpoint endpoint summary


    Step 5: Create Amazon Pinpoint Segment and Campaign

    Step 5.1: Create Amazon Pinpoint Segment

    • Sign in to the AWS Management Console and open the Amazon Pinpoint console at https://console.aws.amazon.com/pinpoint/ 
    • Click on Amazon Pinpoint Project Demo created by CDK stack “dev-pinpoint-project”
    • From the left menu, click on Segments and click Create a segment
    • Create the segment using the following configuration:
      • Name: English Speakers
      • Under criteria:
        • Attribute: Language
        • Operator: Contains
        • Value: english
    Fig 14: Amazon Pinpoint segment summary


    • Click create segment

    Step 5.2: Create Amazon Pinpoint Campaign

    • From the left menu, click on Campaigns and click Create a campaign
    • Set the Campaign name to test campaign and select the Custom option for Channel as shown below:
    Fig 15: Amazon Pinpoint create campaign


    • Click Next
    • Select English Speakers from Segment as shown below and click Next:
    Fig 16: Amazon Pinpoint segment


    • Choose the Lambda function channel type and select the outbound Lambda function with the name pattern ApplicationStack-lambdaoutboundfunction* from the dropdown as shown below:
    Fig 17: Amazon Pinpoint message creation


    • Click Next
    • Choose the At a specific time option and Immediately to send the campaign as shown below:
    Fig 18: Amazon Pinpoint campaign scheduling


    If you push more messages or records into Amazon RDS (step 2), you will need to create a new campaign (step 5.2) to process the new messages.

    • Click Next, review the configuration and click Launch campaign.
    • Navigate to dev-pinpoint-project and select the campaign created in the previous step. You should see the status ‘Complete’.
    Fig 19: Amazon Pinpoint campaign processing status


    • Navigate to the dev-pinpoint-project dashboard and select your campaign in the ‘Campaign metrics’ dashboard; you will see the statistics for the processing.
    Fig 20: Amazon Pinpoint campaign metrics


    Accomplishments

    This is a quick summary of what we accomplished:

    1. Created an Amazon RDS MySQL DB instance and defined the customer_tb table schema
    2. Created an Amazon Kinesis data stream
    3. Replicated database changes from the Amazon RDS MySQL DB to the Amazon Kinesis data stream
    4. Created an AWS Lambda function triggered by the Amazon Kinesis data stream to ingest database records into Amazon Pinpoint as user endpoints using the AWS SDK
    5. Created an Amazon Pinpoint project, segment, and campaign
    6. Created an AWS Lambda function as a custom channel for the Amazon Pinpoint campaign
    7. Tested the end-to-end data flow from the Amazon RDS MySQL DB instance to a third-party endpoint

    Next Steps

    You have now gained a good understanding of the Amazon Pinpoint agnostic data flow, but there are still many areas left for exploration. What this post hasn’t covered is the operation of other communication channels such as email, SMS, push notifications, and outbound voice. You can enable the channels that are pertinent to your use case and send messages using campaigns or journeys.

    Clean Up

    Make sure that you clean up all of the other AWS resources that you created in the AWS CDK Stack deployment. You can delete these resources via the AWS CDK Destroy command as follows or the CloudFormation console.

    To destroy the resources using AWS CDK, follow these steps:

    • Follow Steps 1-5 from the ‘Launching your CDK Stack’ section.
    • Destroy the app by executing the following command:
    cdk destroy

    Summary

    In this post, you have gained a good understanding of the flexible real-time data flow of Amazon Pinpoint. By implementing the steps detailed in this blog post, you can achieve a seamless integration of your customer data from an Amazon RDS MySQL database to Amazon Pinpoint, where you can leverage segments and campaigns to activate data using custom channels to third-party services via API. The demonstrated use case focuses on an Amazon RDS MySQL database as the data source, but there are still many areas left for exploration. What this post hasn’t covered is integrating customer data from other types of data sources such as MongoDB, Microsoft SQL Server, or Google Cloud. Also, other communication channels such as email, SMS, push notifications, and outbound voice can be used in the activation layer. You can enable the channels that are pertinent to your use case, send messages using campaigns or journeys, get a complete view of your customers across all touchpoints, and deliver more relevant marketing campaigns.

    About the Authors

    Bret Pontillo is a Senior Data Architect with AWS Professional Services Analytics Practice. He helps customers implement big data and analytics solutions. Outside of work, he enjoys spending time with family, traveling, and trying new food.

    Rushabh Lokhande is a Data & ML Engineer with AWS Professional Services Analytics Practice. He helps customers implement big data, machine learning, and analytics solutions. Outside of work, he enjoys spending time with family, reading, running, and golf.

    Ghandi Nader is a Senior Partner Solution Architect focusing on the Adtech and Martech industry. He helps customers and partners innovate and align with the market trends related to the industry. Outside of work, he enjoys spending time cycling and watching formula one.

Orchestrate Amazon EMR Serverless jobs with AWS Step Functions

Post Syndicated from Naveen Balaraman original https://aws.amazon.com/blogs/big-data/orchestrate-amazon-emr-serverless-jobs-with-aws-step-functions/

Amazon EMR Serverless provides a serverless runtime environment that simplifies the operation of analytics applications that use the latest open source frameworks, such as Apache Spark and Apache Hive. With EMR Serverless, you don’t have to configure, optimize, secure, or operate clusters to run applications with these frameworks. You can run analytics workloads at any scale with automatic scaling that resizes resources in seconds to meet changing data volumes and processing requirements. EMR Serverless automatically scales resources up and down to provide just the right amount of capacity for your application, and you only pay for what you use.

AWS Step Functions is a serverless orchestration service that enables developers to build visual workflows for applications as a series of event-driven steps. Step Functions ensures that the steps in the serverless workflow are followed reliably, that information is passed between stages, and that errors are handled automatically.

The integration between AWS Step Functions and Amazon EMR Serverless makes it easier to manage and orchestrate big data workflows. Before this integration, you had to manually poll for job statuses or implement waiting mechanisms through API calls. Now, with the support for “Run a Job (.sync)” integration, you can more efficiently manage your EMR Serverless jobs. Using .sync allows your Step Functions workflow to wait for the EMR Serverless job to complete before moving on to the next step, effectively making job execution part of your state machine. Similarly, the “Request Response” pattern can be useful for triggering a job and immediately getting a response back, all within the confines of your Step Functions workflow. This integration simplifies your architecture by eliminating the need for additional steps to monitor job status, making the whole system more efficient and easier to manage.
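
For comparison, the following is a minimal boto3 sketch of the manual polling pattern that the .sync integration removes; the application ID, role ARN, and script location are placeholders rather than values from this post:

import time
import boto3

emr = boto3.client("emr-serverless")

# Submit a job run to an existing EMR Serverless application (IDs and paths are placeholders)
job = emr.start_job_run(
    applicationId="00example-app-id",
    executionRoleArn="arn:aws:iam::111122223333:role/EMR-Serverless-Role",
    jobDriver={"sparkSubmit": {"entryPoint": "s3://my-bucket/scripts/job.py"}},
)

# Poll until the job reaches a terminal state -- the loop that Step Functions now handles for you
while True:
    state = emr.get_job_run(
        applicationId="00example-app-id", jobRunId=job["jobRunId"]
    )["jobRun"]["state"]
    if state in ("SUCCESS", "FAILED", "CANCELLED"):
        print(f"Job finished with state {state}")
        break
    time.sleep(30)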

In this post, we explain how you can orchestrate a PySpark application using Amazon EMR Serverless and AWS Step Functions. We run a Spark job on EMR Serverless that processes Citi Bike dataset data in an Amazon Simple Storage Service (Amazon S3) bucket and stores the aggregated results in Amazon S3.

Solution Overview

We demonstrate this solution with an example using the Citi Bike dataset. This dataset includes numerous parameters such as Rideable type, Start station, Started at, End station, Ended at, and various other details about Citi Bike rides. Our objective is to find the minimum, maximum, and average bike trip duration in a given month.

In this solution, the input data is read from the S3 input path, transformations and aggregations are applied with the PySpark code, and the summarized output is written to the S3 output path s3://<bucket-name>/serverlessout/.
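
The PySpark script used in this post is provided as a prebuilt artifact, but the aggregation it performs is conceptually similar to the following sketch; the column names (started_at, ended_at) and the paths are assumptions for illustration, not the exact contents of bikeaggregator.py:

import sys
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Input and output S3 paths are passed as job arguments
input_path, output_path = sys.argv[1], sys.argv[2]

spark = SparkSession.builder.appName("BikeAggregator").getOrCreate()

trips = spark.read.option("header", "true").csv(input_path)

# Derive trip duration in seconds from the start and end timestamps
trips = trips.withColumn(
    "duration_sec",
    F.unix_timestamp("ended_at") - F.unix_timestamp("started_at"),
)

# Minimum, maximum, and average trip duration for the month
summary = trips.agg(
    F.min("duration_sec").alias("min_duration"),
    F.max("duration_sec").alias("max_duration"),
    F.avg("duration_sec").alias("avg_duration"),
)

summary.write.mode("overwrite").csv(output_path, header=True)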

The solution is implemented as follows:

  • Creates an EMR Serverless application with Spark runtime. After the application is created, you can submit the data-processing jobs to that application. This API step waits for Application creation to complete.
  • Submits the PySpark job and waits for its completion with the StartJobRun (.sync) API. This allows you to submit a job to an Amazon EMR Serverless application and wait until the job completes.
  • After the PySpark job completes, the summarized output is available in the S3 output directory.
  • If the job encounters an error, the state machine workflow will indicate a failure. You can inspect the specific error within the state machine. For a more detailed analysis, you can also check the EMR job failure logs in the EMR studio console.

Prerequisites

Before you get started, make sure you have the following prerequisites:

  • An AWS account
  • An IAM user with administrator access
  • An S3 bucket

Solution Architecture

To automate the complete process, we use the following architecture, which integrates Step Functions for orchestration and Amazon EMR Serverless for data transformations. Summarized output is then written to Amazon S3 bucket.

The following diagram illustrates the architecture for this use case.

Deployment steps

Before beginning this tutorial, ensure that the role being used to deploy has all the relevant permissions to create the required resources as part of the solution. The roles with the appropriate permissions will be created through a CloudFormation template using the following steps.

Step 1: Create a Step Functions state machine

You can create a Step Functions state machine workflow in two ways: directly through code, or through the Step Functions Workflow Studio graphical interface. To create a state machine, follow the steps in either Option 1 or Option 2 below.

Option 1: Create the state machine through code directly

To create a Step Functions state machine along with the necessary IAM roles, complete the following steps:

  1. Launch the CloudFormation stack using this link. On the CloudFormation console, provide a stack name and accept the defaults to create the stack. Once the CloudFormation deployment completes, the following resources are created. In addition, the EMR service-linked role needed to access EMR Serverless is created automatically by this stack:
    • S3 bucket to upload the PySpark script and write output data from EMR Serverless job. We recommend enabling default encryption on your S3 bucket to encrypt new objects, as well as enabling access logging to log all requests made to the bucket. Following these recommendations will improve security and provide visibility into access of the bucket.
    • EMR Serverless Runtime role that provides granular permissions to specific resources that are required when EMR Serverless jobs run.
    • Step Functions Role to grant AWS Step Functions permissions to access the AWS resources that will be used by its state machines.
    • State Machine with EMR Serverless steps.

  2. To prepare the S3 bucket with the PySpark script, open AWS CloudShell from the toolbar on the top right corner of the AWS console and run the following AWS CLI command in CloudShell (make sure to replace <<ACCOUNT-ID>> with your AWS account ID):

aws s3 cp s3://aws-blogs-artifacts-public/artifacts/BDB-3594/bikeaggregator.py s3://serverless-<<ACCOUNT-ID>>-blog/scripts/

  3. To prepare the S3 bucket with the input data, run the following AWS CLI command in CloudShell (make sure to replace <<ACCOUNT-ID>> with your AWS account ID):

aws s3 cp s3://aws-blogs-artifacts-public/artifacts/BDB-3594/201306-citibike-tripdata.csv s3://serverless-<<ACCOUNT-ID>>-blog/data/ --copy-props none

Option 2: Create the Step Functions state machine through Workflow Studio

Prerequisites

Before creating the state machine through Workflow Studio, ensure that all the relevant roles and resources are created as part of the solution.

  1. To deploy the necessary IAM roles and S3 bucket into your AWS account, launch the CloudFormation stack using this link. Once the CloudFormation deployment completes, the following resources are created:
    • S3 bucket to upload the PySpark script and write output data. We recommend enabling default encryption on your S3 bucket to encrypt new objects, as well as enabling access logging to log all requests made to the bucket. Following these recommendations will improve security and provide visibility into access of the bucket.
    • EMR Serverless Runtime role that provides granular permissions to specific resources that are required when EMR Serverless jobs run.
    • Step Functions Role to grant AWS Step Functions permissions to access the AWS resources that will be used by its state machines.

  2. To prepare the S3 bucket with the PySpark script, open AWS CloudShell from the toolbar on the top right of the AWS console and run the following AWS CLI command in CloudShell (make sure to replace <<ACCOUNT-ID>> with your AWS account ID):

aws s3 cp s3://aws-blogs-artifacts-public/artifacts/BDB-3594/bikeaggregator.py s3://serverless-<<ACCOUNT-ID>>-blog/scripts/

  3. To prepare the S3 bucket with the input data, run the following AWS CLI command in CloudShell (make sure to replace <<ACCOUNT-ID>> with your AWS account ID):

aws s3 cp s3://aws-blogs-artifacts-public/artifacts/BDB-3594/201306-citibike-tripdata.csv s3://serverless-<<ACCOUNT-ID>>-blog/data/ --copy-props none

To create a Step Functions state machine, complete the following steps:

  1. On the Step Functions console, choose Create state machine.
  2. Keep the Blank template selected, and click Select.
  3. In the Actions menu on the left, Step Functions provides a list of AWS service APIs that you can drag and drop into your workflow graph in the design canvas. Type EMR Serverless in the search box and drag the Amazon EMR Serverless CreateApplication state onto the workflow graph.

  4. In the canvas, select the Amazon EMR Serverless CreateApplication state to configure its properties. The Inspector panel on the right shows configuration options. Provide the following configuration values:
    • Change the State name to Create EMR Serverless Application
    • Provide the following values to the API Parameters. This creates an EMR Serverless Application with Apache Spark based on Amazon EMR release 6.12.0 using default configuration settings.
      {
          "Name": "ServerlessBikeAggr",
          "ReleaseLabel": "emr-6.12.0",
          "Type": "SPARK"
      }

    • Click the Wait for task to complete – optional check box to wait for EMR Serverless Application creation state to complete before executing the next state.
    • Under Next state, select the Add new state option from the drop-down.
  5. Drag the EMR Serverless StartJobRun state from the left panel onto the next state in the workflow.
    • Rename State name to Submit PySpark Job
    • Provide the following values in the API parameters and click Wait for task to complete – optional (make sure to replace <<ACCOUNT-ID>> with your AWS Account ID).
{
"ApplicationId.$": "$.ApplicationId",
    "ExecutionRoleArn": "arn:aws:iam::<<ACCOUNT-ID>>:role/EMR-Serverless-Role-<<ACCOUNT-ID>>",
    "JobDriver": {
        "SparkSubmit": {
            "EntryPoint": "s3://serverless-<<ACCOUNT-ID>>-blog/scripts/bikeaggregator.py",
            "EntryPointArguments": [
                "s3://serverless-<<ACCOUNT-ID>>-blog/data/",
                "s3://serverless-<<ACCOUNT-ID>>-blog/serverlessout/"
            ],
            "SparkSubmitParameters": "--conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
        }
    }
}

  6. Select the Config tab for the state machine from the top and change the following configurations:
    • Change the State machine name, found under Details, to EMRServerless-BikeAggr.
    • In the Permissions section, select StateMachine-Role-<<ACCOUNT-ID>> from the dropdown for Execution role. (Make sure that you replace <<ACCOUNT-ID>> with your AWS Account ID).
  7. Continue to add steps for Check Job Success from the studio as shown in the following diagram.

  8. Click Create to create the Step Functions state machine for orchestrating the EMR Serverless jobs.

Step 2: Invoke the Step Functions state machine

Now that the state machine is created, we can invoke it by choosing the Start execution button on the Step Functions console.
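
If you prefer to start the workflow from code rather than the console, a minimal boto3 sketch such as the following also works; the state machine ARN shown is a placeholder:

import json
import boto3

sfn = boto3.client("stepfunctions")

# The state machine ARN is a placeholder -- copy yours from the Step Functions console
response = sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:111122223333:stateMachine:EMRServerless-BikeAggr",
    input=json.dumps({}),
)
print(response["executionArn"])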

When the state machine is invoked, it presents its run flow as shown in the following screenshot. Because we selected the Wait for task to complete configuration (.sync API) for this step, the next step does not start until the EMR Serverless application is created (blue represents the Amazon EMR Serverless application being created).

After successfully creating the EMR Serverless Application, we submit a PySpark Job to that Application.

When the EMR Serverless job completes, the Submit PySpark Job step changes to green. This is because we have selected the Wait for task to complete configuration (using the .sync API) for this step.

You can find the EMR Serverless application ID as well as the PySpark job run ID on the Output tab of the Submit PySpark Job step.

Step 3: Validation

To confirm the successful completion of the job, navigate to the EMR Serverless console and find the EMR Serverless application ID. Click the application ID to find the execution details for the PySpark job run submitted from Step Functions.

To verify the output of the job execution, you can check the S3 bucket where the output will be stored in a .csv file as shown in the following graphic.
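
If you would rather verify the output from a script than browse the console, a short boto3 sketch like the following lists the generated files; the bucket name is a placeholder for the bucket created by the CloudFormation stack:

import boto3

s3 = boto3.client("s3")

# Replace with the bucket created by the CloudFormation stack
bucket = "serverless-111122223333-blog"

# List the aggregated output written by the PySpark job
response = s3.list_objects_v2(Bucket=bucket, Prefix="serverlessout/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])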

Cleanup

Log in to the AWS Management Console and delete any S3 buckets created by this deployment to avoid unwanted charges to your AWS account. For example: s3://serverless-<<ACCOUNT-ID>>-blog/

Then, to clean up your environment, delete the CloudFormation stack that you created in the solution configuration steps.

Finally, delete the Step Functions state machine you created as part of this solution.

Conclusion

In this post, we explained how to launch an Amazon EMR Serverless Spark job with Step Functions using Workflow Studio to implement a simple ETL pipeline that creates aggregated output from the Citi Bike dataset and generates reports.

We hope this gives you a great starting point for using this solution with your datasets and applying more complex business rules to solve your transient cluster use cases.

Do you have follow-up questions or feedback? Leave a comment. We’d love to hear your thoughts and suggestions.


About the Authors

Naveen Balaraman is a Sr Cloud Application Architect at Amazon Web Services. He is passionate about Containers, Serverless, Architecting Microservices and helping customers leverage the power of AWS cloud.

Karthik Prabhakar is a Senior Big Data Solutions Architect for Amazon EMR at AWS. He is an experienced analytics engineer working with AWS customers to provide best practices and technical advice in order to assist their success in their data journey.

Parul Saxena is a Big Data Specialist Solutions Architect at Amazon Web Services, focused on Amazon EMR, Amazon Athena, AWS Glue and AWS Lake Formation, where she provides architectural guidance to customers for running complex big data workloads over AWS platform. In her spare time, she enjoys traveling and spending time with her family and friends.

Now available: Building a scalable vulnerability management program on AWS

Post Syndicated from Anna McAbee original https://aws.amazon.com/blogs/security/now-available-how-to-build-a-scalable-vulnerability-management-program-on-aws/

Vulnerability findings in a cloud environment can come from a variety of tools and scans depending on the underlying technology you’re using. Without processes in place to handle these findings, they can begin to mount, often leading to thousands to tens of thousands of findings in a short amount of time. We’re excited to announce the Building a scalable vulnerability management program on AWS guide, which includes how you can build a structured vulnerability management program, operationalize tooling, and scale your processes to handle a large number of findings from diverse sources.

Building a scalable vulnerability management program on AWS focuses on the fundamentals of building a cloud vulnerability management program, including traditional software and network vulnerabilities and cloud configuration risks. The guide covers how to build a successful and scalable vulnerability management program on AWS through preparation, enabling and configuring tools, triaging findings, and reporting.

Targeted outcomes

This guide can help you and your organization with the following:

  • Develop policies to streamline vulnerability management and maintain accountability.
  • Establish mechanisms to extend the responsibility of security to your application teams.
  • Configure relevant AWS services according to best practices for scalable vulnerability management.
  • Identify patterns for routing security findings to support a shared responsibility model.
  • Establish mechanisms to report on and iterate on your vulnerability management program.
  • Improve security finding visibility and help improve overall security posture.

Using the new guide

We encourage you to read the entire guide before taking action or building a list of changes to implement. After you read the guide, assess your current state compared to the action items and check off the items that you’ve already completed in the Next steps table. This will help you assess the current state of your AWS vulnerability management program. Then, plan short-term and long-term roadmaps based on your gaps, desired state, resources, and business needs. Building a cloud vulnerability management program often involves iteration, so you should prioritize key items and regularly revisit your backlog to keep up with technology changes and your business requirements.

Further information

For more information and to get started, see the Building a scalable vulnerability management program on AWS guide.

We greatly value feedback and contributions from our community. To share your thoughts and insights about the guide, your experience using it, and what you want to see in future versions, select Provide feedback at the bottom of any page in the guide and complete the form.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Author

Anna McAbee

Anna is a Security Specialist Solutions Architect focused on threat detection and incident response at AWS. Before AWS, she worked as an AWS customer in financial services on both the offensive and defensive sides of security. Outside of work, Anna enjoys cheering on the Florida Gators football team, wine tasting, and traveling the world.

Author

Megan O’Neil

Megan is a Principal Security Specialist Solutions Architect focused on Threat Detection and Incident Response. Megan and her team enable AWS customers to implement sophisticated, scalable, and secure solutions that solve their business challenges. Outside of work, Megan loves to explore Colorado, including mountain biking, skiing, and hiking.

Modernize a legacy real-time analytics application with Amazon Managed Service for Apache Flink

Post Syndicated from Bhupesh Sharma original https://aws.amazon.com/blogs/big-data/modernize-a-legacy-real-time-analytics-application-with-amazon-managed-service-for-apache-flink/

Organizations with legacy, on-premises, near-real-time analytics solutions typically rely on self-managed relational databases as their data store for analytics workloads. To reap the benefits of cloud computing, like increased agility and just-in-time provisioning of resources, organizations are migrating their legacy analytics applications to AWS. The lift and shift migration approach is limited in its ability to transform businesses because it relies on outdated, legacy technologies and architectures that limit flexibility and slow down productivity. In this post, we discuss ways to modernize your legacy, on-premises, real-time analytics architecture to build serverless data analytics solutions on AWS using Amazon Managed Service for Apache Flink.

Near-real-time streaming analytics captures the value of operational data and metrics to provide new insights to create business opportunities. In this post, we discuss challenges with relational databases when used for real-time analytics and ways to mitigate them by modernizing the architecture with serverless AWS solutions. We introduce you to Amazon Managed Service for Apache Flink Studio and get started querying streaming data interactively using Amazon Kinesis Data Streams.

Solution overview

In this post, we walk through a call center analytics solution that provides insights into the call center’s performance in near-real time through metrics that determine agent efficiency in handling calls in the queue. Key performance indicators (KPIs) of interest for a call center from a near-real-time platform could be calls waiting in the queue, highlighted in a performance dashboard within a few seconds of data ingestion from call center streams. These metrics help agents improve their call handle time and also reallocate agents across organizations to handle pending calls in the queue.

Traditionally, such a legacy call center analytics platform would be built on a relational database that stores data from streaming sources. Data transformations through stored procedures and use of materialized views to curate datasets and generate insights is a known pattern with relational databases. However, as data loses its relevance with time, transformations in a near-real-time analytics platform need only the latest data from the streams to generate insights. This may require frequent truncation in certain tables to retain only the latest stream of events. Also, the need to derive near-real-time insights within seconds requires frequent materialized view refreshes in this traditional relational database approach. Frequent materialized view refreshes on top of constantly changing base tables due to streamed data can lead to snapshot isolation errors. Also, a data model that allows table truncations at a regular frequency (for example, every 15 seconds) to store only relevant data in tables can cause locking and performance issues.

The following diagram provides the high-level architecture of a legacy call center analytics platform. In this traditional architecture, a relational database is used to store data from streaming data sources. Datasets used for generating insights are curated using materialized views inside the database and published for business intelligence (BI) reporting.

Modernizing this traditional database-driven architecture in the AWS Cloud allows you to use sophisticated streaming technologies like Amazon Managed Service for Apache Flink, which are built to transform and analyze streaming data in real time. With Amazon Managed Service for Apache Flink, you can gain actionable insights from streaming data with serverless, fully managed Apache Flink. You can use Amazon Managed Service for Apache Flink to quickly build end-to-end stream processing applications and process data continuously, getting insights in seconds or minutes. With Amazon Managed Service for Apache Flink, you can use Apache Flink code or Flink SQL to continuously generate time-series analytics over time windows and perform sophisticated joins across streams.

The following architecture diagram illustrates how a legacy call center analytics platform running on databases can be modernized to run on the AWS Cloud using Amazon Managed Service for Apache Flink. It shows a call center streaming data source that sends the latest call center feed every 15 seconds. The second streaming data source constitutes metadata information about the call center organization and agents that gets refreshed throughout the day. You can perform sophisticated joins over these streaming datasets and create views on top of them using Amazon Managed Service for Apache Flink to generate KPIs required for the business using Amazon OpenSearch Service. You can analyze streaming data interactively using managed Apache Zeppelin notebooks with Amazon Managed Service for Apache Flink Studio in near-real time. The near-real-time insights can then be visualized as a performance dashboard using OpenSearch Dashboards.

In this post, you perform the following high-level implementation steps:

  1. Ingest data from streaming data sources to Kinesis Data Streams.
  2. Use managed Apache Zeppelin notebooks with Amazon Managed Service for Apache Flink Studio to transform the stream data within seconds of data ingestion.
  3. Visualize KPIs of call center performance in near-real time through OpenSearch Dashboards.

Prerequisites

This post requires you to set up the Amazon Kinesis Data Generator (KDG) to send data to a Kinesis data stream using an AWS CloudFormation template. For the template and setup information, refer to Test Your Streaming Data Solution with the New Amazon Kinesis Data Generator.

We use two datasets in this post. The first dataset is fact data, which contains call center organization data. The KDG generates a fact data feed every 15 seconds that contains the following information:

  • AgentId – Agents work in a call center setting surrounded by other call center employees answering customers’ questions and referring them to the necessary resources to solve their problems.
  • OrgId – A call center contains different organizations and departments, such as Care Hub, IT Hub, Allied Hub, Prem Hub, Help Hub, and more.
  • QueueId – Call queues provide an effective way to route calls using simple or sophisticated patterns to ensure that all calls are getting into the correct hands quickly.
  • WorkMode – An agent work mode determines your current state and availability to receive incoming calls from the Automatic Call Distribution (ACD) and Direct Agent Call (DAC) queues. Call Center Elite does not route ACD and DAC calls to your phone when you are in an Aux mode or ACW mode.
  • WorkSkill – Working as a call center agent requires several soft skills to see the best results, like problem-solving, bilingualism, channel experience, aptitude with data, and more.
  • HandleTime – This customer service metric measures the length of a customer’s call.
  • ServiceLevel – The call center service level is defined as the percentage of calls answered within a predefined amount of time—the target time threshold. It can be measured over any period of time (such as 30 minutes, 1 hour, 1 day, or 1 week) and for each agent, team, department, or the company as a whole.
  • WorkStates – This specifies what state an agent is in. For example, an agent in an available state is available to handle calls from an ACD queue. An agent can have several states with respect to different ACD devices, or they can use a single state to describe their relationship to all ACD devices. Agent states are reported in agent-state events.
  • WaitingTime – This is the average time an inbound call spends waiting in the queue or waiting for a callback if that feature is active in your IVR system.
  • EventTime – This is the time when the call center stream is sent (via the KDG in this post).

The following fact payload is used in the KDG to generate sample fact data:

{
"AgentId" : {{random.number(
{
"min":2001,
"max":2005
}
)}},
"OrgID" : {{random.number(
{
"min":101,
"max":105
}
)}},
"QueueId" : {{random.number(
{
"min":1,
"max":5
}
)}},
"WorkMode" : "{{random.arrayElement(
["TACW","ACW","AUX"]
)}}",
"WorkSkill": "{{random.arrayElement(
["Problem Solving","Bilingualism","Channel experience","Aptitude with data"]
)}}",
"HandleTime": {{random.number(
{
"min":1,
"max":150
}
)}},
"ServiceLevel":"{{random.arrayElement(
["Sev1","Sev2","Sev3"]
)}}",
"WorkSates": "{{random.arrayElement(
["Unavailable","Available","On a call"]
)}}",
"WaitingTime": {{random.number(
{
"min":10,
"max":150
}
)}},
"EventTime":"{{date.utc("YYYY-MM-DDTHH:mm:ss")}}"
}

The following screenshot shows the output of the sample fact data in an Amazon Managed Service for Apache Flink notebook.

The second dataset is dimension data. This data contains metadata information like organization names for their respective organization IDs, agent names, and more. The frequency of the dimension dataset is twice a day, whereas the fact dataset gets loaded every 15 seconds. In this post, we use Amazon Simple Storage Service (Amazon S3) as a data storage layer to store metadata information (Amazon DynamoDB can be used to store metadata information as well). We use AWS Lambda to load metadata from Amazon S3 to another Kinesis data stream that stores metadata information. The following JSON file stored in Amazon S3 has metadata mappings to be loaded into the Kinesis data stream:

[{"OrgID": 101,"OrgName" : "Care Hub","Region" : "APAC"},
{"OrgID" : 102,"OrgName" : "Prem Hub","Region" : "AMER"},
{"OrgID" : 103,"OrgName" : "IT Hub","Region" : "EMEA"},
{"OrgID" : 104,"OrgName" : "Help Hub","Region" : "EMEA"},
{"OrgID" : 105,"OrgName" : "Allied Hub","Region" : "LATAM"}]

Ingest data from streaming data sources to Kinesis Data Streams

To start ingesting your data, complete the following steps:

  1. Create two Kinesis data streams for the fact and dimension datasets, as shown in the following screenshot. For instructions, refer to Creating a Stream via the AWS Management Console.

  2. Create a Lambda function on the Lambda console to load metadata files from Amazon S3 to Kinesis Data Streams. Use the following code:
    import boto3
    import json

    # Create Amazon S3 and Kinesis clients
    s3_client = boto3.client("s3")
    S3_BUCKET = '<S3 Bucket Name>'
    kinesis_client = boto3.client("kinesis")
    stream_name = '<Kinesis Stream Name>'

    def lambda_handler(event, context):

      # Read the metadata file from Amazon S3
      object_key = "<S3 File Name.json>"
      file_content = s3_client.get_object(
          Bucket=S3_BUCKET, Key=object_key)["Body"].read()

      # Decode the S3 object and parse it into a list of metadata records
      decoded_data = file_content.decode("utf-8").replace("'", '"')
      records = json.loads(decoded_data)

      # Upload each metadata record to the Kinesis data stream,
      # using the record's OrgID as the partition key
      for record in records:
        kinesis_client.put_record(
            StreamName=stream_name,
            Data=json.dumps(record),
            PartitionKey=str(record["OrgID"]))

Use managed Apache Zeppelin notebooks with Amazon Managed Service for Apache Flink Studio to transform the streaming data

The next step is to create tables in Amazon Managed Service for Apache Flink Studio for further transformations (joins, aggregations, and so on). To set up and query Kinesis Data Streams using Amazon Managed Service for Apache Flink Studio, refer to Query your data streams interactively using Amazon Managed Service for Apache Flink Studio and Python and create an Amazon Managed Service for Apache Flink notebook. Then complete the following steps:

  1. In the Amazon Managed Service for Apache Flink Studio notebook, create a fact table from the facts data stream you created earlier, using the following query.

The event time attribute is defined using a WATERMARK statement in the CREATE table DDL. A WATERMARK statement defines a watermark generation expression on an existing event time field, which marks the event time field as the event time attribute.

The event time refers to the processing of streaming data based on timestamps that are attached to each row. The timestamps can encode when an event happened. Processing time (PROCTime) refers to the machine’s system time that is running the respective operation.

%flink.ssql
CREATE TABLE <Fact Table Name> (
AgentId INT,
OrgID INT,
QueueId BIGINT,
WorkMode VARCHAR,
WorkSkill VARCHAR,
HandleTime INT,
ServiceLevel VARCHAR,
WorkSates VARCHAR,
WaitingTime INT,
EventTime TIMESTAMP(3),
WATERMARK FOR EventTime AS EventTime - INTERVAL '4' SECOND
)
WITH (
'connector' = 'kinesis',
'stream' = '<fact stream name>',
'aws.region' = '<AWS region ex. us-east-1>',
'scan.stream.initpos' = 'LATEST',
'format' = 'json',
'json.timestamp-format.standard' = 'ISO-8601'
);

  2. Create a dimension table in the Amazon Managed Service for Apache Flink Studio notebook that uses the metadata Kinesis data stream:
    %flink.ssql
    CREATE TABLE <Metadata Table Name> (
    AgentId INT,
    AgentName VARCHAR,
    OrgID INT,
    OrgName VARCHAR,
    update_time as CURRENT_TIMESTAMP,
    WATERMARK FOR update_time AS update_time
    )
    WITH (
    'connector' = 'kinesis',
    'stream' = '<Metadata Stream Name>',
    'aws.region' = '<AWS region ex. us-east-1> ',
    'scan.stream.initpos' = 'LATEST',
    'format' = 'json',
    'json.timestamp-format.standard' = 'ISO-8601'
    );

  3. Create a versioned view to extract the latest version of metadata table values to be joined with the facts table:
    %flink.ssql(type=update)
    CREATE VIEW versioned_metadata AS 
    SELECT OrgID,OrgName
      FROM (
          SELECT *,
          ROW_NUMBER() OVER (PARTITION BY OrgID
             ORDER BY update_time DESC) AS rownum 
          FROM <Metadata Table>)
    WHERE rownum = 1;

  4. Join the facts and versioned metadata tables on OrgID and create a view that provides the total calls in the queue for each organization over a span of 5 seconds, for further reporting. Additionally, create a tumble window of 5 seconds to receive the final output every 5 seconds. See the following code:
    %flink.ssql(type=update)
    
    CREATE VIEW joined_view AS
    SELECT streamtable.window_start, streamtable.window_end,metadata.OrgName,streamtable.CallsInQueue
    FROM
    (SELECT window_start, window_end, OrgID, count(QueueId) as CallsInQueue
      FROM TABLE(
        TUMBLE( TABLE <Fact Table Name>, DESCRIPTOR(EventTime), INTERVAL '5' SECOND))
      GROUP BY window_start, window_end, OrgID) as streamtable
    JOIN
        versioned_metadata metadata
        ON metadata.OrgID = streamtable.OrgID

  5. Now you can run the following query from the view you created and see the results in the notebook:
%flink.ssql(type=update)

SELECT jv.window_end, jv.CallsInQueue, jv.window_start, jv.OrgName
FROM joined_view jv WHERE jv.OrgName = 'Prem Hub';

Visualize KPIs of call center performance in near-real time through OpenSearch Dashboards

You can publish the metrics generated within Amazon Managed Service for Apache Flink Studio to OpenSearch Service and visualize metrics in near-real time by creating a call center performance dashboard, as shown in the following example. Refer to Stream the data and validate output to configure OpenSearch Dashboards with Amazon Managed Service for Apache Flink. After you configure the connector, you can run the following command from the notebook to create an index in an OpenSearch Service cluster.

%flink.ssql(type=update)

drop table if exists active_call_queue;
CREATE TABLE active_call_queue (
window_start TIMESTAMP,
window_end TIMESTAMP,
OrgID int,
OrgName varchar,
CallsInQueue BIGINT,
Agent_cnt bigint,
max_handle_time bigint,
min_handle_time bigint,
max_wait_time bigint,
min_wait_time bigint
) WITH (
'connector' = 'elasticsearch-7',
'hosts' = '<Amazon OpenSearch host name>',
'index' = 'active_call_queue',
'username' = '<username>',
'password' = '<password>'
);

The active_call_queue index is created in OpenSearch Service. The following screenshot shows an index pattern from OpenSearch Dashboards.

Now you can create visualizations in OpenSearch Dashboards. The following screenshot shows an example.

Conclusion

In this post, we discussed ways to modernize a legacy, on-premises, real-time analytics architecture and build a serverless data analytics solution on AWS using Amazon Managed Service for Apache Flink. We also discussed challenges with relational databases when used for real-time analytics and ways to mitigate them by modernizing the architecture with serverless AWS solutions.

If you have any questions or suggestions, please leave us a comment.


About the Authors

Bhupesh Sharma is a Senior Data Engineer with AWS. His role is helping customers architect highly-available, high-performance, and cost-effective data analytics solutions to empower customers with data-driven decision-making. In his free time, he enjoys playing musical instruments, road biking, and swimming.

Devika Singh is a Senior Data Engineer at Amazon, with deep understanding of AWS services, architecture, and cloud-based best practices. With her expertise in data analytics and AWS cloud migrations, she provides technical guidance to the community, on architecting and building a cost-effective, secure and scalable solution, that is driven by business needs. Beyond her professional pursuits, she is passionate about classical music and travel.

Using Generative AI, Amazon Bedrock and Amazon CodeGuru to Improve Code Quality and Security

Post Syndicated from Marcilio Mendonca original https://aws.amazon.com/blogs/devops/using-generative-ai-amazon-bedrock-and-amazon-codeguru-to-improve-code-quality-and-security/

Automated code analysis plays a key role in improving code quality and compliance. Amazon CodeGuru Reviewer provides automated recommendations that can assist developers in identifying defects and deviation from coding best practices. For instance, CodeGuru Security automatically flags potential security vulnerabilities such as SQL injection, hardcoded AWS credentials and cross-site request forgery, to name a few. After becoming aware of these findings, developers can take decisive action to remediate their code.

On the other hand, determining what the best course of action is to address a particular automated recommendation might not always be obvious. For instance, an apprentice developer may not fully grasp what a SQL injection attack means or what makes the code at hand particularly vulnerable. In another situation, the developer reviewing a CodeGuru recommendation might not be the same developer who wrote the initial code. In these cases, the developer will first need to get familiarized with the code and the recommendation in order to take proper corrective action.

By using Generative AI, developers can leverage pre-trained foundation models to gain insights on their code’s structure, the CodeGuru Reviewer recommendation and the potential corrective actions. For example, Generative AI models can generate text content, e.g., to explain a technical concept such as SQL injection attacks or the correct use of a given library. Once the recommendation is well understood, the Generative AI model can be used to refactor the original code so that it complies with the recommendation. The possibilities opened up by Generative AI are numerous when it comes to improving code quality and security.

In this post, we will show how you can use CodeGuru Reviewer and Bedrock to improve the quality and security of your code. While CodeGuru Reviewer can provide automated code analysis and recommendations, Bedrock offers a low-friction environment that enables you to gain insights on the CodeGuru recommendations and to find creative ways to remediate your code.

Solution Overview

The diagram below depicts our approach and the AWS services involved. It works as follows:

1. The developer pushes code to an AWS CodeCommit repository.
2. The repository is associated with CodeGuru Reviewer, so an automated code review is initiated.
3. Upon completion, the CodeGuru Reviewer console displays a list of recommendations for the code base, if applicable.
4. Once aware of the recommendation and the affected code, the developer navigates to the Bedrock console, chooses a foundation model and builds a prompt (we will give examples of prompts in the next section).
5. Bedrock generates content as a response to the prompt, including code generation.
6. The developer might optionally refine the prompt, for example, to gain further insights on the CodeGuru Reviewer recommendation or to request for alternatives to remediate the code.
7. The model can respond with generated code that addresses the issue which can then be pushed back into the repository.

CodeCommit, CodeGuru and Bedrock used together

Note that we use CodeCommit in our walkthrough but readers can use any Git sources supported by CodeGuru Reviewer.

Using Generative AI to Improve Code Quality and Security

Next, we’re going to walk you through a scenario where a developer needs to improve the quality of her code after CodeGuru Reviewer has provided recommendations. But before getting there, let’s choose a code repository and set the Bedrock inference parameters.

A good reference of source repository for exploring CodeGuru Reviewer recommendations is the Amazon CodeGuru Reviewer Python Detector repository. The repository contains a comprehensive list of compliant and non-compliant code which fits well in the context of our discussion.

In terms of the Bedrock model, we use Anthropic Claude V1 (v1.3) in our analysis, which is specialized in content generation, including text and code. We set the required model parameters as follows: temperature=0.5, top_p=0.9, top_k=500, max_tokens=2048. We set the temperature and top_p parameters so as to give the model a bit more flexibility to generate responses for the same question. Please check the inference parameter definitions in Bedrock’s user guide for further details on these parameters. Given the randomness level specified by our inference parameters, readers experimenting with the prompts provided in this post might observe slightly different answers than the ones presented.
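
We use the Bedrock text playground in the console throughout this walkthrough, but the same inference parameters can be supplied programmatically. The following is a minimal boto3 sketch, assuming the legacy Claude text-completion request format (where the maximum-tokens setting is named max_tokens_to_sample); the prompt text is only a placeholder:

import json
import boto3

bedrock = boto3.client("bedrock-runtime")

# Claude v1 text-completion request body with the inference parameters used in this post
body = {
    "prompt": "\n\nHuman: Explain what a SQL injection attack is.\n\nAssistant:",
    "temperature": 0.5,
    "top_p": 0.9,
    "top_k": 500,
    "max_tokens_to_sample": 2048,
}

response = bedrock.invoke_model(
    modelId="anthropic.claude-v1",
    contentType="application/json",
    accept="application/json",
    body=json.dumps(body),
)

print(json.loads(response["body"].read())["completion"])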

Requirements

  • An AWS account with access to CodeCommit, CodeGuru and Bedrock
  • Bedrock access enabled in the account. On-demand access should be fine (check pricing here).
  • Download and install the AWS CLI and Git (to push code to CodeCommit)

Walkthrough

Follow the steps below to run CodeGuru Reviewer analysis on a repository and to build and run Bedrock prompts.

  • Clone the repository from GitHub to your local workstation
git clone https://github.com/aws-samples/amazon-codeguru-reviewer-python-detectors.git
  • Create a CodeCommit repository and add a new Git remote
aws codecommit create-repository --repository-name amazon-codeguru-reviewer-python-detectors

cd amazon-codeguru-reviewer-python-detectors/

git remote add codecommit https://git-codecommit.us-east-1.amazonaws.com/v1/repos/amazon-codeguru-reviewer-python-detectors
  • Associate CodeGuru Reviewer with the repository to enable repository analysis
aws codeguru-reviewer associate-repository --repository 'CodeCommit={Name=amazon-codeguru-reviewer-python-detectors}'

Save the association ARN value returned after the command is executed (e.g., arn:aws:codeguru-reviewer:xx-xxxx-x:111111111111:association:e85aa20c-41d76-03b-f788-cefd0d2a3590).

  • Push code to the CodeCommit repository using the codecommit git remote
git push codecommit main:main
  • Trigger CodeGuru Reviewer to run a repository analysis on the repository’s main branch. Use the repository association ARN you noted in a previous step here.
aws codeguru-reviewer create-code-review \
 --name codereview001 \
 --type '{"RepositoryAnalysis": {"RepositoryHead": {"BranchName": "main"}}}' \
 --repository-association-arn arn:aws:codeguru-reviewer:xx-xxxx-x:111111111111:association:e85aa20c-41d76-03b-f788-cefd0d2a3590

Navigate to the CodeGuru Reviewer Console to see the various recommendations provided (you might have to wait a few minutes for the code analysis to run).
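
If you prefer to check the analysis from a script rather than the console, a short boto3 sketch like the following polls the review status and prints the findings; the code review ARN placeholder is the ARN returned by the create-code-review command:

import boto3

codeguru = boto3.client("codeguru-reviewer")

# Placeholder: use the CodeReviewArn returned by the create-code-review command
code_review_arn = "<code-review-arn>"

review = codeguru.describe_code_review(CodeReviewArn=code_review_arn)["CodeReview"]
print("Code review state:", review["State"])  # Pending, Completed, Failed, or Deleting

if review["State"] == "Completed":
    summaries = codeguru.list_recommendations(CodeReviewArn=code_review_arn)["RecommendationSummaries"]
    for rec in summaries:
        print(rec["FilePath"], rec["StartLine"], rec["Description"][:80])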

Amazon CodeGuru Reviewer

  • On the CodeGuru Reviewer console (see screenshot above), we select the first recommendation on file hashlib_contructor.py, line 12, and take note of the recommendation content: The constructors for the hashlib module are faster than new(). We recommend using hashlib.sha256() instead.
  • Now let’s extract the affected code. Click on the file name link (hashlib_contructor.py in the figure above) to open the corresponding code in the CodeCommit console.
AWS CodeCommit Repository

  • The blue arrow in the CodeCommit console above indicates the non-compliant code, highlighting the specific line (line 12). We select the wrapping Python function from lines 5 through 15 to build our prompt. You may want to experiment with reducing the scope to a single line or a given block of lines and check whether it yields better responses.
Amazon Bedrock Playground Console

  • We then navigate to the Bedrock console (see screenshot above).
    • Search for keyword Bedrock in the AWS console
    • Select the Bedrock service to navigate to the service console
    • Choose Playgrounds, then choose Text
    • Choose model Anthropic Claude V1 (1.3). If you don’t see this model available, please make sure to enable model access.
  • Set the Inference configuration as shown in the screenshot below including temperature, Top P and the other parameters. Please check the inference parameter definitions on Bedrock’s user guide for further details on these parameters.
  • Build a Bedrock prompt using three elements, as illustrated in the screenshot below:
    • The source code copied from CodeCommit
    • The CodeGuru Reviewer recommendation
    • A request to refactor the code to address the code analysis finding
A Prompt in the Amazon Bedrock Playground Console

  • Press the Run button. Notice that Bedrock will automatically add the words Human (at the top) and Assistant (at the bottom) to the prompt.  Wait a few seconds and a response is generated (in green). The response includes the refactored code and an explanation on how the code was fixed (see screenshot below).
A Prompt Response (or completion) in the Amazon Bedrock Playground Console

Note that the original code was refactored to use hashlib.sha256() instead of calling new on the constructor: hashlib.new('sha256', …). Note that the prompt also asks for an explanation of how the refactored code fixes the issue, so the response includes such details. If we were interested in the refactored code only, we could change the prompt and ask that it return only the refactored code.
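
As a quick sanity check outside of Bedrock, you can confirm in a Python shell that the two constructor forms produce identical digests for the same input (a standalone illustration, not code from the sample repository):

import hashlib

data = b"example input"

# hashlib.new() with a name string and the named constructor produce identical digests;
# the named constructor is simply the faster, recommended form.
assert hashlib.new("sha256", data).hexdigest() == hashlib.sha256(data).hexdigest()
print(hashlib.sha256(data).hexdigest())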

Gaining Insights and Improving Code Quality

So far, our walkthrough showed how to navigate through the CodeCommit, CodeGuru Reviewer and Bedrock consoles to gather relevant content that can be used to build and run a Bedrock prompt. We are now going to explore further prompts to gain insights on the code and the recommendations in order to build more specific prompt to refactor the code. We omit screenshots for brevity. Readers can follow the same process outlined in the previous session.

Learning about the code

Let’s now examine another CodeGuru Reviewer recommendation. Specifically, we want to find the SQL Injection vulnerability raised for file, line 12, as shown below.

Sample non-compliant code

CodeGuru Reviewer provided the following recommendation:

CodeGuru Reviewer recommendation

For the sake of the walkthrough, let’s assume the developer reviewing the code above was not the original developer who built the code. We start by asking Bedrock to explain the code, line by line. Note that we have not included the CodeGuru Reviewer recommendation in the prompt this time as our initial goal is to understand the code.

Prompt:

Source Code:
def execute_query_noncompliant(request):
    import sqlite3
    name = request.GET.get("name")
    query = "SELECT * FROM Users WHERE name = " + name + ";" 
    with sqlite3.connect("example.db") as connection:
        cursor = connection.cursor()
        # Noncompliant: user input is used without sanitization.
        cursor.execute(query)
        connection.commit()
        connection.close()
Request:
Explain the code logic above line by line.

Response:

Sure, here is a line by line explanation of the code:
def execute_query_noncompliant(request):
This defines a function called execute_query_noncompliant that takes a request as an argument.
import sqlite3
This imports the sqlite3 library to connect to a SQLite database.
name = request.GET.get("name")
This gets a "name" query parameter from the request GET parameters.
query = "SELECT * FROM Users WHERE name = " + name + ";"
This constructs an SQL query string by concatenating the name variable into the query. 
So if name was "John", the query would be:
SELECT * FROM Users WHERE name = John;
with sqlite3.connect("example.db") as connection:
This opens a connection to the "example.db" SQLite database.
   cursor = connection.cursor()
This gets a cursor to execute queries on the connection.
    # Noncompliant: user input is used without sanitization.
    cursor.execute(query)
This executes the query on the database. The comment notes that this is noncompliant 
because it uses unsanitized user input in the query.
    connection.commit()
    connection.close()

Bedrock provides a line-by-line explanation of the code, including the library being used (sqlite3) and a comment about the code using unsanitized user input. The explanation is detailed, helping the developer understand the code logic and structure.

The developer now has a good understanding of the code. Let’s now gain insights on why and how the code is vulnerable to SQL Injection.

Prompt:

Source Code:
def execute_query_noncompliant(request):
    import sqlite3
    name = request.GET.get("name")
    query = "SELECT * FROM Users WHERE name = " + name + ";" 
    with sqlite3.connect("example.db") as connection:
        cursor = connection.cursor()
        # Noncompliant: user input is used without sanitization.
        cursor.execute(query)
        connection.commit()
        connection.close()
Request:
Why is the code above vulnerable to SQL injection?

Response:

The code is vulnerable to SQL injection because it directly concatenates 
user input (the name parameter) into an SQL query string. This allows a 
malicious user to inject SQL code into the query.
For example, if the name parameter is passed as:

'; DROP TABLE Users; -- 
Then the resulting SQL query would be:

SELECT * FROM Users WHERE name = ''; DROP TABLE Users; -- ;
This would drop the Users table, causing damage to the database.
To fix this, the user input should be sanitized before being added to the 
SQL query string. This can be done by:

Escaping single quotes in the input:
name = request.GET.get("name").replace("'", "''")
query = f"SELECT * FROM Users WHERE name = '{name}';"

Using parameterized queries:
query = "SELECT * FROM Users WHERE name = ?"
cursor.execute(query, (name,))

Again, the response is detailed. It explains that the user input is being concatenated directly into a query string. It gives an example of a SQL injection attack that could cause the Users table to be dropped. It also provides two options to fix the code including escaping single quotes and using parameterized queries.

We continue to dive deep into SQL injection. Let’s build a prompt to gather further examples of possible SQL injection attacks.

Prompt:

Source Code:
def execute_query_noncompliant(request):
    import sqlite3
    name = request.GET.get("name")
    query = "SELECT * FROM Users WHERE name = " + name + ";" 
    with sqlite3.connect("example.db") as connection:
        cursor = connection.cursor()
        # Noncompliant: user input is used without sanitization.
        cursor.execute(query)
        connection.commit()
        connection.close()
Request:
Provide examples of SQL injection for the code above

Response:

Here are some examples of SQL injection for that code:

If the name parameter is passed as:
'; DROP TABLE Users; -- 

Then the executed query would be:
SELECT * FROM Users WHERE name = ''; DROP TABLE Users; -- ;
This would drop the Users table.

If the name parameter is passed as:
OR 1=1; -- 

Then the executed query would be:
SELECT * FROM Users WHERE name = OR 1=1; -- ;
This would select all rows from the Users table.

If the name parameter is passed as:
'; UPDATE Users SET password = 'hacked' WHERE 1=1; -- 

Then the executed query would be:
SELECT * FROM Users WHERE name = ''; UPDATE Users SET password = 'hacked' WHERE 1=1; -- ;

This would update all passwords in the Users table to 'hacked'.
So in summary, without sanitizing the input, this code is vulnerable to SQL injection attacks 
that can delete data, access unauthorized data, modify data, etc. The input should be sanitized 
by escaping special characters before using in the SQL query.

Bedrock responds by including two new forms of SQL injection attacks, one that lists all records from the Users table and another one that updates all passwords in the table.

At this point, the developer has learned a lot about the code, SQL injection attacks and the CodeGuru Reviewer recommendation. It’s now time to build a prompt to fix the code. We leverage what we learned in previous prompts including the term “parameterized queries”, as suggested by CodeGuru Reviewer, to build a specific request in the prompt. Note that we have also included the CodeGuru Reviewer recommendation in the prompt this time.

Prompt:

Source Code:
def execute_query_noncompliant(request):
    import sqlite3
    name = request.GET.get("name")
    query = "SELECT * FROM Users WHERE name = " + name + ";" 
    with sqlite3.connect("example.db") as connection:
        cursor = connection.cursor()
        # Noncompliant: user input is used without sanitization.
        cursor.execute(query)
        connection.commit()
        connection.close()

Code Analysis Finding: 
It seems that you are constructing an SQL query using formatted strings. This could 
make your code vulnerable to SQL injection attacks, which can lead to loss of critical 
data. We recommend that you use parameterized queries for constructing SQL queries.  
[Learn more](https://cwe.mitre.org/data/definitions/89.html)
Request:
Fix the code above using parameterized query.

Response:

Here is the fixed code using parameterized query:

def execute_query_compliant(request):
   import sqlite3
   name = request.GET.get("name")
   query = "SELECT * FROM Users WHERE name = ?" 
   with sqlite3.connect("example.db") as connection:
      cursor = connection.cursor()
      # Compliant: Parameterized query prevents SQL injection
      cursor.execute(query, (name,)) 
      connection.commit()
      connection.close()

Instead of concatenating the user input directly into the SQL string, we use a ? 
placeholder and pass the user input as a tuple argument to cursor.execute(). This 
prevents SQL injection since the input is treated as a literal value rather than 
part of the SQL syntax.

The response includes a refactored code that uses a parameterized query to prevent SQL injection attacks (cursor.execute(query, (name,)).


We’re excited about the improvements in the code and now decide to go beyond fixing the SQL injection vulnerability. The next prompt asks Bedrock to make two specific improvements to the refactored code:

  • Add python type hints
  • Add exception handling

Prompt:

Source Code:
def execute_query_compliant(request):
   import sqlite3
   name = request.GET.get("name")
   query = "SELECT * FROM Users WHERE name = ?" 
   with sqlite3.connect("example.db") as connection:
      cursor = connection.cursor()
      # Compliant: Parameterized query prevents SQL injection
      cursor.execute(query, (name,)) 
      connection.commit()
      connection.close()

Request: 
Add exception handling and type hints to the code above.

Response:

Here is the code with exception handling and type hints:

def execute_query_compliant(request: Request) -> None: 
   """Executes a compliant SQL query"""
   import sqlite3
   name = request.GET.get("name")
   query = "SELECT * FROM Users WHERE name = ?" 
   try:
      with sqlite3.connect("example.db") as connection: 
          cursor: sqlite3.Cursor = connection.cursor()
          # Compliant: Parameterized query prevents SQL injection
          cursor.execute(query, (name,)) 
          connection.commit()
   except sqlite3.Error as e: 
      print(f"SQLite error: {e}")
   finally: 
      connection.close()

I added:

Type hints:
request: Request indicating the request parameter is of type Request

cursor: sqlite3.Cursor indicating cursor is of type sqlite3.Cursor
Exception handling:
A try/except block catching sqlite3.Error exceptions. 
A finally block ensuring the connection is always closed. 
Let me know if you would like me to explain anything in the updated code!

Note that the generated code now includes a try/except/finally block to handle exceptions and type hints for the function parameter and return value.

However, some variables such as name and query are missing type hints. Here, we’d like to call out the importance of human supervision of automatically generated responses. The developer tweaks the previous prompt to ask that “all variables” have type hints added to them.

Prompt:

Source Code:
def execute_query_compliant(request):
   import sqlite3
   name = request.GET.get("name")
   query = "SELECT * FROM Users WHERE name = ?" 
   with sqlite3.connect("example.db") as connection:
      cursor = connection.cursor()
      # Compliant: Parameterized query prevents SQL injection
      cursor.execute(query, (name,)) 
      connection.commit()
      connection.close()

Request: 
Add exception handling and type hints to all variables in the code above.

Response:

Here is the code with exception handling and type hints:

def execute_query_compliant(request: Request) -> None: 
   """Executes a compliant SQL query"""
   import sqlite3
   name: str = request.GET.get("name")
   query: str = "SELECT * FROM Users WHERE name = ?" 
   try:
      with sqlite3.connect("example.db") as connection: 
         cursor: sqlite3.Cursor = connection.cursor()
         # Compliant: Parameterized query prevents SQL injection
         cursor.execute(query, (name,)) 
         connection.commit()
   except sqlite3.Error as e: 
        print(f"SQLite error: {e}")
   finally: 
        connection.close()

The new response refactors the code, making sure all variables now have type hints. This shows how small changes to a prompt can produce much better results. In our example, adding “all variables” to the prompt caused the response to indeed add type hints to all variables in the code provided.
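Even this final output benefits from human review: the Request type is referenced without an import, and the finally block would fail with an unbound variable error if sqlite3.connect itself raised before connection was assigned. The following is a minimal, manually reviewed variant; it is our own illustration rather than Bedrock output, and it assumes the same Django-style request object and example.db schema used in the examples above.

import sqlite3
from typing import Any, List, Optional, Tuple


def execute_query_reviewed(request: Any) -> Optional[List[Tuple]]:
   """Runs the parameterized query and returns the matching rows, or None on error."""
   name: str = request.GET.get("name", "")
   query: str = "SELECT * FROM Users WHERE name = ?"
   connection: Optional[sqlite3.Connection] = None
   try:
      connection = sqlite3.connect("example.db")
      cursor: sqlite3.Cursor = connection.cursor()
      # Parameterized query: the user input is bound as a literal value
      cursor.execute(query, (name,))
      rows: List[Tuple] = cursor.fetchall()
      # Commit kept for parity with the generated code; a plain SELECT does not require it
      connection.commit()
      return rows
   except sqlite3.Error as error:
      print(f"SQLite error: {error}")
      return None
   finally:
      # Only close a connection that was actually opened
      if connection is not None:
         connection.close()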

Here is a summary of the activities performed via Bedrock prompting:

  • Gain insights on the code and the CodeGuru recommendation
    • Explain the code logic above line by line.
    • Why is the code above vulnerable to SQL injection?
    • Provide examples of SQL injection for the code above
  • Refactor and Improve the Code
    • Fix the code above using parameterized query
    • Add exception handling and type hints to the code above
    • Add exception handling and type hints to all variables in the code above.

The main takeaway is that by using a static analysis and security testing tool such as CodeGuru Reviewer in combination with a Generative AI service such as Bedrock, developers can significantly improve their code toward best practices and enhanced security. In addition, more specific prompts normally yield better results, and that’s where CodeGuru Reviewer can be really helpful, because it gives developers hints and keywords that can be used to build powerful prompts.

Cleaning Up

Don’t forget to delete the CodeCommit repository created if you no longer need it.

aws codecommit delete-repository --repository-name amazon-codeguru-reviewer-python-detectors

Conclusion and Call to Action

In this blog, we discussed how CodeGuru Reviewer and Bedrock can be used in combination to improve code quality and security. While CodeGuru Reviewer provides a rich set of recommendations through automated code reviews, Bedrock gives developers the ability to gain deeper insights on the code and the recommendations as well as to refactor the original code to meet compliance and best practices.

We encourage readers to explore new Bedrock prompts beyond the ones introduced in this post and share their feedback with us.

Here are some ideas:

For a sample Python repository, we recommend using the Amazon CodeGuru Reviewer Python Detectors repository on GitHub, which is publicly accessible to readers.

For Java developers, there’s an equivalent CodeGuru Reviewer Java Detectors repository available.

Note: At the time of writing this post, Bedrock’s Anthropic Claude 2.0 model was not yet available, so we invite readers to also experiment with the prompts provided here using that model.

Special thanks to my colleagues Raghvender Arni and Mahesh Yadav for support and review of this post.

Marcilio Mendonca

Marcilio Mendonca is a Sr. Solutions Developer in the Prototyping And Customer Engineering (PACE) team at Amazon Web Services. He is passionate about helping customers rethink and reinvent their business through the art of prototyping, primarily in the realm of modern application development, Serverless and AI/ML. Prior to joining AWS, Marcilio was a Software Development Engineer with Amazon. He also holds a PhD in Computer Science. You can find Marcilio on LinkedIn at https://www.linkedin.com/in/marcilio/. Let’s connect!

Blue/Green deployments using AWS CDK Pipelines and AWS CodeDeploy

Post Syndicated from Luiz Decaro original https://aws.amazon.com/blogs/devops/blue-green-deployments-using-aws-cdk-pipelines-and-aws-codedeploy/

Customers often ask for help with implementing Blue/Green deployments to Amazon Elastic Container Service (Amazon ECS) using AWS CodeDeploy. Their use cases usually involve cross-Region and cross-account deployment scenarios. These requirements are challenging enough on their own, but in addition to those, there are specific design decisions that need to be considered when using CodeDeploy. These include how to configure CodeDeploy, when and how to create CodeDeploy resources (such as Application and Deployment Group), and how to write code that can be used to deploy to any combination of account and Region.

Today, I will discuss those design decisions in detail and how to use CDK Pipelines to implement a self-mutating pipeline that deploys services to Amazon ECS in cross-account and cross-Region scenarios. At the end of this blog post, I also introduce a demo application, available in Java, that follows best practices for developing and deploying cloud infrastructure using AWS Cloud Development Kit (AWS CDK).

The Pipeline

CDK Pipelines is an opinionated construct library used for building pipelines with different deployment engines. It abstracts implementation details that developers or infrastructure engineers need to solve when implementing a cross-Region or cross-account pipeline. For example, in cross-Region scenarios, AWS CloudFormation needs artifacts to be replicated to the target Region. For that reason, AWS Key Management Service (AWS KMS) keys, an Amazon Simple Storage Service (Amazon S3) bucket, and policies need to be created for the secondary Region. This enables artifacts to be moved from one Region to another. In cross-account scenarios, CodeDeploy requires a cross-account role with access to the KMS key used to encrypt configuration files. This is the sort of detail that our customers want to avoid dealing with manually.

AWS CodeDeploy is a deployment service that automates application deployment across different scenarios. It deploys to Amazon EC2 instances, on-premises instances, serverless Lambda functions, or Amazon ECS services. It integrates with AWS Identity and Access Management (AWS IAM) to control who can deploy or re-deploy old versions of an application. In the Blue/Green deployment type, it is possible to automate the rollback of a deployment using Amazon CloudWatch Alarms.

CDK Pipelines was designed to automate AWS CloudFormation deployments. Using AWS CDK, these CloudFormation deployments may include deploying application software to instances or containers. However, some customers prefer using CodeDeploy to deploy application software. In this blog post, CDK Pipelines will deploy using CodeDeploy instead of CloudFormation.

A pipeline built with CDK Pipelines that deploys to Amazon ECS using AWS CodeDeploy. It contains at least 5 stages: Source, Build, UpdatePipeline, Assets, and at least one Deployment stage.

Design Considerations

In this post, I’m considering the use of CDK Pipelines to implement different use cases for deploying a service to any combination of accounts (single-account & cross-account) and regions (single-Region & cross-Region) using CodeDeploy. More specifically, there are four problems that need to be solved, which the following sections address:

CodeDeploy Configuration

The most popular options for implementing a Blue/Green deployment type using CodeDeploy are using CloudFormation Hooks or using a CodeDeploy construct. I decided to operate CodeDeploy using its configuration files. This is a flexible design that doesn’t rely on using custom resources, which is another technique customers have used to solve this problem. On each run, a pipeline pushes a container to a repository on Amazon Elastic Container Registry (ECR) and creates a tag. CodeDeploy needs that information to deploy the container.

I recommend creating a pipeline action to scan the AWS CDK cloud assembly and retrieve the repository and tag information. The same action can create the CodeDeploy configuration files. Three configuration files are required to configure CodeDeploy: appspec.yaml, taskdef.json and imageDetail.json. This pipeline action should be executed before the CodeDeploy deployment action. I recommend creating template files for appspec.yaml and taskdef.json. The following script can be used to implement the pipeline action:

#!/bin/sh
##
#
# Action Configure AWS CodeDeploy
# It customizes the files template-appspec.yaml and template-taskdef.json to the environment
#
# Account = The target Account Id
# AppName = Name of the application
# StageName = Name of the stage
# Region = Name of the region (us-east-1, us-east-2)
# PipelineId = Id of the pipeline
# ServiceName = Name of the service. It will be used to define the role and the task definition name
#
# Primary output directory is codedeploy/. All 3 files created (appspec.yaml, imageDetail.json and
# taskdef.json) will be located inside the codedeploy/ directory
#
##
Account=$1
Region=$2
AppName=$3
StageName=$4
PipelineId=$5
ServiceName=$6
repo_name=$(cat assembly*$PipelineId-$StageName/*.assets.json | jq -r '.dockerImages[] | .destinations[] | .repositoryName' | head -1) 
tag_name=$(cat assembly*$PipelineId-$StageName/*.assets.json | jq -r '.dockerImages | to_entries[0].key')  
echo ${repo_name} 
echo ${tag_name} 
printf '{"ImageURI":"%s"}' "$Account.dkr.ecr.$Region.amazonaws.com/${repo_name}:${tag_name}" > codedeploy/imageDetail.json                     
sed 's#APPLICATION#'$AppName'#g' codedeploy/template-appspec.yaml > codedeploy/appspec.yaml 
sed 's#APPLICATION#'$AppName'#g' codedeploy/template-taskdef.json | sed 's#TASK_EXEC_ROLE#arn:aws:iam::'$Account':role/'$ServiceName'#g' | sed 's#fargate-task-definition#'$ServiceName'#g' > codedeploy/taskdef.json 
cat codedeploy/appspec.yaml
cat codedeploy/taskdef.json
cat codedeploy/imageDetail.json

Using a Toolchain

A good strategy is to encapsulate the pipeline inside a Toolchain to abstract how to deploy to different accounts and regions. This helps decouple clients from details such as how the pipeline is created, how CodeDeploy is configured, and how cross-account and cross-Region deployments are implemented. To create the pipeline, deploy a Toolchain stack. Out of the box, it allows different environments to be added as needed. Depending on the requirements, the pipeline may be customized to reflect the different stages or waves that different components might require. For more information, please refer to our best practices on how to automate safe, hands-off deployments and its reference implementation.

In detail, the Toolchain stack follows the builder pattern used throughout the CDK for Java. This is a convenience that allows complex objects to be created using a single statement:

 Toolchain.Builder.create(app, Constants.APP_NAME+"Toolchain")
        .stackProperties(StackProps.builder()
                .env(Environment.builder()
                        .account(Demo.TOOLCHAIN_ACCOUNT)
                        .region(Demo.TOOLCHAIN_REGION)
                        .build())
                .build())
        .setGitRepo(Demo.CODECOMMIT_REPO)
        .setGitBranch(Demo.CODECOMMIT_BRANCH)
        .addStage(
                "UAT",
                EcsDeploymentConfig.CANARY_10_PERCENT_5_MINUTES,
                Environment.builder()
                        .account(Demo.SERVICE_ACCOUNT)
                        .region(Demo.SERVICE_REGION)
                        .build())                                                                                                             
        .build();

In the statement above, the continuous deployment pipeline is created in the TOOLCHAIN_ACCOUNT and TOOLCHAIN_REGION. It implements a stage that builds the source code and creates the Java archive (JAR) using Apache Maven.  The pipeline then creates a Docker image containing the JAR file.

The UAT stage will deploy the service to the SERVICE_ACCOUNT and SERVICE_REGION using the deployment configuration CANARY_10_PERCENT_5_MINUTES. This means 10 percent of the traffic is shifted in the first increment and the remaining 90 percent is deployed 5 minutes later.

To create additional deployment stages, you need a stage name, a CodeDeploy deployment configuration and an environment where it should deploy the service. As mentioned, the pipeline is, by default, a self-mutating pipeline. For example, to add a Prod stage, update the code that creates the Toolchain object and submit this change to the code repository. The pipeline will run and update itself adding a Prod stage after the UAT stage. Next, I show in detail the statement used to add a new Prod stage. The new stage deploys to the same account and Region as in the UAT environment:

... 
        .addStage(
                "Prod",
                EcsDeploymentConfig.CANARY_10_PERCENT_5_MINUTES,
                Environment.builder()
                        .account(Demo.SERVICE_ACCOUNT)
                        .region(Demo.SERVICE_REGION)
                        .build())                                                                                                                                      
        .build();

In the statement above, the Prod stage will deploy new versions of the service using the CodeDeploy deployment configuration CANARY_10_PERCENT_5_MINUTES. This means that 10 percent of traffic is shifted in the first increment, and the remaining 90 percent is shifted to the new version of the application 5 minutes later. Please refer to the Organizing Your AWS Environment Using Multiple Accounts whitepaper for best practices on how to isolate and manage your business applications.

Some customers might find this approach interesting and decide to provide this as an abstraction to their application development teams. In this case, I advise creating a construct that builds such a pipeline. Using a construct would allow for further customization. Examples are stages that promote quality assurance or deploy the service in a disaster recovery scenario.

The implementation creates a stack for the toolchain and another stack for each deployment stage. As an example, consider a toolchain created with a single deployment stage named UAT. After running successfully, the DemoToolchain and DemoService-UAT stacks should be created as in the next image:

Two stacks are needed to create a Pipeline that deploys to a single environment. One stack deploys the Toolchain with the Pipeline and another stack deploys the Service compute infrastructure and CodeDeploy Application and DeploymentGroup. In this example, for an application named Demo that deploys to an environment named UAT, the stacks deployed are: DemoToolchain and DemoService-UAT

CodeDeploy Application and Deployment Group

CodeDeploy configuration requires an application and a deployment group. Depending on the use case, you need to create these in the same or in a different account from the toolchain (pipeline). The pipeline includes the CodeDeploy deployment action that performs the blue/green deployment. My recommendation is to create the CodeDeploy application and deployment group as part of the Service stack. This approach allows you to align the lifecycle of the CodeDeploy application and deployment group with that of the related Service stack instance.

CodePipeline allows you to create a CodeDeploy deployment action that references a CodeDeploy application and deployment group that don’t exist yet. This allows us to implement the following approach:

  • Toolchain stack deploys the pipeline with CodeDeploy deployment action referencing a non-existing CodeDeploy application and deployment group
  • When the pipeline executes, it first deploys the Service stack that creates the related CodeDeploy application and deployment group
  • The next pipeline action executes the CodeDeploy deployment action. When the pipeline executes the CodeDeploy deployment action, the related CodeDeploy application and deployment group will already exist.

Below is the pipeline code that references the (initially non-existing) CodeDeploy application and deployment group.

private IEcsDeploymentGroup referenceCodeDeployDeploymentGroup(
        final Environment env, 
        final String serviceName, 
        final IEcsDeploymentConfig ecsDeploymentConfig, 
        final String stageName) {

    IEcsApplication codeDeployApp = EcsApplication.fromEcsApplicationArn(
            this,
            Constants.APP_NAME + "EcsCodeDeployApp-"+stageName,
            Arn.format(ArnComponents.builder()
                    .arnFormat(ArnFormat.COLON_RESOURCE_NAME)
                    .partition("aws")
                    .region(env.getRegion())
                    .service("codedeploy")
                    .account(env.getAccount())
                    .resource("application")
                    .resourceName(serviceName)
                    .build()));

    IEcsDeploymentGroup deploymentGroup = EcsDeploymentGroup.fromEcsDeploymentGroupAttributes(
            this,
            Constants.APP_NAME + "-EcsCodeDeployDG-"+stageName,
            EcsDeploymentGroupAttributes.builder()
                    .deploymentGroupName(serviceName)
                    .application(codeDeployApp)
                    .deploymentConfig(ecsDeploymentConfig)
                    .build());

    return deploymentGroup;
}

To make this work, you should use the same application name and deployment group name values when creating the CodeDeploy deployment action in the pipeline and when creating the CodeDeploy application and deployment group in the Service stack (where the Amazon ECS infrastructure is deployed). This approach is necessary to avoid a circular dependency error when trying to create the CodeDeploy application and deployment group inside the Service stack and reference these objects to configure the CodeDeploy deployment action inside the pipeline. Below is the code that uses Service stack construct ID to name the CodeDeploy application and deployment group. I set the Service stack construct ID to the same name I used when creating the CodeDeploy deployment action in the pipeline.

   // configure AWS CodeDeploy Application and DeploymentGroup
   EcsApplication app = EcsApplication.Builder.create(this, "BlueGreenApplication")
           .applicationName(id)
           .build();

   EcsDeploymentGroup.Builder.create(this, "BlueGreenDeploymentGroup")
           .deploymentGroupName(id)
           .application(app)
           .service(albService.getService())
           .role(createCodeDeployExecutionRole(id))
           .blueGreenDeploymentConfig(EcsBlueGreenDeploymentConfig.builder()
                   .blueTargetGroup(albService.getTargetGroup())
                   .greenTargetGroup(tgGreen)
                   .listener(albService.getListener())
                   .testListener(listenerGreen)
                   .terminationWaitTime(Duration.minutes(15))
                   .build())
           .deploymentConfig(deploymentConfig)
           .build();

CDK Pipelines roles and permissions

CDK Pipelines creates roles and permissions the pipeline uses to execute deployments in different scenarios of regions and accounts. When using CodeDeploy in cross-account scenarios, CDK Pipelines deploys a cross-account support stack that creates a pipeline action role for the CodeDeploy action. This cross-account support stack is defined in a JSON file that needs to be published to the AWS CDK assets bucket in the target account. If the pipeline has the self-mutation feature on (default), the UpdatePipeline stage will do a cdk deploy to deploy changes to the pipeline. In cross-account scenarios, this deployment also involves deploying/updating the cross-account support stack. For this, the SelfMutate action in the UpdatePipeline stage needs to assume the CDK file-publishing and deploy roles in the remote account.

The IAM role associated with the AWS CodeBuild project that runs the UpdatePipeline stage does not have these permissions by default. CDK Pipelines cannot grant these permissions automatically, because the information about the permissions that the cross-account stack needs is only available after the AWS CDK app finishes synthesizing. At that point, the permissions that the pipeline has are already locked in. Hence, for cross-account scenarios, the toolchain should extend the permissions of the pipeline’s UpdatePipeline stage to include the file-publishing and deploy roles.

In cross-account environments it is possible to manually add these permissions to the UpdatePipeline stage. To accomplish that, the Toolchain stack may be used to hide this sort of implementation detail. In the end, a method like the one below can be used to add these missing permissions. For each mapping of stage and environment in the pipeline, it validates whether the target account is different from the account where the pipeline is deployed. When that criterion is met, it grants the UpdatePipeline stage permission to assume the CDK bootstrap roles (tagged with the key aws-cdk:bootstrap-role) in the target account (with the tag value file-publishing or deploy). The example below shows how to add permissions to the UpdatePipeline stage:

private void grantUpdatePipelineCrossAccoutPermissions(Map<String, Environment> stageNameEnvironment) {

    if (!stageNameEnvironment.isEmpty()) {

        this.pipeline.buildPipeline();
        for (String stage : stageNameEnvironment.keySet()) {

            HashMap<String, String[]> condition = new HashMap<>();
            condition.put(
                    "iam:ResourceTag/aws-cdk:bootstrap-role",
                    new String[] {"file-publishing", "deploy"});
            pipeline.getSelfMutationProject()
                    .getRole()
                    .addToPrincipalPolicy(PolicyStatement.Builder.create()
                            .actions(Arrays.asList("sts:AssumeRole"))
                            .effect(Effect.ALLOW)
                            .resources(Arrays.asList("arn:*:iam::"
                                    + stageNameEnvironment.get(stage).getAccount() + ":role/*"))
                            .conditions(new HashMap<String, Object>() {{
                                    put("ForAnyValue:StringEquals", condition);
                            }})
                            .build());
        }
    }
}

The Deployment Stage

Let’s consider a pipeline that has a single deployment stage, UAT. The UAT stage deploys a DemoService. For that, it requires four actions: DemoService-UAT.Prepare, DemoService-UAT.Deploy, ConfigureBlueGreenDeploy, and Deploy.

When using CodeDeploy, the deployment stage is expected to have four actions: two actions to create a CloudFormation change set and deploy the ECS or compute infrastructure, one action to configure CodeDeploy, and a final action that deploys the application using CodeDeploy. In the diagram, these are (in order): DemoService-UAT.Prepare, DemoService-UAT.Deploy, ConfigureBlueGreenDeploy, and Deploy.

The DemoService-UAT.Deploy action will create the ECS resources and the CodeDeploy application and deployment group. The ConfigureBlueGreenDeploy action will read the AWS CDK cloud assembly. It uses the configuration files to identify the Amazon Elastic Container Registry (Amazon ECR) repository and the container image tag pushed. The pipeline will send this information to the Deploy action. The Deploy action starts the deployment using CodeDeploy.
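For readers who want to see what that final step looks like outside of the pipeline, the sketch below shows how a CodeDeploy ECS blue/green deployment can be started programmatically with the AWS SDK for Python (Boto3). It is an illustration only, not code from the demo application; the application name, deployment group name, and appspec.yaml path are assumptions that mirror the configuration files generated earlier.

import boto3

# Assumed values that mirror the earlier configuration step
APPLICATION_NAME = "DemoService-UAT"
DEPLOYMENT_GROUP_NAME = "DemoService-UAT"


def start_blue_green_deployment() -> str:
    """Starts a CodeDeploy ECS blue/green deployment from a rendered appspec.yaml."""
    with open("codedeploy/appspec.yaml") as appspec_file:
        appspec_content = appspec_file.read()

    codedeploy = boto3.client("codedeploy")
    response = codedeploy.create_deployment(
        applicationName=APPLICATION_NAME,
        deploymentGroupName=DEPLOYMENT_GROUP_NAME,
        revision={
            "revisionType": "AppSpecContent",
            "appSpecContent": {"content": appspec_content},
        },
    )
    return response["deploymentId"]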

Solution Overview

As a convenience, I created an application, written in Java, that solves all these challenges and can be used as an example. The application deployment follows the same 5 steps for all deployment scenarios of account and Region, and this includes the scenarios represented in the following design:

A pipeline created by a Toolchain should be able to deploy to any combination of accounts and regions. This includes four scenarios: single-account and single-Region, single-account and cross-Region, cross-account and single-Region and cross-account and cross-Region

Conclusion

In this post, I identified, explained and solved challenges associated with the creation of a pipeline that deploys a service to Amazon ECS using CodeDeploy in different combinations of accounts and regions. I also introduced a demo application that implements these recommendations. The sample code can be extended to implement more elaborate scenarios. These scenarios might include automated testing, automated deployment rollbacks, or disaster recovery. I wish you success in your transformative journey.

Luiz Decaro

Luiz is a Principal Solutions Architect at Amazon Web Services (AWS). He focuses on helping customers from the Financial Services Industry succeed in the cloud. Luiz holds a master’s in software engineering and he triggered his first continuous deployment pipeline in 2005.

Use SAML with Amazon Cognito to support a multi-tenant application with a single user pool

Post Syndicated from Neela Kulkarni original https://aws.amazon.com/blogs/security/use-saml-with-amazon-cognito-to-support-a-multi-tenant-application-with-a-single-user-pool/

Amazon Cognito is a customer identity and access management solution that scales to millions of users. With Cognito, you have four ways to secure multi-tenant applications: user pools, application clients, groups, or custom attributes. In an earlier blog post titled Role-based access control using Amazon Cognito and an external identity provider, you learned how to configure Cognito authentication and authorization with a single tenant. In this post, you will learn to configure Cognito with a single user pool for multiple tenants to securely access a business-to-business application by using SAML custom attributes. With custom-attribute–based multi-tenancy, you can store tenant identification data like tenantName as a custom attribute in a user’s profile and pass it to your application. You can then handle multi-tenancy logic in your application and backend services. With this approach, you can use a unified sign-up and sign-in experience for your users. To identify the user’s tenant, your application can use the tenantName custom attribute.

One Cognito user pool for multiple customers

Customers like the simplicity of using a single Cognito user pool for their multi-customer application. With this approach, your customers will use the same URL to access the application. You will set up each new customer by configuring SAML 2.0 integration with the customer’s external identity provider (IdP). Your customers can control access to your application by using an external identity store, such as Google Workspace, Okta, or Active Directory Federation Service (AD FS), in which they can create, manage, and revoke access for their users.

After SAML integration is configured, Cognito returns a JSON web token (JWT) to the frontend during the user authentication process. This JWT contains attributes your application can use for authorization and access control. The token contains claims about the identity of the authenticated user, such as name and email. You can use this identity information inside your application. You can also configure Cognito to add custom attributes to the JWT, such as tenantName.

In this post, we demonstrate the approach of keeping a mapping between a user’s email domain and tenant name in an Amazon DynamoDB table. The DynamoDB table will have an emailDomain field as a key and a corresponding tenantName field.

Cognito architecture

To illustrate how this works, we’ll start with a demo application that was introduced in the earlier blog post. The demo application is implemented by using Amazon Cognito, AWS Amplify, Amazon API Gateway, AWS Lambda, Amazon DynamoDB, Amazon Simple Storage Service (Amazon S3), and Amazon CloudFront to achieve a serverless architecture. This architecture is shown in Figure 1.

Figure 1: Demo application architecture

The workflow that happens when you access the web application for the first time using your browser is as follows (the numbered steps correspond to the numbered labels in the diagram):

  1. The client-side/frontend of the application prompts you to enter the email that you want to use to sign in to the application.
  2. The application invokes the Tenant Match API action through API Gateway, which, in turn, calls the Lambda function that takes the email address as an input and queries it against the DynamoDB table with the email domain. Figure 2 shows the data stored in DynamoDB, which includes the tenant name and IdP ID. You can add additional flexibility to this solution by adding web client IDs or custom redirect URLs. For the purpose of this example, we are using the same redirect URL for all tenants (the client application).
    Figure 2: DynamoDB tenant table

  3. If a matching record is found, the Lambda function returns the record to the AWS Amplify frontend application.
  4. The client application uses the IdP ID from the response and passes it to Cognito for federated login. Cognito then reroutes the login request to the corresponding IdP. The AWS Amplify frontend application then redirects the browser to the IdP.
  5. At the IdP sign-in page, you sign in with a valid user account (for example, [email protected] or [email protected]). After you sign in successfully, a SAML response is sent back from the IdP to Cognito.

    You can review the SAML content by using the instructions in How to view a SAML response in your browser for troubleshooting, as shown in Figure 3.

    Figure 3: SAML content

  6. Cognito handles the SAML response and maps the SAML attributes to a just-in-time user profile. The SAML groups attribute is mapped to a custom user pool attribute named custom:groups.
  7. To identify the tenant, additional attributes are populated in the JWT. After successful authentication, a PreTokenGeneration Lambda function is invoked, which reads the mapped custom:groups attribute value from SAML, parses it, and converts it to an array. After that, the function parses the email address and captures the domain name. It then queries the DynamoDB table for the tenantName by using the email domain name. Finally, the function sets the custom:domainName and custom:tenantName attributes in the JWT, as shown following.
    "email": "[email protected]" ( Standard existing profile attribute )
    New attributes:
    "cognito:groups": [.                           
    "pet-app-users",
    "pet-app-admin"
    ],
    "custom:tenantName": "AnyCompany"
    "custom:domainName": "anycompany.com"

    This attribute conversion is optional and demonstrates how you can use a PreTokenGeneration Lambda invocation to customize your JWT token claims, mapping the IdP groups to the attributes your application recognizes. You can also use this invocation to make additional authorization decisions. For example, if a user is a member of multiple groups, you may choose to map only one of them. A minimal sketch of such a trigger function is shown after this list.

  8. Amazon Cognito returns the JWT tokens to the AWS Amplify frontend application. The Amplify client library stores the tokens and handles refreshes. This token is used to make calls to protected APIs in Amazon API Gateway.
  9. API Gateway uses a Cognito user pools authorizer to validate the JWT’s signature and expiration. If this is successful, API Gateway passes the JWT to the application’s Lambda function (also referred to as the backend).
  10. The backend application code reads the cognito:groups claim from the JWT and decides if the action is allowed. If the user is a member of the right group, then the action is allowed; otherwise the action is denied.
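The deployed demo implements this trigger in index.js; to make the logic in step 7 concrete, here is a minimal Python sketch of an equivalent pre token generation trigger. The TenantTable name, its emailDomain and tenantName fields, and the custom:groups attribute come from this post, while the exact string format of custom:groups and the parsing shown here are assumptions.

import boto3

dynamodb = boto3.resource("dynamodb")
tenant_table = dynamodb.Table("TenantTable")  # table name used in this post


def lambda_handler(event, context):
    """Pre token generation trigger: maps SAML groups and adds tenant claims."""
    attributes = event["request"]["userAttributes"]
    email = attributes.get("email", "")
    domain = email.split("@")[-1].lower()

    # Assumed format: custom:groups arrives as a string such as "[pet-app-users, pet-app-admin]"
    raw_groups = attributes.get("custom:groups", "")
    groups = [g.strip() for g in raw_groups.strip("[]").split(",") if g.strip()]

    # Look up the tenant name by email domain
    item = tenant_table.get_item(Key={"emailDomain": domain}).get("Item", {})
    tenant_name = item.get("tenantName", "")

    event["response"]["claimsOverrideDetails"] = {
        "claimsToAddOrOverride": {
            "custom:tenantName": tenant_name,
            "custom:domainName": domain,
        },
        "groupOverrideDetails": {"groupsToOverride": groups},
    }
    return event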

Implement the solution

You can implement this example application by using an AWS CloudFormation template to provision your cloud application and AWS resources.

To deploy the demo application described in this post, you need the following prerequisites:

  1. An AWS account.
  2. Familiarity with navigating the AWS Management Console or AWS CLI.
  3. Familiarity with deploying CloudFormation templates.

To deploy the template

  • Choose the following Launch Stack button to launch a CloudFormation stack in your account.

    Note: The stack will launch in the N. Virginia (us-east-1) Region. To deploy this solution into other AWS Regions, download the solution’s CloudFormation template from GitHub, modify it, and deploy it to the selected Region.

The stack creates a Cognito user pool called ExternalIdPDemoPoolXXXX in the AWS Region that you have specified. The CloudFormation Outputs field contains a list of values that you will need for further configuration.

IdP configuration

The next step is to configure your IdP. Each IdP has its own procedure for configuration, but there are some common steps you need to follow.

To configure your IdP

  1. Provide the IdP with the values for the following two properties:
    • Single sign on URL / Assertion Consumer Service URL / ACS URL (for this example, https://<CognitoDomainURL>/saml2/idpresponse)
    • Audience URI / SP Entity ID / Entity ID: (For this example, urn:amazon:cognito:sp:<yourUserPoolID>)
  2. Configure the field mapping for the SAML response in the IdP. Map the first name, last name, email, and groups (as a multi-value attribute) into SAML response attributes with the names firstName, lastName, email, and groups, respectively.
    • Recommended: Filter the mapped groups to only those that are relevant to the application (for example, by a prefix filter). There is a 2,048-character limit on the custom attribute, so filtering helps avoid exceeding the character limit, and also helps avoid passing irrelevant information to the application.
  3. In each IdP, create two demo groups called pet-app-users and pet-app-admins, and create two demo users, for example, [email protected] and [email protected], and then assign one to each group, respectively.

To illustrate, we set up three different IdPs to represent three different tenants. Follow the configuration instructions for each IdP that you choose to integrate.

You will need the metadata URL or file from each IdP, because you will use this to configure your user pool integration. For more information, see Integrating third-party SAML identity providers with Amazon Cognito user pools.

Cognito configuration

After your IdPs are configured and your CloudFormation stack is deployed, you can configure Cognito.

To configure Cognito

  1. Use your browser to navigate to the Cognito console, and for User pool name, select the Cognito user pool.
    Figure 4: Select the Cognito user pool

  2. On the Sign-in experience screen, on the Federated identity provider sign-in tab, choose Add identity provider.
  3. Choose SAML for the sign-in option, and then enter the values for your IdP. You can either upload the metadata XML file or provide the metadata endpoint URL. Add mapping for the attributes as shown in Figure 5.
    Figure 5: Attribute mappings for the IdP

    Upon completion you will see the new IdP displayed as shown in Figure 6.

    Figure 6: List of federated IdPs

  4. On the App integration tab, select the app client that was created by the CloudFormation template.
    Figure 7: Select the app client

  5. Under Hosted UI, choose Edit. Under Identity providers, select the Identity Providers that you want to set up for federated login, and save the change.
    Figure 8: Select identity providers

API gateway

The example application uses a serverless backend. There are two API operations defined in this example, as shown in Figure 9. One operation gets tenant details and the other is the /pets API operation, which fetches information on pets based on user identity. The TenantMatch API operation will be run when you sign in with your email address. The operation passes your email address to the backend Lambda function.

Figure 9: Example APIs

Lambda functions

You will see three Lambda functions deployed in the example application, as shown in Figure 10.

Figure 10: Lambda functions

The first one is GetTenantInfo, which is used for the TenantMatch API operation. It reads the data from the TenantTable based on the email domain and passes the record back to the application. The second function is PreTokenGeneration, which reads the mapped custom:groups attribute value, parses it, converts it to an array, and then stores it in the cognito:groups claim. This function is invoked by the Cognito user pool after sign-in is successful. To customize the mapping, you can edit the Lambda function’s code in the index.js file and redeploy. The third Lambda function is added to support the Pets API operation.

DynamoDB tables

You will see three DynamoDB tables deployed in the example application, as shown in Figure 11.

Figure 11: DynamoDB tables

The TenantTable table holds the tenant details, where you must add the mapping between the customer domain and the IdP ID set up in Cognito. This approach can be expanded to add more flexibility in case you want to add custom redirect URLs or Cognito app IDs for each tenant. You must create entries to correspond to the IdPs you have configured, as shown in Figure 12.

Figure 12: Tenant IdP mappings table

In addition to TenantTable, there is the ExternalIdPDemo-ItemsTable table, which holds the data related to the Pets application, based on user identity. There is also ExternalIdPDemo-UsersTable, which holds user details like the username, last forced sign-out time, and TTL required for the application to manage the user session.

You can now sign in to the example application through each IdP by navigating to the application URL found in the CloudFormation Outputs section, as shown in Figure 13.

Figure 13: Cognito sign-in screen

You will be redirected to the IdP, as shown in Figure 14.

Figure 14: Google Workspace sign-in screen

The AWS Amplify frontend application parses the JWT to identify the tenant name and provide authorization based on group membership, as shown in Figure 15.

Figure 15: Application home screen upon successful sign-in

If a different user logs in with a different role, the AWS Amplify frontend application provides authorization based on specific content of the JWT.

Conclusion

You can integrate your application with your customer’s IdP of choice for authentication and authorization and map information from the IdP to the application. By using Amazon Cognito, you can normalize the structure of the JWT token that is used for this process, so that you can add multiple IdPs, each for a different tenant, through a single Cognito user pool. You can do all this without changing application code. The native integration of Amazon API Gateway with the Cognito user pools authorizer streamlines your validation of the JWT integrity, and after the JWT has been validated, you can use it to make authorization decisions in your application’s backend. By following the example in this post, you can focus on what differentiates your application, and let AWS do the undifferentiated heavy lifting of identity management for your customer-facing applications.

For the code examples described in this post, see the amazon-cognito-example-for-multi-tenant code repository on GitHub. To learn more about using Cognito with external IdPs, see the Amazon Cognito documentation. You can also learn to build software as a service (SaaS) application architectures on AWS. If you have any questions about Cognito or any other AWS services, you may post them to AWS re:Post.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Security, Identity, & Compliance re:Post or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Ray Zaman

A Principal Solutions Architect with AWS, Ray has over 30 years of experience helping customers in finance, healthcare, insurance, manufacturing, media, petrochemical, pharmaceutical, public utility, retail, semiconductor, telecommunications, and waste management industries build technology solutions.

Neela Kulkarni

Neela is a Solutions Architect with Amazon Web Services. She primarily serves independent software vendors in the Northeast US, providing architectural guidance and best practice recommendations for new and existing workloads. Outside of work, she enjoys traveling, swimming, and spending time with her family.

Yuri Duchovny

Yuri is a New York–based Solutions Architect specializing in cloud security, identity, and compliance. He supports cloud transformations at large enterprises, helping them make optimal technology and organizational decisions. Prior to his AWS role, Yuri’s areas of focus included application and networking security, DoS, and fraud protection. Outside of work, he enjoys skiing, sailing, and traveling the world.

Abdul Qadir

Abdul is an AWS Solutions Architect based in New Jersey. He works with independent software vendors in the Northeast US and helps customers build well-architected solutions on the AWS Cloud platform.

How AWS protects customers from DDoS events

Post Syndicated from Tom Scholl original https://aws.amazon.com/blogs/security/how-aws-protects-customers-from-ddos-events/

At Amazon Web Services (AWS), security is our top priority. Security is deeply embedded into our culture, processes, and systems; it permeates everything we do. What does this mean for you? We believe customers can benefit from learning more about what AWS is doing to prevent and mitigate customer-impacting security events.

Since late August 2023, AWS has detected and been protecting customer applications from a new type of distributed denial of service (DDoS) event. DDoS events attempt to disrupt the availability of a targeted system, such as a website or application, reducing the performance for legitimate users. Examples of DDoS events include HTTP request floods, reflection/amplification attacks, and packet floods. The DDoS events AWS detected were a type of HTTP/2 request flood, which occurs when a high volume of illegitimate web requests overwhelms a web server’s ability to respond to legitimate client requests.

Between August 28 and August 29, 2023, proactive monitoring by AWS detected an unusual spike in HTTP/2 requests to Amazon CloudFront, peaking at over 155 million requests per second (RPS). Within minutes, AWS determined the nature of this unusual activity and found that CloudFront had automatically mitigated a new type of HTTP request flood DDoS event, now called an HTTP/2 rapid reset attack. Over those two days, AWS observed and mitigated over a dozen HTTP/2 rapid reset events, and through the month of September, continued to see this new type of HTTP/2 request flood. AWS customers who had built DDoS-resilient architectures with services like Amazon CloudFront and AWS Shield were able to protect their applications’ availability.

Figure 1: Global HTTP requests per second, September 13 – 16

Overview of HTTP/2 rapid reset attacks

HTTP/2 allows for multiple distinct logical connections to be multiplexed over a single HTTP session. This is a change from HTTP 1.x, in which each HTTP session was logically distinct. HTTP/2 rapid reset attacks consist of multiple HTTP/2 connections with requests and resets in rapid succession. For example, a series of requests for multiple streams will be transmitted followed up by a reset for each of those requests. The targeted system will parse and act upon each request, generating logs for a request that is then reset, or cancelled, by a client. The system performs work generating those logs even though it doesn’t have to send any data back to a client. A bad actor can abuse this process by issuing a massive volume of HTTP/2 requests, which can overwhelm the targeted system, such as a website or application.

Keep in mind that HTTP/2 rapid reset attacks are just a new type of HTTP request flood. To defend against these sorts of DDoS attacks, you can implement an architecture that helps you specifically detect unwanted requests as well as scale to absorb and block those malicious HTTP requests.

Building DDoS resilient architectures

As an AWS customer, you benefit from both the security built into the global cloud infrastructure of AWS as well as our commitment to continuously improve the security, efficiency, and resiliency of AWS services. For prescriptive guidance on how to improve DDoS resiliency, AWS has built tools such as the AWS Best Practices for DDoS Resiliency. It describes a DDoS-resilient reference architecture as a guide to help you protect your application’s availability. While several built-in forms of DDoS mitigation are included automatically with AWS services, your DDoS resilience can be improved by using an AWS architecture with specific services and by implementing additional best practices for each part of the network flow between users and your application.

For example, you can use AWS services that operate from edge locations, such as Amazon CloudFront, AWS Shield, Amazon Route 53, and Route 53 Application Recovery Controller to build comprehensive availability protection against known infrastructure layer attacks. These services can improve the DDoS resilience of your application when serving any type of application traffic from edge locations distributed around the world. Your application can be on-premises or in AWS when you use these AWS services to help you prevent unnecessary requests reaching your origin servers. As a best practice, you can run your applications on AWS to get the additional benefit of reducing the exposure of your application endpoints to DDoS attacks and to protect your application’s availability and optimize the performance of your application for legitimate users. You can use Amazon CloudFront (and its HTTP caching capability), AWS WAF, and Shield Advanced automatic application layer protection to help prevent unnecessary requests reaching your origin during application layer DDoS attacks.
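To make the request-blocking idea concrete, the following is a minimal sketch (not an AWS-provided sample) of creating an AWS WAF web ACL with a rate-based rule using the AWS SDK for Python (Boto3). The web ACL name, rate limit, and metric names are assumptions you would tune for your own traffic profile.

import boto3

# CLOUDFRONT-scoped web ACLs must be created in the us-east-1 Region
wafv2 = boto3.client("wafv2", region_name="us-east-1")

response = wafv2.create_web_acl(
    Name="ddos-resilience-demo",  # assumed name
    Scope="CLOUDFRONT",
    DefaultAction={"Allow": {}},
    Description="Blocks IPs that exceed a request-rate threshold",
    Rules=[
        {
            "Name": "rate-limit-per-ip",
            "Priority": 0,
            "Statement": {
                "RateBasedStatement": {
                    "Limit": 2000,  # requests per 5-minute window, per source IP
                    "AggregateKeyType": "IP",
                }
            },
            "Action": {"Block": {}},
            "VisibilityConfig": {
                "SampledRequestsEnabled": True,
                "CloudWatchMetricsEnabled": True,
                "MetricName": "RateLimitPerIp",
            },
        }
    ],
    VisibilityConfig={
        "SampledRequestsEnabled": True,
        "CloudWatchMetricsEnabled": True,
        "MetricName": "DdosResilienceDemo",
    },
)
print(response["Summary"]["ARN"])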

Putting our knowledge to work for AWS customers

AWS remains vigilant, working to help prevent security issues from causing disruption to your business. We believe it’s important to share not only how our services are designed, but also how our engineers take deep, proactive ownership of every aspect of our services. As we work to defend our infrastructure and your data, we look for ways to help protect you automatically. Whenever possible, AWS Security and its systems disrupt threats where that action will be most impactful; often, this work happens largely behind the scenes. We work to mitigate threats by combining our global-scale threat intelligence and engineering expertise to help make our services more resilient against malicious activities. We’re constantly looking around corners to improve the efficiency and security of services including the protocols we use in our services, such as Amazon CloudFront, as well as AWS security tools like AWS WAF, AWS Shield, and Amazon Route 53 Resolver DNS Firewall.

In addition, our work extends security protections and improvements far beyond the bounds of AWS itself. AWS regularly works with the wider community, such as computer emergency response teams (CERT), internet service providers (ISP), domain registrars, or government agencies, so that they can help disrupt an identified threat. We also work closely with the security community, other cloud providers, content delivery networks (CDNs), and collaborating businesses around the world to isolate and take down threat actors. For example, in the first quarter of 2023, we stopped over 1.3 million botnet-driven DDoS attacks, and we traced back and worked with external parties to dismantle the sources of 230 thousand L7/HTTP DDoS attacks. The effectiveness of our mitigation strategies relies heavily on our ability to quickly capture, analyze, and act on threat intelligence. By taking these steps, AWS is going beyond just typical DDoS defense, and moving our protection beyond our borders. To learn more behind this effort, please read How AWS threat intelligence deters threat actors.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Mark Ryland

Mark is the director of the Office of the CISO for AWS. He has over 30 years of experience in the technology industry, and has served in leadership roles in cybersecurity, software engineering, distributed systems, technology standardization, and public policy. Previously, he served as the Director of Solution Architecture and Professional Services for the AWS World Public Sector team.

Tom Scholl

Tom is Vice President and Distinguished Engineer at AWS.

Delegating permission set management and account assignment in AWS IAM Identity Center

Post Syndicated from Jake Barker original https://aws.amazon.com/blogs/security/delegating-permission-set-management-and-account-assignment-in-aws-iam-identity-center/

In this blog post, we look at how you can use AWS IAM Identity Center (successor to AWS Single Sign-On) to delegate the management of permission sets and account assignments. Delegating the day-to-day administration of user identities and entitlements allows teams to move faster and reduces the burden on your central identity administrators.

IAM Identity Center helps you securely create or connect your workforce identities and manage their access centrally across AWS accounts and applications. Identity Center requires accounts to be managed by AWS Organizations. Administration of Identity Center can be delegated to a member account (an account other than the management account). We recommend that you delegate Identity Center administration to limit who has access to the management account and use the management account only for tasks that require the management account.

Delegated administration is different from the delegation of permission sets and account assignments, which this blog covers. For more information on delegated administration, see Getting started with AWS IAM Identity Center delegated administration. The patterns in this blog post work whether Identity Center is delegated to a member account or remains in the management account.

Permission sets are used to define the level of access that users or groups have to an AWS account. Permission sets can contain AWS managed policies, customer managed policies, inline policies, and permissions boundaries.

Solution overview

As your organization grows, you might want to start delegating permissions management and account assignment to give your teams more autonomy and reduce the burden on your identity team. Alternatively, you might have different business units or customers, operating out of their own organizational units (OUs), that want more control over their own identity management.

In this scenario, an example organization has three developer teams: Red, Blue, and Yellow. Each of the teams operate out of its own OU. IAM Identity Center has been delegated from the management account to the Identity Account. Figure 1 shows the structure of the example organization.

Figure 1: The structure of the organization in the example scenario

The organization in this scenario has an existing collection of permission sets. They want to delegate the management of permission sets and account assignments away from their central identity management team.

  • The Red team wants to be able to assign the existing permission sets to accounts in its OU. This is an accounts-based model.
  • The Blue team wants to edit and use a single permission set and then assign that set to the team’s single account. This is a permission-based model.
  • The Yellow team wants to create, edit, and use a permission set tagged with Team: Yellow and then assign that set to all of the accounts in its OU. This is a tag-based model.

We’ll look at the permission sets needed for these three use cases.

Note: If you’re using the AWS Management Console, additional permissions are required.

Use case 1: Accounts-based model

In this use case, the Red team is given permission to assign existing permission sets to the three accounts in its OU. This will also include permissions to remove account assignments.

Using this model, an organization can create generic permission sets that can be assigned to its AWS accounts. It helps reduce complexity for the delegated administrators and verifies that they are using permission sets that follow the organization’s best practices. These permission sets restrict access based on services and features within those services, rather than specific resources.

{
  "Version": "2012-10-17",
  "Statement": [
    {
        "Effect": "Allow",
        "Action": [
            "sso:CreateAccountAssignment",
            "sso:DeleteAccountAssignment",
            "sso:ProvisionPermissionSet"
        ],
        "Resource": [
            "arn:aws:sso:::instance/ssoins-<sso-ins-id>",
            "arn:aws:sso:::permissionSet/ssoins-<sso-ins-id>/*",
            "arn:aws:sso:::account/112233445566",
            "arn:aws:sso:::account/223344556677",
            "arn:aws:sso:::account/334455667788"

        ]
    }
  ]
}

In the preceding policy, the principal can assign existing permission sets to the three AWS accounts with the IDs 112233445566, 223344556677 and 334455667788. This includes administration permission sets, so carefully consider which accounts you allow the permission sets to be assigned to.
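With the preceding policy in place, a delegated administrator on the Red team could assign an existing permission set to one of those accounts. The following is a hypothetical sketch using the AWS SDK for Python (Boto3); the permission set ARN and the Identity Center group ID are placeholders rather than values from this post.

import boto3

sso_admin = boto3.client("sso-admin")

# Placeholder values: replace with your instance ARN, permission set ARN,
# target account ID, and the Identity Center group ID to assign
response = sso_admin.create_account_assignment(
    InstanceArn="arn:aws:sso:::instance/ssoins-<sso-ins-id>",
    PermissionSetArn="arn:aws:sso:::permissionSet/ssoins-<sso-ins-id>/ps-<ps-id>",
    TargetId="112233445566",
    TargetType="AWS_ACCOUNT",
    PrincipalType="GROUP",
    PrincipalId="<identity-store-group-id>",
)
print(response["AccountAssignmentCreationStatus"]["Status"])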

The arn:aws:sso:::instance/ssoins-<sso-ins-id> is the IAM Identity Center instance ID ARN. It can be found using either the AWS Command Line Interface (AWS CLI) v2 with the list-instances API or the AWS Management Console.

Use the AWS CLI

Use the AWS Command Line Interface (AWS CLI) to run the following command:

aws sso-admin list-instances

You can also use AWS CloudShell to run the command.

Use the AWS Management Console

Use the Management Console to navigate to the IAM Identity Center in your AWS Region and then select Choose your identity source on the dashboard.

Figure 2: The IAM Identity Center instance ID ARN in the console

Use case 2: Permission-based model

For this example, the Blue team is given permission to edit one or more specific permission sets and then assign those permission sets to a single account. The following permissions allow the team to use managed and inline policies.

This model allows the delegated administrator to use fine-grained permissions on a specific AWS account. It’s useful when the team wants total control over the permissions in its AWS account, including the ability to create additional roles with administrative permissions. In these cases, the permissions are often better managed by the team that operates the account because it has a better understanding of the services and workloads.

Granting complete control over permissions can lead to unintended or undesired outcomes. Permission sets are still subject to IAM evaluation and authorization, which means that service control policies (SCPs) can be used to deny specific actions.

{
  "Version": "2012-10-17",
  "Statement": [
    {
        "Effect": "Allow",
        "Action": [
            "sso:AttachCustomerManagedPolicyReferenceToPermissionSet",
            "sso:AttachManagedPolicyToPermissionSet",
            "sso:CreateAccountAssignment",
            "sso:DeleteAccountAssignment",
            "sso:DeleteInlinePolicyFromPermissionSet",
            "sso:DetachCustomerManagedPolicyReferenceFromPermissionSet",
            "sso:DetachManagedPolicyFromPermissionSet",
            "sso:ProvisionPermissionSet",
            "sso:PutInlinePolicyToPermissionSet",
            "sso:UpdatePermissionSet"
        ],
        "Resource": [
            "arn:aws:sso:::instance/ssoins-<sso-ins-id>",
            "arn:aws:sso:::permissionSet/ssoins-<sso-ins-id>/ps-1122334455667788",
            "arn:aws:sso:::account/445566778899"
        ]
    }
  ]
}

Here, the principal can edit the permission set arn:aws:sso:::permissionSet/ssoins-<sso-ins-id>/ps-1122334455667788 and assign it to the AWS account 445566778899. The editing rights include customer managed policies, AWS managed policies, and inline policies.

If you want to use the preceding policy, replace the missing and example resource values with your own IAM Identity Center instance ID and account numbers.

In the preceding policy, the arn:aws:sso:::permissionSet/ssoins-<sso-ins-id>/ps-1122334455667788 is the permission set ARN. You can find this ARN through the console, or by using the AWS CLI command to list all of the permission sets:

aws sso-admin list-permission-sets --instance-arn <instance arn from above>

This permission set can also be applied to multiple accounts—similar to the first use case—by adding additional account IDs to the list of resources. Likewise, additional permission sets can be added so that the user can edit multiple permission sets and assign them to a set of accounts.
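
As a sketch of how the Blue team might exercise these permissions, the following boto3 example replaces the permission set's inline policy and then re-provisions it to the assigned account. The inline policy content is a placeholder; substitute your own instance ID, permission set ARN, and policy.

import json
import boto3

sso_admin = boto3.client('sso-admin')
instance_arn = 'arn:aws:sso:::instance/ssoins-<sso-ins-id>'
permission_set_arn = 'arn:aws:sso:::permissionSet/ssoins-<sso-ins-id>/ps-1122334455667788'

# Placeholder inline policy for illustration only.
inline_policy = {
    'Version': '2012-10-17',
    'Statement': [{'Effect': 'Allow', 'Action': 's3:ListAllMyBuckets', 'Resource': '*'}]
}

sso_admin.put_inline_policy_to_permission_set(
    InstanceArn=instance_arn,
    PermissionSetArn=permission_set_arn,
    InlinePolicy=json.dumps(inline_policy),
)

# Push the updated permission set to the account it is assigned to.
sso_admin.provision_permission_set(
    InstanceArn=instance_arn,
    PermissionSetArn=permission_set_arn,
    TargetType='AWS_ACCOUNT',
    TargetId='445566778899',
)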

Use case 3: Tag-based model

For this example, the Yellow team is given permission to create, edit, and use permission sets tagged with Team: Yellow. Then they can assign those tagged permission sets to all of their accounts.

This example can be used by an organization to allow a team to freely create and edit permission sets and then assign them to the team’s accounts. It uses tagging as a mechanism to control which permission sets can be created and edited. Permission sets without the correct tag cannot be altered.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sso:CreatePermissionSet",
                "sso:DescribePermissionSet",
                "sso:UpdatePermissionSet",
                "sso:DeletePermissionSet",
                "sso:DescribePermissionSetProvisioningStatus",
                "sso:DescribeAccountAssignmentCreationStatus",
                "sso:DescribeAccountAssignmentDeletionStatus",
                "sso:TagResource"
            ],
            "Resource": [
                "arn:aws:sso:::instance/ssoins-<sso-ins-id>"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "sso:DescribePermissionSet",
                "sso:UpdatePermissionSet",
                "sso:DeletePermissionSet"
                	
            ],
            "Resource": "arn:aws:sso:::permissionSet/ssoins-<sso-ins-id>/*",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/Team": "Yellow"
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": [
                "sso:CreatePermissionSet"
            ],
            "Resource": "arn:aws:sso:::permissionSet/ssoins-<sso-ins-id>/*",
            "Condition": {
                "StringEquals": {
                    "aws:RequestTag/Team": "Yellow"
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": "sso:TagResource",
            "Resource": "arn:aws:sso:::permissionSet/ssoins-<sso-ins-id>/*",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/Team": "Yellow",
                    "aws:RequestTag/Team": "Yellow"
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": [
                "sso:CreateAccountAssignment",
                "sso:DeleteAccountAssignment",
                "sso:ProvisionPermissionSet"
            ],
            "Resource": [
                "arn:aws:sso:::instance/ssoins-<sso-ins-id>",
                "arn:aws:sso:::account/556677889900",
                "arn:aws:sso:::account/667788990011",
                "arn:aws:sso:::account/778899001122"
            ]
        },
        {
            "Sid": "InlinePolicy",
            "Effect": "Allow",
            "Action": [
                "sso:GetInlinePolicyForPermissionSet",
                "sso:PutInlinePolicyToPermissionSet",
                "sso:DeleteInlinePolicyFromPermissionSet"
            ],
            "Resource": [
                "arn:aws:sso:::instance/ssoins-<sso-ins-id>"
            ]
        },
        {
            "Sid": "InlinePolicyABAC",
            "Effect": "Allow",
            "Action": [
                "sso:GetInlinePolicyForPermissionSet",
                "sso:PutInlinePolicyToPermissionSet",
                "sso:DeleteInlinePolicyFromPermissionSet"
            ],
            "Resource": "arn:aws:sso:::permissionSet/ssoins-<sso-ins-id>/*",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/Team": "Yellow"
                }
            }
        }
    ]
}

In the preceding policy, the principal is allowed to create new permission sets only with the tag Team: Yellow, and assign only permission sets tagged with Team: Yellow to the AWS accounts with ID 556677889900, 667788990011, and 778899001122.

The principal can only edit the inline policies of the permission sets tagged with Team: Yellow and cannot change the tags of the permission sets that are already tagged for another team.
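
The following minimal boto3 sketch shows what the Yellow team's workflow could look like under this policy: creating a permission set that carries the required Team: Yellow tag and assigning it to one of the team's accounts. The permission set name and the group ID are hypothetical.

import boto3

sso_admin = boto3.client('sso-admin')
instance_arn = 'arn:aws:sso:::instance/ssoins-<sso-ins-id>'

# The Team: Yellow tag satisfies the policy's aws:RequestTag condition.
permission_set = sso_admin.create_permission_set(
    InstanceArn=instance_arn,
    Name='YellowTeamReadOnly',  # hypothetical name
    Description='Permission set managed by the Yellow team',
    Tags=[{'Key': 'Team', 'Value': 'Yellow'}],
)['PermissionSet']

# Assign the new permission set to one of the allowed accounts.
sso_admin.create_account_assignment(
    InstanceArn=instance_arn,
    PermissionSetArn=permission_set['PermissionSetArn'],
    TargetType='AWS_ACCOUNT',
    TargetId='556677889900',
    PrincipalType='GROUP',
    PrincipalId='<identity-store-group-id>',  # placeholder
)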

If you want to use this policy, replace the missing and example resource values with your own IAM Identity Center instance ID, tags, and account numbers.

Note: The policy above assumes that there are no additional statements applying to the principal. If you require additional allow statements, verify that the resulting policy doesn’t create a risk of privilege escalation. You can review Controlling access to AWS resources using tags for additional information.

This policy only allows the delegation of permission sets using inline policies. Customer managed policies are IAM policies that are deployed to and are unique to each AWS account. When you create a permission set with a customer managed policy, you must create an IAM policy with the same name and path in each AWS account where IAM Identity Center assigns the permission set. If the IAM policy doesn’t exist, Identity Center won’t make the account assignment. For more information on how to use customer managed policies with Identity Center, see How to use customer managed policies in AWS IAM Identity Center for advanced use cases.
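
Before Identity Center can assign such a permission set, the referenced policy must already exist in each target account. The following is a minimal boto3 sketch of creating it there; the policy name, path, and document are assumptions and must match the customer managed policy reference in the permission set.

import json
import boto3

# Use credentials for the member account that will receive the assignment.
iam = boto3.client('iam')

policy_document = {
    'Version': '2012-10-17',
    'Statement': [{'Effect': 'Allow', 'Action': 's3:ListAllMyBuckets', 'Resource': '*'}]
}

iam.create_policy(
    PolicyName='YellowTeamBaseline',  # hypothetical name, must match the reference
    Path='/teams/yellow/',            # hypothetical path, must match the reference
    PolicyDocument=json.dumps(policy_document),
)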

You can extend the policy to allow the delegation of customer managed policies with these two statements:

{
    "Sid": "CustomerManagedPolicy",
    "Effect": "Allow",
    "Action": [
        "sso:ListCustomerManagedPolicyReferencesInPermissionSet",
        "sso:AttachCustomerManagedPolicyReferenceToPermissionSet",
        "sso:DetachCustomerManagedPolicyReferenceFromPermissionSet"
    ],
    "Resource": [
        "arn:aws:sso:::instance/ssoins-<sso-ins-id>"
    ]
},
{
    "Sid": "CustomerManagedABAC",
    "Effect": "Allow",
    "Action": [
        "sso:ListCustomerManagedPolicyReferencesInPermissionSet",
        "sso:AttachCustomerManagedPolicyReferenceToPermissionSet",
        "sso:DetachCustomerManagedPolicyReferenceFromPermissionSet"
    ],
    "Resource": "arn:aws:sso:::permissionSet/ssoins-<sso-ins-id>/*",
    "Condition": {
        "StringEquals": {
            "aws:ResourceTag/Team": "Yellow"
        }
    }
}

Note: Both statements are required, as only the resource type PermissionSet supports the condition key aws:ResourceTag/${TagKey}, and the actions listed require access to both the Instance and PermissionSet resource type. See Actions, resources, and condition keys for AWS IAM Identity Center for more information.

Best practices

Here are some best practices to consider when delegating management of permission sets and account assignments:

  • Assign permissions to edit specific permission sets. Allowing roles to edit every permission set could allow that role to edit their own permission set.
  • Only allow administrators to manage groups. Users with rights to edit group membership could add themselves to any group, including a group reserved for organization administrators.

If you’re using IAM Identity Center in a delegated account, you should also be aware of the best practices for delegated administration.

Summary

Organizations can empower teams by delegating the management of permission sets and account assignments in IAM Identity Center. Delegating these actions can allow teams to move faster and reduce the burden on the central identity management team.

The scenario and examples share delegation concepts that can be combined and scaled up within your organization. If you have feedback about this blog post, submit comments in the Comments section. If you have questions, start a new thread on AWS re:Post with the Identity Center tag.

Want more AWS Security news? Follow us on Twitter.

Jake Barker

Jake is a Senior Security Consultant with AWS Professional Services. He loves making security accessible to customers and eating great pizza.

Roberto Migli

Roberto is a Principal Solutions Architect at AWS. Roberto supports global financial services customers, focusing on security and identity and access management. In his free time he enjoys building electronics gadgets and spending time with his family.

Using AWS AppSync and AWS Lake Formation to access a secure data lake through a GraphQL API

Post Syndicated from Rana Dutt original https://aws.amazon.com/blogs/big-data/using-aws-appsync-and-aws-lake-formation-to-access-a-secure-data-lake-through-a-graphql-api/

Data lakes have been gaining popularity for storing vast amounts of data from diverse sources in a scalable and cost-effective way. As the number of data consumers grows, data lake administrators often need to implement fine-grained access controls for different user profiles. They might need to restrict access to certain tables or columns depending on the type of user making the request. Also, businesses sometimes want to make data available to external applications but aren’t sure how to do so securely. To address these challenges, organizations can turn to GraphQL and AWS Lake Formation.

GraphQL provides a powerful, secure, and flexible way to query and retrieve data. AWS AppSync is a service for creating GraphQL APIs that can query multiple databases, microservices, and APIs from one unified GraphQL endpoint.

Data lake administrators can use Lake Formation to govern access to data lakes. Lake Formation offers fine-grained access controls for managing user and group permissions at the table, column, and cell level, helping ensure data security and compliance. Additionally, Lake Formation integrates with other AWS services, such as Amazon Athena, making it ideal for querying data lakes through APIs.

In this post, we demonstrate how to build an application that can extract data from a data lake through a GraphQL API and deliver the results to different types of users based on their specific data access privileges. The example application described in this post was built by AWS Partner NETSOL Technologies.

Solution overview

Our solution uses Amazon Simple Storage Service (Amazon S3) to store the data, AWS Glue Data Catalog to house the schema of the data, and Lake Formation to provide governance over the AWS Glue Data Catalog objects by implementing role-based access. We also use Amazon EventBridge to capture events in our data lake and launch downstream processes. The solution architecture is shown in the following diagram.

Figure 1 – Solution architecture

The following is a step-by-step description of the solution:

  1. The data lake is created in an S3 bucket registered with Lake Formation. Whenever new data arrives, an EventBridge rule is invoked.
  2. The EventBridge rule runs an AWS Lambda function to start an AWS Glue crawler to discover new data and update any schema changes so that the latest data can be queried.
    Note: AWS Glue crawlers can also be launched directly from Amazon S3 events, as described in this blog post.
  3. AWS Amplify allows users to sign in using Amazon Cognito as an identity provider. Cognito authenticates the user’s credentials and returns access tokens.
  4. Authenticated users invoke an AWS AppSync GraphQL API through Amplify, fetching data from the data lake. A Lambda function is run to handle the request.
  5. The Lambda function retrieves the user details from Cognito and assumes the AWS Identity and Access Management (IAM) role associated with the requesting user’s Cognito user group.
  6. The Lambda function then runs an Athena query against the data lake tables and returns the results to AWS AppSync, which then returns the results to the user.

Prerequisites

To deploy this solution, you must first clone the solution repository to your local machine:

git clone git@github.com:aws-samples/aws-appsync-with-lake-formation.git
cd aws-appsync-with-lake-formation

Prepare Lake Formation permissions

Sign in to the Lake Formation console and add yourself as an administrator. If you're signing in to Lake Formation for the first time, you can do this by selecting Add myself on the Welcome to Lake Formation screen and choosing Get started as shown in Figure 2.

Figure 2 – Add yourself as the Lake Formation administrator

Otherwise, you can choose Administrative roles and tasks in the left navigation bar and choose Manage Administrators to add yourself. You should see your IAM username under Data lake administrators with Full access when done.

Select Data catalog settings in the left navigation bar and make sure the two IAM access control boxes are not selected, as shown in Figure 3. You want Lake Formation, not IAM, to control access to new databases.

Figure 3 – Lake Formation data catalog settings

Deploy the solution

To create the solution in your AWS environment, launch the AWS CloudFormation stack by choosing Launch CloudFormation Stack.

The following resources will be launched through the CloudFormation template:

  • Amazon VPC and networking components (subnets, security groups, and NAT gateway)
  • IAM roles
  • Lake Formation encapsulating S3 bucket, AWS Glue crawler, and AWS Glue database
  • Lambda functions
  • Cognito user pool
  • AWS AppSync GraphQL API
  • EventBridge rules

After the required resources have been deployed from the CloudFormation stack, you must create two Lambda functions and upload the dataset to Amazon S3. Lake Formation will govern the data lake that is stored in the S3 bucket.

Create the Lambda functions

Whenever a new file is placed in the designated S3 bucket, an EventBridge rule is invoked, which launches a Lambda function to initiate the AWS Glue crawler. The crawler updates the AWS Glue Data Catalog to reflect any changes to the schema.
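
For orientation, a crawler-starting Lambda handler along these lines might look like the following sketch. The actual function code is in the repository's lambdas/crawler-lambda folder; the CRAWLER_NAME environment variable used here is an assumption.

import os
import boto3

glue_client = boto3.client('glue')

def lambda_handler(event, context):
    crawler_name = os.environ['CRAWLER_NAME']  # assumed environment variable
    try:
        # Start the crawler so the Data Catalog picks up the new objects.
        glue_client.start_crawler(Name=crawler_name)
        return {'status': f'Started crawler {crawler_name}'}
    except glue_client.exceptions.CrawlerRunningException:
        # The crawler is already processing an earlier batch of files.
        return {'status': f'Crawler {crawler_name} is already running'}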

When the application makes a query for data through the GraphQL API, a request handler Lambda function is invoked to process the query and return the results.

To create these two Lambda functions, proceed as follows.

  1. Sign in to the Lambda console.
  2. Select the crawler Lambda function named dl-dev-crawlerLambdaFunction.
  3. Find the crawler Lambda function file in your lambdas/crawler-lambda folder in the git repo that you cloned to your local machine.
  4. Copy and paste the code in that file to the Code section of the dl-dev-crawlerLambdaFunction in your Lambda console. Then choose Deploy to deploy the function.

Figure 4 – Copy and paste code into the Lambda function

  5. Repeat steps 2 through 4 for the request handler function named dl-dev-requestHandlerLambdaFunction using the code in lambdas/request-handler-lambda.

Create a layer for the request handler Lambda

You now must upload some additional library code needed by the request handler Lambda function.

  1. Select Layers in the left menu and choose Create layer.
  2. Enter a name such as appsync-lambda-layer.
  3. Download this package layer ZIP file to your local machine.
  4. Upload the ZIP file using the Upload button on the Create layer page.
  5. Choose Python 3.7 as the runtime for the layer.
  6. Choose Create.
  7. Select Functions on the left menu and select the dl-dev-requestHandler Lambda function.
  8. Scroll down to the Layers section and choose Add a layer.
  9. Select the Custom layers option and then select the layer you created above.
  10. Choose Add.

Upload the data to Amazon S3

Navigate to the root directory of the cloned git repository and run the following commands to upload the sample dataset. Replace the bucket_name placeholder with the name of the S3 bucket provisioned by the CloudFormation template. You can get the bucket name from the CloudFormation console by going to the Outputs tab and finding the value for the key datalakes3bucketName, as shown in the following image.

Figure 5 – S3 bucket name shown in CloudFormation Outputs tab

Enter the following commands in your project folder in your local machine to upload the dataset to the S3 bucket.

cd dataset
aws s3 cp . s3://bucket_name/ --recursive

Now let’s take a look at the deployed artifacts.

Data lake

The S3 bucket holds sample data for two entities: companies and their respective owners. The bucket is registered with Lake Formation, as shown in Figure 6. This enables Lake Formation to create and manage data catalogs and manage permissions on the data.

Figure 6 – Lake Formation console showing data lake location

Figure 6 – Lake Formation console showing data lake location

A database is created to hold the schema of the data present in Amazon S3. An AWS Glue crawler is used to update the schema for any changes in the S3 bucket. The crawler is granted permission to CREATE, ALTER, and DROP tables in the database using Lake Formation.

Apply data lake access controls

Two IAM roles are created, dl-us-east-1-developer and dl-us-east-1-business-analyst, each assigned to a different Cognito user group. Each role is assigned different authorizations through Lake Formation. The Developer role gains access to every column in the data lake, while the Business Analyst role is only granted access to the non-personally identifiable information (PII) columns.
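
For reference, the kind of Lake Formation grant behind the Business Analyst role can be expressed with boto3 as shown in the following sketch: SELECT on the table with the PII columns excluded. The database, table, column, and account values are placeholders rather than the exact names created by the CloudFormation template.

import boto3

lakeformation = boto3.client('lakeformation')

lakeformation.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': 'arn:aws:iam::111122223333:role/dl-us-east-1-business-analyst'},
    Resource={
        'TableWithColumns': {
            'DatabaseName': 'dl_dev_database',   # placeholder database name
            'Name': 'companies',                 # placeholder table name
            'ColumnWildcard': {'ExcludedColumnNames': ['first_name', 'last_name']}  # assumed PII columns
        }
    },
    Permissions=['SELECT'],
)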

Figure 7 – Lake Formation console data lake permissions assigned to group roles

GraphQL schema

The GraphQL API is viewable from the AWS AppSync console. The Companies type includes several attributes describing the owners of the companies.

Figure 8 – Schema for GraphQL API

The data source for the GraphQL API is a Lambda function, which handles the requests.

Figure 9 – AWS AppSync data source mapped to Lambda function

Handling the GraphQL API requests

The GraphQL API request handler Lambda function retrieves the Cognito user pool ID from the environment variables. Using the boto3 library, you create a Cognito client and use the get_group method to obtain the IAM role associated with the Cognito user group.
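
The snippets that follow assume client setup along these lines; the environment variable name is an assumption, not necessarily the repository's exact code.

import os
import boto3

cognito_user_pool_id = os.environ['COGNITO_USER_POOL_ID']  # assumed variable name
cognito_idp_client = boto3.client('cognito-idp')
sts_client = boto3.client('sts')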

You use a helper function in the Lambda function to obtain the role.

def get_cognito_group_role(group_name):
    response = cognito_idp_client.get_group(
        GroupName=group_name,
        UserPoolId=cognito_user_pool_id
    )
    print(response)
    role_arn = response.get('Group').get('RoleArn')
    return role_arn

Using the AWS Security Token Service (AWS STS) through a boto3 client, you can assume the IAM role and obtain the temporary credentials you need to run the Athena query.

def get_temp_creds(role_arn):
    response = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName='stsAssumeRoleAthenaQuery',
    )
    return (response['Credentials']['AccessKeyId'],
            response['Credentials']['SecretAccessKey'],
            response['Credentials']['SessionToken'])

We pass the temporary credentials as parameters when creating our Boto3 Amazon Athena client.

athena_client = boto3.client('athena', aws_access_key_id=access_key, aws_secret_access_key=secret_key, aws_session_token=session_token)

The client and query are passed into our Athena query helper function, which runs the query and returns a query ID. With the query ID, we can read the results from Amazon S3 and bundle them as a list of Python dictionaries to be returned in the response.
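
One possible shape of that helper is sketched below: it starts the query, polls until Athena finishes, and returns the query ID together with the output location. The exact implementation in the repository may differ.

import time

def run_athena_query(athena_client, query, database, output_location):
    query_id = athena_client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={'Database': database},
        ResultConfiguration={'OutputLocation': output_location},
    )['QueryExecutionId']
    # Poll until the query reaches a terminal state.
    while True:
        state = athena_client.get_query_execution(QueryExecutionId=query_id)['QueryExecution']['Status']['State']
        if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
            break
        time.sleep(1)
    if state != 'SUCCEEDED':
        raise RuntimeError(f'Athena query ended in state {state}')
    return query_id, output_location

The query's output location is then read by the get_query_result helper that follows.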

def get_query_result(s3_client, output_location):
    bucket, object_key_path = get_bucket_and_path(output_location)
    response = s3_client.get_object(Bucket=bucket, Key=object_key_path)
    status = response.get("ResponseMetadata", {}).get("HTTPStatusCode")
    result = []
    if status == 200:
        print(f"Successful S3 get_object response. Status - {status}")
        df = pandas.read_csv(response.get("Body"))
        df = df.fillna('')
        result = df.to_dict('records')
        print(result)
    else:
        print(f"Unsuccessful S3 get_object response. Status - {status}")
    return result

Enabling client-side access to the data lake

On the client side, AWS Amplify is configured with an Amazon Cognito user pool for authentication. We’ll navigate to the Amazon Cognito console to view the user pool and groups that were created.

Figure 10 – Amazon Cognito user pools

For our sample application we have two groups in our user pool:

  • dl-dev-businessAnalystUserGroup – Business analysts with limited permissions.
  • dl-dev-developerUserGroup – Developers with full permissions.

If you explore these groups, you'll see an IAM role associated with each. This is the IAM role that is assigned to the user when they authenticate. The request handler Lambda function assumes this role when querying the data lake with Athena.

If you view the permissions for this IAM role, you’ll notice that it doesn’t include access controls below the table level. You need the additional layer of governance provided by Lake Formation to add fine-grained access control.

After the user is verified and authenticated by Cognito, Amplify uses access tokens to invoke the AWS AppSync GraphQL API and fetch the data. Based on the user’s group, a Lambda function assumes the corresponding Cognito user group role. Using the assumed role, an Athena query is run and the result returned to the user.

Create test users

Create two users, one for dev and one for business analyst, and add them to user groups.

  1. Navigate to Amazon Cognito and select the user pool dl-dev-cognitoUserPool that was created.
  2. Choose Create user and provide the details to create a new business analyst user. The username can be biz-analyst. Leave the email address blank, and enter a password.
  3. Select the Users tab and select the user you just created.
  4. Add this user to the business analyst group by choosing the Add user to group button.
  5. Follow the same steps to create another user with the username developer and add the user to the developers group.

Test the solution

To test your solution, launch the React application on your local machine.

  1. In the cloned project directory, navigate to the react-app directory.
  2. Install the project dependencies.
npm install
  3. Install the Amplify CLI:
npm install -g @aws-amplify/cli
  4. Create a new file called .env by running the following commands. Then use a text editor to update the environment variable values in the file.
echo export REACT_APP_APPSYNC_URL=Your AppSync endpoint URL > .env
echo export REACT_APP_CLIENT_ID=Your Cognito app client ID >> .env
echo export REACT_APP_USER_POOL_ID=Your Cognito user pool ID >> .env

Use the Outputs tab of your CloudFormation stack in the console to get the required values for the following keys:

  • REACT_APP_APPSYNC_URL – appsyncApiEndpoint
  • REACT_APP_CLIENT_ID – cognitoUserPoolClientId
  • REACT_APP_USER_POOL_ID – cognitoUserPoolId
  5. Add the preceding variables to your environment.
source .env
  6. Generate the code needed to interact with the API using Amplify CodeGen. In the Outputs tab of your CloudFormation console, find your AWS AppSync API ID next to the appsyncApiId key.
amplify add codegen --apiId <appsyncApiId>

Accept all the default options for the above command by pressing Enter at each prompt.

  7. Start the application.
npm start

You can confirm that the application is running by visiting http://localhost:3000 and signing in as the developer user you created earlier.

Now that you have the application running, let’s take a look at how each role is served from the companies endpoint.

First, sign in as the developer user, which has access to all the fields, and make the API request to the companies endpoint. Note which fields you have access to.

Figure 11 – The results for the developer role

Now, sign in as the business analyst user and make the request to the same endpoint and compare the included fields.

Figure 12 – The results for the business analyst role

The First Name and Last Name columns of the companies list are excluded in the business analyst view even though you made the request to the same endpoint. This demonstrates the power of using one unified GraphQL endpoint together with multiple Cognito user group IAM roles mapped to Lake Formation permissions to manage role-based access to your data.

Cleaning up

After you’re done testing the solution, clean up the following resources to avoid incurring future charges:

  1. Empty the S3 buckets created by the CloudFormation template.
  2. Delete the CloudFormation stack to remove the S3 buckets and other resources.

Conclusion

In this post, we showed you how to securely serve data in a data lake to authenticated users of a React application based on their role-based access privileges. To accomplish this, you used GraphQL APIs in AWS AppSync, fine-grained access controls from Lake Formation, and Cognito for authenticating users by group and mapping them to IAM roles. You also used Athena to query the data.

For related reading on this topic, see Visualizing big data with AWS AppSync, Amazon Athena, and AWS Amplify and Design a data mesh architecture using AWS Lake Formation and AWS Glue.

Will you implement this approach for serving data from your data lake? Let us know in the comments!


About the Authors

Rana Dutt is a Principal Solutions Architect at Amazon Web Services. He has a background in architecting scalable software platforms for financial services, healthcare, and telecom companies, and is passionate about helping customers build on AWS.

Ranjith Rayaprolu is a Senior Solutions Architect at AWS working with customers in the Pacific Northwest. He helps customers design and operate Well-Architected solutions in AWS that address their business problems and accelerate the adoption of AWS services. He focuses on AWS security and networking technologies to develop solutions in the cloud across different industry verticals. Ranjith lives in the Seattle area and loves outdoor activities.

Justin Leto is a Sr. Solutions Architect at Amazon Web Services with specialization in databases, big data analytics, and machine learning. His passion is helping customers achieve better cloud adoption. In his spare time, he enjoys offshore sailing and playing jazz piano. He lives in New York City with his wife and baby daughter.

Simplify data transfer: Google BigQuery to Amazon S3 using Amazon AppFlow

Post Syndicated from Kartikay Khator original https://aws.amazon.com/blogs/big-data/simplify-data-transfer-google-bigquery-to-amazon-s3-using-amazon-appflow/

In today’s data-driven world, the ability to effortlessly move and analyze data across diverse platforms is essential. Amazon AppFlow, a fully managed data integration service, has been at the forefront of streamlining data transfer between AWS services, software as a service (SaaS) applications, and now Google BigQuery. In this blog post, you explore the new Google BigQuery connector in Amazon AppFlow and discover how it simplifies the process of transferring data from Google’s data warehouse to Amazon Simple Storage Service (Amazon S3), providing significant benefits for data professionals and organizations, including the democratization of multi-cloud data access.

Overview of Amazon AppFlow

Amazon AppFlow is a fully managed integration service that you can use to securely transfer data between SaaS applications such as Google BigQuery, Salesforce, SAP, Hubspot, and ServiceNow, and AWS services such as Amazon S3 and Amazon Redshift, in just a few clicks. With Amazon AppFlow, you can run data flows at nearly any scale at the frequency you choose—on a schedule, in response to a business event, or on demand. You can configure data transformation capabilities such as filtering and validation to generate rich, ready-to-use data as part of the flow itself, without additional steps. Amazon AppFlow automatically encrypts data in motion, and allows you to restrict data from flowing over the public internet for SaaS applications that are integrated with AWS PrivateLink, reducing exposure to security threats.

Introducing the Google BigQuery connector

The new Google BigQuery connector in Amazon AppFlow unveils possibilities for organizations seeking to use the analytical capability of Google’s data warehouse, and to effortlessly integrate, analyze, store, or further process data from BigQuery, transforming it into actionable insights.

Architecture

Let’s review the architecture to transfer data from Google BigQuery to Amazon S3 using Amazon AppFlow.

  1. Select a data source: In Amazon AppFlow, select Google BigQuery as your data source. Specify the tables or datasets you want to extract data from.
  2. Field mapping and transformation: Configure the data transfer using the intuitive visual interface of Amazon AppFlow. You can map data fields and apply transformations as needed to align the data with your requirements.
  3. Transfer frequency: Decide how frequently you want to transfer data—such as daily, weekly, or monthly—supporting flexibility and automation.
  4. Destination: Specify an S3 bucket as the destination for your data. Amazon AppFlow will efficiently move the data, making it accessible in your Amazon S3 storage.
  5. Consumption: Use Amazon Athena to analyze the data in Amazon S3.

Prerequisites

The dataset used in this solution is generated by Synthea, a synthetic patient population simulator and open-source project under the Apache License 2.0. Load this data into Google BigQuery or use your existing dataset.

Connect Amazon AppFlow to your Google BigQuery account

For this post, you use a Google account, OAuth client with appropriate permissions, and Google BigQuery data. To enable Google BigQuery access from Amazon AppFlow, you must set up a new OAuth client in advance. For instructions, see Google BigQuery connector for Amazon AppFlow.

Set up Amazon S3

Every object in Amazon S3 is stored in a bucket. Before you can store data in Amazon S3, you must create an S3 bucket to store the results.

Create a new S3 bucket for Amazon AppFlow results

To create an S3 bucket, complete the following steps:

  1. On the AWS Management Console for Amazon S3, choose Create bucket.
  2. Enter a globally unique name for your bucket; for example, appflow-bq-sample.
  3. Choose Create bucket.

Create a new S3 bucket for Amazon Athena results

To create an S3 bucket, complete the following steps:

  1. On the AWS Management Console for Amazon S3, choose Create bucket.
  2. Enter a globally unique name for your bucket; for example, athena-results.
  3. Choose Create bucket.

User role (IAM role) for AWS Glue Data Catalog

To catalog the data that you transfer with your flow, you must have the appropriate user role in AWS Identity and Access Management (IAM). You provide this role to Amazon AppFlow to grant the permissions it needs to create an AWS Glue Data Catalog, tables, databases, and partitions.

For an example IAM policy that has the required permissions, see Identity-based policy examples for Amazon AppFlow.

Walkthrough of the design

Now, let's walk through a practical use case to see how the Amazon AppFlow Google BigQuery to Amazon S3 connector works. For the use case, you will use Amazon AppFlow to archive historical data from Google BigQuery to Amazon S3 for long-term storage and analysis.

Set up Amazon AppFlow

Create a new Amazon AppFlow flow to transfer data from Google BigQuery to Amazon S3.

  1. On the Amazon AppFlow console, choose Create flow.
  2. Enter a name for your flow; for example, my-bq-flow.
  3. Add necessary Tags; for example, for Key enter env and for Value enter dev.

  4. Choose Next.
  5. For Source name, choose Google BigQuery.
  6. Choose Create new connection.
  7. Enter your OAuth Client ID and Client Secret, then name your connection; for example, bq-connection.

  8. In the pop-up window, choose to allow amazon.com access to the Google BigQuery API.

  9. For Choose Google BigQuery object, choose Table.
  10. For Choose Google BigQuery subobject, choose BigQueryProjectName.
  11. For Choose Google BigQuery subobject, choose DatabaseName.
  12. For Choose Google BigQuery subobject, choose TableName.
  13. For Destination name, choose Amazon S3.
  14. For Bucket details, choose the Amazon S3 bucket you created for storing Amazon AppFlow results in the prerequisites.
  15. Enter raw as a prefix.

  16. Next, provide AWS Glue Data Catalog settings to create a table for further analysis.
    1. Select the User role (IAM role) created in the prerequisites.
    2. Create a new database; for example, healthcare.
    3. Provide a table prefix setting; for example, bq.

  17. Select Run on demand.

  18. Choose Next.
  19. Select Manually map fields.
  20. Select the following six fields for Source field name from the table Allergies:
    1. Start
    2. Patient
    3. Code
    4. Description
    5. Type
    6. Category
  21. Choose Map fields directly.

  22. Choose Next.
  23. In the Add filters section, choose Next.
  24. Choose Create flow.

Run the flow

After creating your new flow, you can run it on demand.

  1. On the Amazon AppFlow console, choose my-bq-flow.
  2. Choose Run flow.

For this walkthrough, you run the job on demand for ease of understanding. In practice, you can choose a scheduled job and periodically extract only newly added data.

Query through Amazon Athena

When you select the optional AWS Glue Data Catalog settings, Data Catalog creates the catalog for the data, allowing Amazon Athena to perform queries.

If you’re prompted to configure a query results location, navigate to the Settings tab and choose Manage. Under Manage settings, choose the Athena results bucket created in prerequisites and choose Save.

  1. On the Amazon Athena console, select the Data Source as AWSDataCatalog.
  2. Next, select Database as healthcare.
  3. Now you can select the table created by the AWS Glue crawler and preview it.

  4. You can also run a custom query to find the top 10 allergies, as shown in the following query.

Note: In the following query, replace the table name (in this case, bq_appflow_mybqflow_1693588670_latest) with the name of the table generated in your AWS account.

SELECT type,
category,
"description",
count(*) as number_of_cases
FROM "healthcare"."bq_appflow_mybqflow_1693588670_latest"
GROUP BY type,
category,
"description"
ORDER BY number_of_cases DESC
LIMIT 10;

  5. Choose Run query.

This result shows the top 10 allergies by number of cases.
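
If you prefer to run the same query programmatically instead of in the Athena console, the following minimal boto3 sketch shows one way to do it. The table name and the results bucket are placeholders; replace them with the values from your account.

import time
import boto3

athena = boto3.client('athena')

QUERY = '''
SELECT type, category, "description", count(*) AS number_of_cases
FROM "healthcare"."bq_appflow_mybqflow_1693588670_latest"
GROUP BY type, category, "description"
ORDER BY number_of_cases DESC
LIMIT 10
'''

query_id = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={'Database': 'healthcare'},
    ResultConfiguration={'OutputLocation': 's3://athena-results/'},  # placeholder bucket
)['QueryExecutionId']

# Wait for the query to reach a terminal state before fetching results.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)['QueryExecution']['Status']['State']
    if state not in ('QUEUED', 'RUNNING'):
        break
    time.sleep(1)

if state == 'SUCCEEDED':
    rows = athena.get_query_results(QueryExecutionId=query_id)['ResultSet']['Rows']
    for row in rows[1:]:  # skip the header row
        print([col.get('VarCharValue', '') for col in row['Data']])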

Clean up

To avoid incurring charges, clean up the resources in your AWS account by completing the following steps:

  1. On the Amazon AppFlow console, choose Flows in the navigation pane.
  2. From the list of flows, select the flow my-bq-flow, and delete it.
  3. Enter delete to delete the flow.
  4. Choose Connections in the navigation pane.
  5. Choose Google BigQuery from the list of connectors, select bq-connection, and delete it.
  6. Enter delete to delete the connector.
  7. On the IAM console, choose Roles in the navigation pane, then select the role you created for the AWS Glue crawler and delete it.
  8. On the Amazon Athena console:
    1. Delete the tables created under the database healthcare using AWS Glue crawler.
    2. Drop the database healthcare.
  9. On the Amazon S3 console, search for the Amazon AppFlow results bucket you created, choose Empty to delete the objects, then delete the bucket.
  10. On the Amazon S3 console, search for the Amazon Athena results bucket you created, choose Empty to delete the objects, then delete the bucket.
  11. Clean up resources in your Google account by deleting the project that contains the Google BigQuery resources. Follow the documentation to clean up the Google resources.

Conclusion

The Google BigQuery connector in Amazon AppFlow streamlines the process of transferring data from Google’s data warehouse to Amazon S3. This integration simplifies analytics and machine learning, archiving, and long-term storage, providing significant benefits for data professionals and organizations seeking to harness the analytical capabilities of both platforms.

With Amazon AppFlow, the complexities of data integration are eliminated, enabling you to focus on deriving actionable insights from your data. Whether you’re archiving historical data, performing complex analytics, or preparing data for machine learning, this connector simplifies the process, making it accessible to a broader range of data professionals.

If you're interested in seeing how data transfer from Google BigQuery to Amazon S3 works with Amazon AppFlow, take a look at the step-by-step video tutorial. In this tutorial, we walk through the entire process, from setting up the connection to running the data transfer flow. For more information on Amazon AppFlow, visit Amazon AppFlow.


About the authors

Kartikay Khator is a Solutions Architect on the Global Life Sciences team at Amazon Web Services. He is passionate about helping customers on their cloud journey with a focus on AWS analytics services. He is an avid runner and enjoys hiking.

Kamen Sharlandjiev is a Sr. Big Data and ETL Solutions Architect and Amazon AppFlow expert. He's on a mission to make life easier for customers who are facing complex data integration challenges. His secret weapon? Fully managed, low-code AWS services that can get the job done with minimal effort and no coding.