Tag Archives: Amazon DynamoDB

Accelerate Amazon DynamoDB data access in AWS Glue jobs using the new AWS Glue DynamoDB Export connector

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/accelerate-amazon-dynamodb-data-access-in-aws-glue-jobs-using-the-new-aws-glue-dynamodb-elt-connector/

Modern data architectures encourage the integration of data lakes, data warehouses, and purpose-built data stores, enabling unified governance and easy data movement. With a modern data architecture on AWS, you can store data in a data lake and use a ring of purpose-built data services around the lake, allowing you to make decisions with speed and agility.

To achieve a modern data architecture, AWS Glue is the key service that integrates data across the data lake, data warehouse, and purpose-built data stores. AWS Glue simplifies data movement patterns such as inside-out, outside-in, and around the perimeter. A powerful purpose-built data store is Amazon DynamoDB, which is widely used by hundreds of thousands of companies, including Amazon.com. It’s common to move data from DynamoDB to a data lake built on top of Amazon Simple Storage Service (Amazon S3). Many customers move data from DynamoDB to Amazon S3 using AWS Glue extract, transform, and load (ETL) jobs.

Today, we’re pleased to announce the general availability of a new AWS Glue DynamoDB export connector. Built on top of the DynamoDB table export feature, it’s a scalable and cost-efficient way to read large DynamoDB tables in AWS Glue ETL jobs. This post describes the benefits of this new export connector and its use cases.

The following are typical use cases to read from DynamoDB tables using AWS Glue ETL jobs:

  • Move the data from DynamoDB tables to different data stores
  • Integrate the data with other services and applications
  • Retain historical snapshots for auditing
  • Build an S3 data lake from the DynamoDB data and analyze the data from various services, such as Amazon Athena, Amazon Redshift, and Amazon SageMaker

The new AWS Glue DynamoDB export connector

The old version of the AWS Glue DynamoDB connector reads DynamoDB tables through the DynamoDB Scan API. In contrast, the new AWS Glue DynamoDB export connector reads DynamoDB data from a snapshot that is exported from the DynamoDB table. This approach has the following benefits:

  • It doesn’t consume read capacity units of the source DynamoDB tables
  • The read performance is consistent for large DynamoDB tables

For large DynamoDB tables in particular (more than 100 GB), this new connector is significantly faster than the traditional connector.

To use this new export connector, you need to enable point-in-time recovery (PITR) for the source DynamoDB table in advance.
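If PITR isn’t enabled yet on your table, you can turn it on programmatically. The following is a minimal sketch using the AWS SDK for Python (Boto3); the table name is a placeholder.

import boto3

dynamodb = boto3.client("dynamodb")

# Enable point-in-time recovery (PITR) on the source table (table name is a placeholder)
dynamodb.update_continuous_backups(
    TableName="test_source",
    PointInTimeRecoverySpecification={"PointInTimeRecoveryEnabled": True}
)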

How to use the new connector on AWS Glue Studio Visual Editor

AWS Glue Studio Visual Editor is a graphical interface that makes it easy to create, run, and monitor AWS Glue ETL jobs in AWS Glue. The new DynamoDB export connector is available on AWS Glue Studio Visual Editor. You can choose Amazon DynamoDB as the source.

After you choose Create, you see the visual Directed Acyclic Graph (DAG). Here, you can choose a DynamoDB table that exists in your account and Region. This allows you to select DynamoDB tables (with PITR enabled) directly as a source in AWS Glue Studio, providing a one-click export from any of your DynamoDB tables to Amazon S3. You can also easily add other data sources, targets, or transformations to the DAG. For example, you can join two different DynamoDB tables and export the result to Amazon S3, as shown in the following screenshot.

The following two connection options are added automatically. The location they define is used to store temporary data during the DynamoDB export phase, and you can set S3 bucket lifecycle policies to expire that temporary data (see the sketch that follows the list).

  • dynamodb.s3.bucket – The S3 bucket to store temporary data during DynamoDB export
  • dynamodb.s3.prefix – The S3 prefix to store temporary data during DynamoDB export
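The following is a minimal sketch of such a lifecycle rule using the AWS SDK for Python (Boto3); the bucket name, prefix, and 7-day retention are illustrative assumptions, not values required by the connector.

import boto3

s3 = boto3.client("s3")

# Expire temporary DynamoDB export data under the configured prefix after 7 days
# (bucket name, prefix, and retention period are placeholders)
s3.put_bucket_lifecycle_configuration(
    Bucket="glue-dynamodb-temp-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-ddb-export-temp-data",
                "Status": "Enabled",
                "Filter": {"Prefix": "temp/ddb-export/"},
                "Expiration": {"Days": 7},
            }
        ]
    },
)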

How to use the new connector on the job script code

You can use the new export connector when you create an AWS Glue DynamicFrame in the job script code by configuring the following connection options:

  • dynamodb.export – (Required) You need to set this to ddb or s3
  • dynamodb.tableArn – (Required) Your source DynamoDB table ARN
  • dynamodb.unnestDDBJson – (Optional) If set to true, performs an unnest transformation of the DynamoDB JSON structure that is present in exports. The default value is false.
  • dynamodb.s3.bucket – (Optional) The S3 bucket to store temporary data during DynamoDB export
  • dynamodb.s3.prefix – (Optional) The S3 prefix to store temporary data during DynamoDB export

The following is the sample Python code to create a DynamicFrame using the new export connector:

dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.export": "ddb",              # trigger a new export from the live table
        "dynamodb.tableArn": "test_source",    # replace with the full source DynamoDB table ARN
        "dynamodb.unnestDDBJson": True,        # unnest the DynamoDB JSON structure in the export
        "dynamodb.s3.bucket": "bucket name",   # S3 bucket for temporary export data
        "dynamodb.s3.prefix": "bucket prefix"  # S3 prefix for temporary export data
    }
)
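Continuing the sample above, you can then write the DynamicFrame to your S3 data lake. The following is a minimal sketch; the output path and Parquet format are illustrative choices.

# Write the DynamicFrame to Amazon S3 in Parquet format (the path is a placeholder)
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://bucket name/ddb-export-output/"},
    format="parquet"
)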

Unlike the old connector, the new export connector doesn’t require configurations related to AWS Glue job parallelism, so you no longer need to change the configuration when you scale out the AWS Glue job. It also doesn’t require any configuration regarding DynamoDB table read/write capacity or its capacity mode (on-demand or provisioned).

DynamoDB table schema handling

By default, the new export connector reads data in DynamoDB JSON structure that is present in exports. The following is an example schema of the frame using the Amazon Customer Review Dataset:

root
|-- Item: struct (nullable = true)
| |-- product_id: struct (nullable = true)
| | |-- S: string (nullable = true)
| |-- review_id: struct (nullable = true)
| | |-- S: string (nullable = true)
| |-- total_votes: struct (nullable = true)
| | |-- N: string (nullable = true)
| |-- product_title: struct (nullable = true)
| | |-- S: string (nullable = true)
| |-- star_rating: struct (nullable = true)
| | |-- N: string (nullable = true)
| |-- customer_id: struct (nullable = true)
| | |-- S: string (nullable = true)
| |-- marketplace: struct (nullable = true)
| | |-- S: string (nullable = true)
| |-- helpful_votes: struct (nullable = true)
| | |-- N: string (nullable = true)
| |-- review_headline: struct (nullable = true)
| | |-- S: string (nullable = true)
| | |-- NULL: boolean (nullable = true)
| |-- review_date: struct (nullable = true)
| | |-- S: string (nullable = true)
| |-- vine: struct (nullable = true)
| | |-- S: string (nullable = true)
| |-- review_body: struct (nullable = true)
| | |-- S: string (nullable = true)
| | |-- NULL: boolean (nullable = true)
| |-- verified_purchase: struct (nullable = true)
| | |-- S: string (nullable = true)
| |-- product_category: struct (nullable = true)
| | |-- S: string (nullable = true)
| |-- year: struct (nullable = true)
| | |-- N: string (nullable = true)
| |-- product_parent: struct (nullable = true)
| | |-- S: string (nullable = true)

To read DynamoDB item columns without handling nested data, you can set dynamodb.unnestDDBJson to True. The following is an example of the schema of the same data where dynamodb.unnestDDBJson is set to True:

root
|-- product_id: string (nullable = true)
|-- review_id: string (nullable = true)
|-- total_votes: string (nullable = true)
|-- product_title: string (nullable = true)
|-- star_rating: string (nullable = true)
|-- customer_id: string (nullable = true)
|-- marketplace: string (nullable = true)
|-- helpful_votes: string (nullable = true)
|-- review_headline: string (nullable = true)
|-- review_date: string (nullable = true)
|-- vine: string (nullable = true)
|-- review_body: string (nullable = true)
|-- verified_purchase: string (nullable = true)
|-- product_category: string (nullable = true)
|-- year: string (nullable = true)
|-- product_parent: string (nullable = true)

Data freshness

Data freshness is the measure of how stale the data is relative to the live tables in the original source. With the new export connector, the dynamodb.export option impacts data freshness.

When dynamodb.export is set to ddb, the AWS Glue job invokes a new export and then reads the export placed in an S3 bucket into a DynamicFrame. Because it reads an export of the live table, the data is fresh. On the other hand, when dynamodb.export is set to s3, the AWS Glue job skips invoking a new export and directly reads an export already placed in an S3 bucket. Because it reads an export of the table as it was in the past, the data can be stale, but you avoid the overhead of triggering the export.
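For example, to read an export that already exists in Amazon S3 instead of triggering a new one, you can set dynamodb.export to s3. The following is a hedged sketch based on the connection options described earlier; the table ARN, bucket, and prefix are placeholders and assume a completed export already exists under that location.

# Read an existing DynamoDB export already placed in Amazon S3
# (table ARN, bucket, and prefix are placeholders; assumes a completed export)
dyf_from_export = glue_context.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.export": "s3",
        "dynamodb.tableArn": "test_source",
        "dynamodb.s3.bucket": "bucket name",
        "dynamodb.s3.prefix": "bucket prefix"
    }
)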

The following table explains the data freshness and pros and cons of each option.

Connector | dynamodb.export | Data Freshness | Data Source | Pros | Cons
New export connector | s3 | Stale | Export of a past table | RCUs are not consumed; can skip triggering exports | Data can be stale
New export connector | ddb | Fresh | Export of the live table | Data is fresh; RCUs are not consumed | Overhead to trigger exports and wait for completion
Old connector | N/A | Most fresh | Scan of the live tables | Data is fresh | Read capacity units (RCUs) are consumed

Performance

The following benchmark shows the performance improvements between the old version of the AWS Glue DynamoDB connector and the new export connector. The comparison uses DynamoDB tables storing the TPC-DS benchmark dataset at different scales from 10 MB to 2 TB. The sample Spark job reads from the DynamoDB table and calculates the count of the items. All the Spark jobs run on AWS Glue 3.0 with 60 G.2X workers.

The following chart compares AWS Glue job duration between the old connector and the new export connector. For small DynamoDB tables, the old connector is faster. For large tables of more than 80 GB, the new export connector is faster. In other words, the DynamoDB export connector is recommended for jobs that take more than 5–10 minutes to run with the old connector. The chart also shows that the duration of the new export connector increases slowly as data size grows, whereas the duration of the old connector increases rapidly. This means the new export connector is especially suitable for larger tables.

The following chart compares dollar cost between the old connector and the new export connector. It contains the AWS Glue DPU hour cost summed with the cost of reading data from DynamoDB. For the old connector, we include the read request cost. For the new export connector, we include the cost of the DynamoDB data export to Amazon S3. Both are calculated in DynamoDB on-demand capacity mode.

With AWS Glue Auto Scaling

AWS Glue Auto Scaling is a new feature to automatically resize computing resources for better performance at lower cost. You can take advantage of AWS Glue Auto Scaling with the new DynamoDB export connector.

As the following chart shows, with AWS Glue Auto Scaling, the duration of the new export connector is shorter than the old connector when the size of the source DynamoDB table is 100 GB or more. It shows a similar trend without AWS Glue Auto Scaling.

You also get cost benefits because only the Spark driver is active for most of the DynamoDB export phase (which is nearly 30% of the total job duration with the old scan-based connector).

Conclusion

AWS Glue is a key service for integrating multiple data stores. At AWS, we keep improving the performance and cost-efficiency of our services. In this post, we announced the availability of the new AWS Glue DynamoDB export connector. With this new connector, you can easily integrate large DynamoDB tables with other data stores. It helps you read large tables faster from AWS Glue jobs at lower cost.

The new AWS Glue DynamoDB export connector is now generally available in all AWS Regions where AWS Glue is supported. Start using the new AWS Glue DynamoDB export connector today! We look forward to your feedback and stories about how you use the connector for your needs.


About the Authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts that help customers build data lakes on the cloud.

Neil Gupta is a Software Development Engineer on the AWS Glue team. He enjoys tackling big data problems and learning more about distributed systems.

Andrew Kim is a Software Development Engineer on the AWS Glue team. His passion is to build scalable and effective solutions to challenging problems and working with distributed systems.

Savio Dsouza is a Software Development Manager on the AWS Glue team. His team works on distributed systems for efficiently managing data lakes on AWS and optimizing Apache Spark for performance and reliability.

Optimize Federated Query Performance using EXPLAIN and EXPLAIN ANALYZE in Amazon Athena

Post Syndicated from Nishchai JM original https://aws.amazon.com/blogs/big-data/optimize-federated-query-performance-using-explain-and-explain-analyze-in-amazon-athena/

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon Simple Storage Service (Amazon S3) using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. In 2019, Athena added support for federated queries to run SQL queries across data stored in relational, non-relational, object, and custom data sources.

In 2021, Athena added support for the EXPLAIN statement, which can help you understand and improve the efficiency of your queries. The EXPLAIN statement provides a detailed breakdown of a query’s run plan. You can analyze the plan to identify and reduce query complexity and improve its runtime. You can also use EXPLAIN to validate SQL syntax prior to running the query. Doing so helps prevent errors that would have occurred while running the query.

Athena also added EXPLAIN ANALYZE, which displays the computational cost of your queries alongside their run plans. Administrators can benefit from using EXPLAIN ANALYZE because it provides a scanned data count, which helps you reduce financial impact due to user queries and apply optimizations for better cost control.

In this post, we demonstrate how to use and interpret EXPLAIN and EXPLAIN ANALYZE statements to improve Athena query performance when querying multiple data sources.

Solution overview

To demonstrate using EXPLAIN and EXPLAIN ANALYZE statements, we use the following services and resources:

Athena uses the AWS Glue Data Catalog to store and retrieve table metadata for the Amazon S3 data in your AWS account. The table metadata lets the Athena query engine know how to find, read, and process the data that you want to query. We use Athena data source connectors to connect to data sources external to Amazon S3.

Prerequisites

To deploy the CloudFormation template, you must have the following:

Provision resources with AWS CloudFormation

To deploy the CloudFormation template, complete the following steps:

  1. Choose Launch Stack.
  2. Follow the prompts on the AWS CloudFormation console to create the stack.
  3. Note the key-value pairs on the stack’s Outputs tab.

You use these values when configuring the Athena data source connectors.

The CloudFormation template creates the following resources:

  • S3 buckets to store data and act as temporary spill buckets for Lambda
  • AWS Glue Data Catalog tables for the data in the S3 buckets
  • A DynamoDB table and Amazon RDS for MySQL tables, which are used to join multiple tables from different sources
  • A VPC, subnets, and endpoints, which are needed for Amazon RDS for MySQL and DynamoDB

The following figure shows the high-level data model for the data load.

Create the DynamoDB data source connector

To create the DynamoDB connector for Athena, complete the following steps:

  1. On the Athena console, choose Data sources in the navigation pane.
  2. Choose Create data source.
  3. For Data sources, select Amazon DynamoDB.
  4. Choose Next.
  5. For Data source name, enter DDB.
  6. For Lambda function, choose Create Lambda function.

This opens a new tab in your browser.

  7. For Application name, enter AthenaDynamoDBConnector.
  8. For SpillBucket, enter the value from the CloudFormation stack for AthenaSpillBucket.
  9. For AthenaCatalogName, enter dynamodb-lambda-func.
  10. Leave the remaining values at their defaults.
  11. Select I acknowledge that this app creates custom IAM roles and resource policies.
  12. Choose Deploy.

You’re returned to the Connect data sources section on the Athena console.

  13. Choose the refresh icon next to Lambda function.
  14. Choose the Lambda function you just created (dynamodb-lambda-func).
  15. Choose Next.
  16. Review the settings and choose Create data source.
  17. If you haven’t already set up the Athena query results location, choose View settings on the Athena query editor page.
  18. Choose Manage.
  19. For Location of query result, browse to the S3 bucket specified for the Athena spill bucket in the CloudFormation template.
  20. Add Athena-query to the S3 path.
  21. Choose Save.
  22. In the Athena query editor, for Data source, choose DDB.
  23. For Database, choose default.

You can now explore the schema for the sportseventinfo table; the data is the same as in DynamoDB.

  24. Choose the options icon for the sportseventinfo table and choose Preview Table.

Create the Amazon RDS for MySQL data source connector

Now let’s create the connector for Amazon RDS for MySQL.

  1. On the Athena console, choose Data sources in the navigation pane.
  2. Choose Create data source.
  3. For Data sources, select MySQL.
  4. Choose Next.
  5. For Data source name, enter MySQL.
  6. For Lambda function, choose Create Lambda function.
  7. For Application name, enter AthenaMySQLConnector.
  8. For SecretNamePrefix, enter AthenaMySQLFederation.
  9. For SpillBucket, enter the value from the CloudFormation stack for AthenaSpillBucket.
  10. For DefaultConnectionString, enter the value from the CloudFormation stack for MySQLConnection.
  11. For LambdaFunctionName, enter mysql-lambda-func.
  12. For SecurityGroupIds, enter the value from the CloudFormation stack for RDSSecurityGroup.
  13. For SubnetIds, enter the value from the CloudFormation stack for RDSSubnets.
  14. Select I acknowledge that this app creates custom IAM roles and resource policies.
  15. Choose Deploy.
  16. On the Lambda console, open the function you created (mysql-lambda-func).
  17. On the Configuration tab, under Environment variables, choose Edit.
  18. Choose Add environment variable.
  19. Enter a new key-value pair:
    • For Key, enter MYSQL_connection_string.
    • For Value, enter the value from the CloudFormation stack for MySQLConnection.
  20. Choose Save.
  21. Return to the Connect data sources section on the Athena console.
  22. Choose the refresh icon next to Lambda function.
  23. Choose the Lambda function you created (mysql-lambda-func).
  24. Choose Next.
  25. Review the settings and choose Create data source.
  26. In the Athena query editor, for Data Source, choose MYSQL.
  27. For Database, choose sportsdata.
  28. Choose the options icon by the tables and choose Preview Table to examine the data and schema.

In the following sections, we demonstrate different ways to optimize our queries.

Optimal join order using EXPLAIN plan

A join is a basic SQL operation to query data on multiple tables using relations on matching columns. Join operations affect how much data is read from a table, how much data is transferred to the intermediate stages through networks, and how much memory is needed to build up a hash table to facilitate a join.

If you have multiple join operations and the joined tables aren’t in the correct order, you may experience performance issues. To demonstrate this, we use the following tables from different sources and join them in a certain order. Then we observe the query runtime and improve performance by using the EXPLAIN feature of Athena, which provides suggestions for optimizing the query.

The CloudFormation template you ran earlier loaded data into the following services:

AWS Storage Table Name Number of Rows
Amazon DynamoDB sportseventinfo 657
Amazon S3 person 7,025,585
Amazon S3 ticketinfo 2,488

Let’s construct a query to find everyone who participated in the event, by type of ticket. The query with the following join order took approximately 7 minutes to complete:

SELECT t.id AS ticket_id, 
e.eventid, 
p.first_name 
FROM 
"DDB"."default"."sportseventinfo" e, 
"AwsDataCatalog"."athenablog"."person" p, 
"AwsDataCatalog"."athenablog"."ticketinfo" t 
WHERE 
t.sporting_event_id = cast(e.eventid as double) 
AND t.ticketholder_id = p.id

Now let’s use EXPLAIN on the query to see its run plan. We use the same query as before, but add explain (TYPE DISTRIBUTED):

EXPLAIN (TYPE DISTRIBUTED)
SELECT t.id AS ticket_id, 
e.eventid, 
p.first_name 
FROM 
"DDB"."default"."sportseventinfo" e, 
"AwsDataCatalog"."athenablog"."person" p, 
"AwsDataCatalog"."athenablog"."ticketinfo" t 
WHERE 
t.sporting_event_id = cast(e.eventid as double) 
AND t.ticketholder_id = p.id

The following screenshot shows our output.

Notice the cross-join in Fragment 1. The joins are converted to a Cartesian product for each table, where every record in a table is compared to every record in another table. Therefore, this query takes a significant amount of time to complete.

To optimize our query, we can rewrite it by reordering the joined tables as sportseventinfo first, ticketinfo second, and person last. The reason is that the WHERE clause, which is converted to a JOIN ON clause during the query planning stage, doesn’t define a join relationship between the person table and the sportseventinfo table. Therefore, the query plan generator converted the join type to a cross-join (a Cartesian product), which is less efficient. Reordering the tables aligns the WHERE clause with an INNER JOIN, which satisfies the JOIN ON clause, and the runtime is reduced from 7 minutes to 10 seconds.

The code for our optimized query is as follows:

SELECT t.id AS ticket_id, 
e.eventid, 
p.first_name 
FROM 
"DDB"."default"."sportseventinfo" e, 
"AwsDataCatalog"."athenablog"."ticketinfo" t, 
"AwsDataCatalog"."athenablog"."person" p 
WHERE 
t.sporting_event_id = cast(e.eventid as double) 
AND t.ticketholder_id = p.id

The following is the EXPLAIN output of our query after reordering the join clause:

EXPLAIN (TYPE DISTRIBUTED) 
SELECT t.id AS ticket_id, 
e.eventid, 
p.first_name 
FROM 
"DDB"."default"."sportseventinfo" e, 
"AwsDataCatalog"."athenablog"."ticketinfo" t, 
"AwsDataCatalog"."athenablog"."person" p 
WHERE t.sporting_event_id = cast(e.eventid as double) 
AND t.ticketholder_id = p.id

The following screenshot shows our output.

The cross-join changed to an INNER JOIN with join on columns (eventid, id, ticketholder_id), which results in the query running faster. The join between the ticketinfo and person tables was converted to the PARTITION distribution type, where both the left and right tables are hash-partitioned across all worker nodes due to the size of the person table. The join between the sportseventinfo and ticketinfo tables was converted to the REPLICATED distribution type, where one table is hash-partitioned across all worker nodes and the other table is replicated to all worker nodes to perform the join operation.

For more information about how to analyze these results, refer to Understanding Athena EXPLAIN statement results.

As a best practice, we recommend having a JOIN statement along with an ON clause, as shown in the following code:

SELECT t.id AS ticket_id, 
e.eventid, 
p.first_name 
FROM 
"AwsDataCatalog"."athenablog"."person" p 
JOIN "AwsDataCatalog"."athenablog"."ticketinfo" t ON t.ticketholder_id = p.id 
JOIN "ddb"."default"."sportseventinfo" e ON t.sporting_event_id = cast(e.eventid as double)

Also, as a best practice when you join two tables, specify the larger table on the left side of the join and the smaller table on the right side. Athena distributes the table on the right to worker nodes, and then streams the table on the left to do the join. If the table on the right is smaller, less memory is used and the query runs faster.

In the following sections, we present examples of how to optimize pushdowns for filter predicates and projection filter operations for the Athena data source using EXPLAIN ANALYZE.

Pushdown optimization for the Athena connector for Amazon RDS for MySQL

A pushdown is an optimization to improve the performance of a SQL query by moving its processing as close to the data as possible. Pushdowns can drastically reduce SQL statement processing time by filtering data before transferring it over the network and filtering data before loading it into memory. The Athena connector for Amazon RDS for MySQL supports pushdowns for filter predicates and projection pushdowns.

The following table summarizes the services and tables we use to demonstrate a pushdown using Aurora MySQL.

Table Name Number of Rows Size in KB
player_partitioned 5,157 318.86
sport_team_partitioned 62 5.32

We use the following query as an example of a filtering predicate and projection filter:

SELECT full_name,
name 
FROM "sportsdata"."player_partitioned" a 
JOIN "sportsdata"."sport_team_partitioned" b ON a.sport_team_id=b.id 
WHERE a.id='1.0'

This query selects the players and their team based on their ID. It serves as an example of both filter operations in the WHERE clause and projection because it selects only two columns.

We use EXPLAIN ANALYZE to get the cost of running this query:

EXPLAIN ANALYZE 
SELECT full_name,
name 
FROM "MYSQL"."sportsdata"."player_partitioned" a 
JOIN "MYSQL"."sportsdata"."sport_team_partitioned" b ON a.sport_team_id=b.id 
WHERE a.id='1.0'

The following screenshot shows the output in Fragment 2 for the player_partitioned table, in which we observe that the connector successfully pushed the filter down to the source side, so it scans only one record out of the 5,157 records in the table. The output also shows that the query scans only two columns (full_name as the projection column and sport_team_id as the join column) and uses SELECT and JOIN, which indicates the projection pushdown is successful. This helps reduce the amount of data scanned when using Athena data source connectors.

Now let’s look at the conditions in which a filter predicate pushdown doesn’t work with Athena connectors.

LIKE statement in filter predicates

We start with the following example query to demonstrate using the LIKE statement in filter predicates:

SELECT * 
FROM "MYSQL"."sportsdata"."player_partitioned" 
WHERE first_name LIKE '%Aar%'

We then add EXPLAIN ANALYZE:

EXPLAIN ANALYZE 
SELECT * 
FROM "MYSQL"."sportsdata"."player_partitioned" 
WHERE first_name LIKE '%Aar%'

The EXPLAIN ANALYZE output shows that the query performs a full table scan (scanning the player_partitioned table, which contains 5,157 records) even though only 30 records match the %Aar% condition in the WHERE clause. Therefore, the data scanned equals the complete table size despite the WHERE clause.

We can optimize the same query by selecting only the required columns:

EXPLAIN ANALYZE 
SELECT sport_team_id,
full_name 
FROM "MYSQL"."sportsdata"."player_partitioned" 
WHERE first_name LIKE '%Aar%'

From the EXPLAIN ANALYZE output, we can observe that the connector supports the projection filter pushdown, because we select only two columns. This brought the data scanned down to half of the table size.

OR statement in filter predicates

We start with the following query to demonstrate using the OR statement in filter predicates:

SELECT id,
first_name 
FROM "MYSQL"."sportsdata"."player_partitioned" 
WHERE first_name = 'Aaron' OR id ='1.0'

We use EXPLAIN ANALYZE with the preceding query as follows:

EXPLAIN ANALYZE 
SELECT * 
FROM 
"MYSQL"."sportsdata"."player_partitioned" 
WHERE first_name = 'Aaron' OR id ='1.0'

Similar to the LIKE statement, the following output shows that the query scanned the whole table instead of pushing down only the records that matched the WHERE clause. This query outputs only 16 records, but the data scanned indicates a complete scan.

Pushdown optimization for the Athena connector for DynamoDB

For our example using the DynamoDB connector, we use the following data:

Table Number of Rows Size in KB
sportseventinfo 657 85.75

Let’s test the filter predicate and projection filter operation for our DynamoDB table using the following query, which tries to get all the events and sports for a given location. We use EXPLAIN ANALYZE for the query as follows:

EXPLAIN ANALYZE 
SELECT EventId,
Sport 
FROM "DDB"."default"."sportseventinfo" 
WHERE Location = 'Chase Field'

The output of EXPLAIN ANALYZE shows that the filter predicate retrieved only 21 records, and the projection filter selected only two columns to push down to the source. Therefore, the data scanned for this query is less than the table size.

Now let’s see where filter predicate pushdown doesn’t work. If you apply the TRIM() function to the Location column in the WHERE clause and then filter, the filter predicate pushdown optimization doesn’t apply, but the projection filter optimization still does. See the following code:

EXPLAIN ANALYZE 
SELECT EventId,
Sport 
FROM "DDB"."default"."sportseventinfo" 
WHERE trim(Location) = 'Chase Field'

The output of EXPLAIN ANALYZE for this query shows that the query scans all the rows but is still limited to only two columns, which shows that the filter predicate pushdown doesn’t work when the TRIM function is applied.

We’ve seen from the preceding examples that the Athena data source connectors for Amazon RDS for MySQL and DynamoDB do support filter predicate and projection pushdown optimization, but we also saw that operations such as LIKE, OR, and TRIM in the filter predicate prevent pushdown to the source. Therefore, if you encounter unexpected charges in your federated Athena queries, we recommend running EXPLAIN ANALYZE on the query to determine whether your Athena connector supports the pushdown operation or not.
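If you want to build this check into your own tooling, you can submit EXPLAIN ANALYZE statements through the Athena API. The following is a minimal sketch using the AWS SDK for Python (Boto3); the workgroup and query results location are placeholders.

import boto3

athena = boto3.client("athena")

# Submit an EXPLAIN ANALYZE statement to Athena
# (workgroup and query results location are placeholders)
response = athena.start_query_execution(
    QueryString="""
        EXPLAIN ANALYZE
        SELECT EventId, Sport
        FROM "DDB"."default"."sportseventinfo"
        WHERE Location = 'Chase Field'
    """,
    WorkGroup="primary",
    ResultConfiguration={"OutputLocation": "s3://your-athena-results-bucket/Athena-query/"},
)
print(response["QueryExecutionId"])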

Please note that running EXPLAIN ANALYZE incurs cost because it scans the data.

Conclusion

In this post, we showcased how to use EXPLAIN and EXPLAIN ANALYZE to analyze Athena SQL queries for data sources on Amazon S3 and Athena federated SQL queries for data sources like DynamoDB and Amazon RDS for MySQL. You can use these examples to optimize your own queries, which can also result in cost savings.


About the Authors

Nishchai JM is an Analytics Specialist Solutions Architect at Amazon Web Services. He specializes in building big data applications and helping customers modernize their applications on the cloud. He thinks data is the new oil and spends most of his time deriving insights from it.

Varad Ram is a Senior Solutions Architect at Amazon Web Services. He likes to help customers adopt cloud technologies and is particularly interested in artificial intelligence. He believes deep learning will power future technology growth. In his spare time, he likes to be outdoors with his daughter and son.

Use direct service integrations to optimize your architecture

Post Syndicated from Jerome Van Der Linden original https://aws.amazon.com/blogs/architecture/use-direct-service-integrations-to-optimize-your-architecture/

When designing an application, you must integrate and combine several AWS services in the most optimized way for an effective and efficient architecture:

  • Optimize for performance by reducing the latency between services
  • Optimize for cost, operability, and sustainability by avoiding unnecessary components and reducing the workload footprint
  • Optimize for resiliency by removing potential points of failure
  • Optimize for security by minimizing the attack surface

As stated in the Serverless Application Lens of the Well-Architected Framework, “If your AWS Lambda function is not performing custom logic while integrating with other AWS services, chances are that it may be unnecessary.” In addition, Amazon API Gateway, AWS AppSync, AWS Step Functions, Amazon EventBridge, and Lambda Destinations can directly integrate with a number of services. These optimizations can offer you more value and less operational overhead.

This blog post will show how to optimize an architecture with direct integration.

Workflow example and initial architecture

Figure 1 shows a typical workflow for the creation of an online bank account. The customer fills out a registration form with personal information and adds a picture of their ID card. The application then validates the ID and address, and checks whether a user by that name already exists. If everything checks out, a backend application is notified to create the account. Finally, the user is notified of successful completion.

Figure 1. Bank account application workflow

The workflow architecture is shown in Figure 2 (click on the picture to get full resolution).

Figure 2. Initial account creation architecture

This architecture contains 13 Lambda functions. If you look at the code on GitHub, you can see that:

Five of these Lambda functions are basic and perform simple operations:

Additional Lambda functions perform other tasks, such as verification and validation:

  • One function generates a presigned URL to upload ID card pictures to Amazon Simple Storage Service (Amazon S3)
  • One function uses the Amazon Textract API to extract information from the ID card
  • One function verifies the identity of the user against the information extracted from the ID card
  • One function performs a simple HTTP request to a third-party API to validate the address

Finally, four functions concern the WebSocket connection (connect, message, and disconnect) and notifications to the user.

Opportunities for improvement

If you further analyze the code of the five basic functions (see startWorkflow on GitHub, for example), you will notice that there are actually only three lines of fundamental code that start the workflow. The other 38 lines involve imports, input validation, error handling, logging, and tracing. Remember that all this code must be tested and maintained.

import os
import json
import boto3
from aws_lambda_powertools import Tracer
from aws_lambda_powertools import Logger
import re

logger = Logger()
tracer = Tracer()

sfn = boto3.client('stepfunctions')

PATTERN = re.compile(r"^arn:(aws[a-zA-Z-]*)?:states:[a-z]{2}((-gov)|(-iso(b?)))?-[a-z]+-\d{1}:\d{12}:stateMachine:[a-zA-Z0-9-_]+$")

if ('STATE_MACHINE_ARN' not in os.environ
    or os.environ['STATE_MACHINE_ARN'] is None
    or not PATTERN.match(os.environ['STATE_MACHINE_ARN'])):
    raise RuntimeError('STATE_MACHINE_ARN env var is not set or incorrect')

STATE_MACHINE_ARN = os.environ['STATE_MACHINE_ARN']

@logger.inject_lambda_context
@tracer.capture_lambda_handler
def handler(event, context):
    try:
        event['requestId'] = context.aws_request_id

        sfn.start_execution(
            stateMachineArn=STATE_MACHINE_ARN,
            input=json.dumps(event)
        )

        return {
            'requestId': event['requestId']
        }
    except Exception as error:
        logger.exception(error)
        raise RuntimeError('Internal Error - cannot start the creation workflow') from error

After running this workflow several times and reviewing the AWS X-Ray traces (Figure 3), we can see that it takes about 2–3 seconds when functions are warmed:

Figure 3. X-Ray traces when Lambda functions are warmed

But the process takes around 10 seconds with cold starts, as shown in Figure 4:

Figure 4. X-Ray traces when Lambda functions are cold

We use an asynchronous architecture to avoid waiting time for the user, as this can be a long process. We also use WebSockets to notify the user when it’s finished. This adds some complexity, new components, and additional costs to the architecture. Now let’s look at how we can optimize this architecture.

Improving the initial architecture

Direct integration with Step Functions

Step Functions can directly integrate with some AWS services, including DynamoDB, Amazon SQS, and EventBridge, and with more than 10,000 API actions from over 200 AWS services through SDK integrations. With these integrations, you can replace Lambda functions when they do not provide value. We recommend using Lambda functions to transform data, not to transport data from one service to another.

In our bank account creation use case, there are four Lambda functions we can replace with direct service integrations (see large arrows in Figure 5):

  • Query a DynamoDB table to search for a user
  • Send a message to an SQS queue when the extraction fails
  • Create the user in DynamoDB
  • Send an event on EventBridge to notify the backend

Figure 5. Lambda functions that can be replaced

It is not as clear that we need to replace the other Lambda functions. Here are some considerations:

  • To extract information from the ID card, we use Amazon Textract. It is available through the SDK integration in Step Functions. However, the API’s response provides too much information. We recommend using a library such as amazon-textract-response-parser to parse the result. For this, you’ll need a Lambda function.
  • The identity cross-check performs a simple comparison between the data provided in the web form and the one extracted in the ID card. We can perform this comparison in Step Functions using a Choice state and several conditions. If the business logic becomes more complex, consider using a Lambda function.
  • To validate the address, we query a third-party API. Step Functions cannot directly call a third-party HTTP endpoint, but because it’s integrated with API Gateway, we can create a proxy for this endpoint.

If you only need to retrieve data from an API or make a simple API call, use the direct integration. If you need to implement some logic, use a Lambda function.
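To make this concrete, the following is a minimal, hypothetical Amazon States Language (ASL) fragment for the "create the user in DynamoDB" step, expressed as a Python dictionary for consistency with the other samples in this post; the state name, table name, and item shape are assumptions, not the repository's actual definition.

import json

# Hypothetical ASL task: Step Functions writes the item to DynamoDB directly,
# with no Lambda function in between (table name and item shape are assumptions)
create_user_state = {
    "CreateUser": {
        "Type": "Task",
        "Resource": "arn:aws:states:::dynamodb:putItem",
        "Parameters": {
            "TableName": "accounts",
            "Item": {
                "requestId": {"S.$": "$.requestId"},
                "firstName": {"S.$": "$.firstName"}
            }
        },
        "End": True
    }
}

print(json.dumps(create_user_state, indent=2))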

Direct integration with API Gateway

API Gateway also provides service integrations. In particular, we can start the workflow without using a Lambda function. In the console, select the integration type “AWS Service”, the AWS service “Step Functions”, the action “StartExecution”, and the “POST” method, as shown in Figure 6.

Figure 6. API Gateway direct integration with Step Functions

After that, use a mapping template in the integration request to define the parameters as shown here:

{
  "stateMachineArn":"arn:aws:states:eu-central-1:123456789012:stateMachine: accountCreationWorkflow",
  "input":"$util.escapeJavaScript($input.json('$'))"
}

We can go further and remove the WebSocket APIs and their associated Lambda functions (connect, message, and disconnect). By using Synchronous Express Workflows and the StartSyncExecution API, we can start the workflow and wait for the result synchronously. API Gateway then returns the result of the workflow directly to the client.
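Behind API Gateway, this relies on the Step Functions StartSyncExecution API for Express Workflows. The following Boto3 sketch shows the same synchronous call made directly; the state machine ARN (reused from the mapping template above) and the input payload are placeholders.

import json
import boto3

sfn = boto3.client("stepfunctions")

# Synchronously run an Express Workflow and wait for its result
# (state machine ARN and input are placeholders)
result = sfn.start_sync_execution(
    stateMachineArn="arn:aws:states:eu-central-1:123456789012:stateMachine:accountCreationWorkflow",
    input=json.dumps({"firstName": "Jane", "lastName": "Doe"})
)
print(result["status"], result.get("output"))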

Final optimized architecture

After applying these optimizations, we have the updated architecture shown in Figure 7. It uses only two Lambda functions out of the initial 13. The rest have been replaced by direct service integrations or implemented in Step Functions.

Figure 7. Final optimized architecture

We were able to remove 11 Lambda functions and their associated fees. In this architecture, the cost is mainly driven by Step Functions, and the main price difference will be your use of Express Workflows instead of Standard Workflows. If you need to keep some Lambda functions, use AWS Lambda Power Tuning to configure your function correctly and benefit from the best price/performance ratio.

One of the main benefits of this architecture is performance. With the final workflow architecture, it now takes about 1.5 seconds when the Lambda functions are warmed and 3 seconds on cold starts (versus up to 10 seconds previously), as shown in Figure 8:

Figure 8. X-Ray traces for the final architecture

The process can now be synchronous. It reduces the complexity of the architecture and vastly improves the user experience.

An added benefit is that by reducing the overall complexity and removing the unnecessary Lambda functions, we have also reduced the risk of failures. These can be errors in the code, memory or timeout issues due to bad configuration, lack of permissions, network issues between components, and more. This increases the resiliency of the application and eases its maintenance.

Testing

Testability is an important consideration when building your workflow. Unit testing a Lambda function is straightforward, and you can use your preferred testing framework and validate methods. Adopting a hexagonal architecture also helps remove dependencies on the cloud.

When removing functions and using an approach with direct service integrations, you are by definition directly connected to the cloud. You still must verify that the overall process is working as expected, and validate these integrations.

You can achieve this kind of test locally using Step Functions Local and the recently announced Mocked Service Integrations. By mocking service integrations (for example, retrieving an item from DynamoDB), you can validate the different paths of your state machine.

You also have to perform integration tests, but this is true whether you use direct integrations or Lambda functions.

Conclusion

This post describes how to simplify your architecture and optimize for performance, resiliency, and cost by using direct integrations in Step Functions and API Gateway. Although many Lambda functions were reduced, some remain useful for handling more complex business logic and data transformation. Try this out now by visiting the GitHub repository.

For further reading:

Throttling a tiered, multi-tenant REST API at scale using API Gateway: Part 2

Post Syndicated from Nick Choi original https://aws.amazon.com/blogs/architecture/throttling-a-tiered-multi-tenant-rest-api-at-scale-using-api-gateway-part-2/

In Part 1 of this blog series, we demonstrated why tiering and throttling become necessary at scale for multi-tenant REST APIs, and explored tiering strategy and throttling with Amazon API Gateway.

In this post, Part 2, we will examine tenant isolation strategies at scale with API Gateway and extend the sample code from Part 1.

Enhancing the sample code

To enable this functionality in the sample code (Figure 1), we will make manual changes. First, create one API key for the Free Tier and five API keys for the Basic Tier. Currently, these API keys are private keys tied to your Amazon Cognito login, but we will make a further change in the backend business logic that promotes them to pooled resources. Note that all of these modifications are specific to this sample code’s implementation; the implementation and deployment of production code may be completely different.

Figure 1. Cloud architecture of the sample code

Next, in the business logic for the createKey() operation, find the AWS Lambda function in lambda/create_key.js. It appears like this:

function createKey(tableName, key, plansTable, jwt, rand, callback) {
  const pool = getPoolForPlanId( key.planId ) 
  if (!pool) {
    createSiloedKey(tableName, key, plansTable, jwt, rand, callback);
  } else {
    createPooledKey(pool, tableName, key, jwt, callback);
  }
}

The getPoolForPlanId() function searches for a pool of keys associated with the usage plan. If there is a pool, we “create” a kind of reference to the pooled resource, rather than a completely new key created directly by the API Gateway service. The file lambda/api_key_pools.js should initially be empty:

exports.apiKeyPools = [];

In effect, all usage plans have been treated as siloed keys up to now. To change that, populate the data structure with values from the six API keys that were created manually. You will have to look up the IDs of the API keys and usage plans that were created in API Gateway (Figures 2 and 3). Using the AWS console to navigate to API Gateway is the most intuitive way to do this.

Figure 2. A view of the AWS console when inspecting the ID for the Basic usage plan

Figure 3. A view of the AWS Console when looking up the API key value (not the ID)

When done, your code in lambda/api_key_pools.js should look like the following, but instead of ellipses (...), the IDs for the usage plans and API keys specific to your environment will appear.

exports.apiKeyPools = [{
    planName: "FreePlan",
    planId: "...",
    apiKeys: [ "..." ]
  },
  {
    planName: "BasicPlan",
    planId: "...",
    apiKeys: [ "...", "...", "...", "...", "..." ]
  }];

After making the code changes, run cdk deploy from the command line to update the Lambda functions. Because of how the system is implemented, this change only affects key creation and deletion. Updates affect only the user’s specific reference to the key, not the underlying resource managed by API Gateway.

When the web application is run now, it will look similar to before; tenants should not be aware of what tiering strategy they have been assigned. The only way to notice the difference would be to create two Free Tier keys, test them, and note that the value of the X-API-KEY header is unchanged between the two.

Now, you have a virtually unlimited number of users who can have API keys in the Free or Basic Tier. By keeping the Premium Tier siloed, you are subject to the 10,000-API-key maximum (less any keys allocated for the lower tiers). You may consider additional techniques to continue to scale, such as replicating your service in another AWS account.

Other production considerations

The sample code is minimal, and it illustrates just one aspect of scaling a software-as-a-service (SaaS) application. There are many other aspects to consider in a production setting, and we explore some of them in this section.

The throttled endpoint, GET /api, relies only on an API key for authorization, for demonstration purposes. For any production implementation, consider the authentication options for your REST APIs. You may explore and extend the sample to require authentication with Cognito, similar to the /admin/* endpoints in the sample code.

One API key for Free Tier access and five API keys for Basic Tier access are illustrative for sample code, but not representative of production deployments. Taking the API key service quota into consideration, business and technical decisions may be made to minimize the noisy neighbor effect, such as setting a blast radius upper threshold of 0.1% of all users. To satisfy that requirement, each tier would need to spread users across at least 1,000 API keys. The number of keys allocated to the Basic or Premium Tier would depend on market needs and pricing strategies. Additional allocations of keys could be held in reserve for troubleshooting, QA, tenant migrations, and key retirement.

In the planning phase of your solution, you will decide how many tiers to provide, how many usage plans are needed, and what throttle limits and quotas to apply. These decisions depend on your architecture and business.

To define API request limits, examine the system API Gateway is protecting and what load it can sustain. For example, if your service will scale up to 1,000 requests per second, it is possible to implement three tiers with a 10/50/40 split: the lowest tier shares one common API key with a 100 request per second limit; an intermediate tier has a pool of 25 API keys with a limit of 20 requests per second each; and the highest tier has a maximum of 10 API keys, each supporting 40 requests per second.
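The following snippet is a purely illustrative sanity check of that split:

# Requests per second available to each tier in the 10/50/40 example
free = 1 * 100      # one shared key at 100 requests per second
basic = 25 * 20     # 25 pooled keys at 20 requests per second each
premium = 10 * 40   # up to 10 siloed keys at 40 requests per second each

assert free + basic + premium == 1000
print(free, basic, premium)  # 100 500 400 -> a 10/50/40 split of 1,000 requests per second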

Metrics play a large role in continuously evolving your SaaS tiering strategy (Figure 4). They provide rich insights into how tenants are using the system. Tenant-aware and SaaS-wide metrics on throttling and quota limits can be used to assess the tiering in place, whether tenants’ requirements are being met, and whether the tenant usage profiles currently in use are still valid (Figure 5).

Figure 4. Tiering strategy example with 3 tiers and requests allocation per tier

Figure 5. An example SaaS metrics dashboard

API Gateway provides options for the level of granularity required, including detailed metrics and execution and access logging, to enable observability of your SaaS solution. Combining granular usage metrics with underlying resource consumption helps you manage an optimal experience for your tenants by tuning throttling levels and policies per method and per client.

Cleanup

To avoid incurring future charges, delete the resources. This can be done on the command line by typing:

cd ${TOP}/cdk
cdk destroy

cd ${TOP}/react
amplify delete

${TOP} is the topmost directory of the sample code. For the most up-to-date information, see the README.md file.

Conclusion

In this two-part blog series, we have reviewed the best practices and challenges of effectively guarding a tiered multi-tenant REST API hosted in AWS API Gateway. We also explored how throttling policy and quota management can help you continuously evaluate the needs of your tenants and evolve your tiering strategy to protect your backend systems from being overwhelmed by inbound traffic.

Further reading:

Throttling a tiered, multi-tenant REST API at scale using API Gateway: Part 1

Post Syndicated from Nick Choi original https://aws.amazon.com/blogs/architecture/throttling-a-tiered-multi-tenant-rest-api-at-scale-using-api-gateway-part-1/

Many software-as-a-service (SaaS) providers adopt throttling as a common technique to protect a distributed system from spikes of inbound traffic that might compromise reliability, reduce throughput, or increase operational cost. Multi-tenant SaaS systems have an additional concern of fairness; excessive traffic from one tenant needs to be selectively throttled without impacting the experience of other tenants. This is also known as “the noisy neighbor” problem. AWS itself enforces some combination of throttling and quota limits on nearly all its own service APIs. SaaS providers building on AWS should design and implement throttling strategies in all of their APIs as well.

In this two-part blog series, we will explore tiering and throttling strategies for multi-tenant REST APIs and review tenant isolation models with hands-on sample code. In part 1, we will look at why a tiering and throttling strategy is needed and show how Amazon API Gateway can help by showing sample code. In part 2, we will dive deeper into tenant isolation models as well as considerations for production.

We selected Amazon API Gateway for this architecture since it is a fully managed service that helps developers create, publish, maintain, monitor, and secure APIs. First, let’s focus on how Amazon API Gateway can be used to throttle REST APIs with fine granularity using Usage Plans and API Keys. Usage Plans define the thresholds beyond which throttling should occur. They also enable quotas, which set a maximum usage per day, week, or month. API Keys are identifiers for distinguishing traffic and determining which Usage Plans to apply for each request. We limit the scope of our discussion to REST APIs because the other protocols that API Gateway supports (WebSocket APIs and HTTP APIs) have different throttling mechanisms that do not employ Usage Plans or API Keys.

SaaS providers must balance minimizing the cost to serve with providing consistent quality of service for all tenants. They also need to ensure that one tenant’s activity does not affect the other tenants’ experience. Throttling and quotas are a key aspect of a tiering strategy and important for protecting your service at any scale. In practice, the impact of throttling policies and quota management is continuously monitored and evaluated as the tenant composition and behavior evolve over time.

Architecture Overview

Figure 1. Cloud architecture of the sample code

To get a firm foundation in the basics of throttling and quotas with API Gateway, we’ve provided sample code in AWS-Samples on GitHub. Not only does it provide a starting point to experiment with Usage Plans and API Keys in API Gateway, but we will also modify this code later to address the complexity that arises at scale. The sample code has two main parts: 1) a web frontend and 2) a serverless backend. The backend is a serverless architecture using Amazon API Gateway, AWS Lambda, Amazon DynamoDB, and Amazon Cognito. As Figure 1 illustrates, it implements one REST API endpoint, GET /api, that is protected with throttling and quotas. There are additional APIs under the /admin/* resource to provide read access to Usage Plans, and CRUD operations on API Keys.

All these REST endpoints can be tested with developer tools such as curl or Postman, but we’ve also provided a web application to help you get started. The web application illustrates how tenants might interact with the SaaS application to browse different tiers of service, purchase API Keys, and test them. The web application is implemented in React and uses the AWS Amplify CLI and SDKs.

Prerequisites

To deploy the sample code, you should have the following prerequisites:

For clarity, we’ll use the environment variable, ${TOP}, to indicate the top-most directory in the cloned source code or the top directory in the project when browsing through GitHub.

Detailed instructions on how to install the code are in the ${TOP}/INSTALL.md file. After installation, follow ${TOP}/WALKTHROUGH.md for step-by-step instructions to create a test key with a very small quota limit of 10 requests per day, and use the client to hit that limit. Watch for HTTP 429: Too Many Requests as the signal that your client has been throttled.

Figure 2: The web application (with browser developer tools enabled) shows that a quick succession of API calls starts returning an HTTP 429 after the quota for the day is exceeded.

Responsibilities of the Client to support Throttling

The client must provide an API Key in the header of the HTTP request, labeled “X-Api-Key”. If a resource in API Gateway has throttling enabled and that header is missing or invalid in the request, API Gateway rejects the request.

Important: API Keys are simple identifiers, not authorization tokens or cryptographic keys. API keys are for throttling and managing quotas for tenants only and not suitable as a security mechanism. There are many ways to properly control access to a REST API in API Gateway, and we refer you to the AWS documentation for more details as that topic is beyond the scope of this post.

Clients should always test for the response to any network call, and implement logic specific to an HTTP 429 response. The correct action is almost always “try again later.” Just how much later, and how many times before giving up, is application dependent. Common approaches include:

  • Retry – With simple retry, the client retries the request up to a configured maximum retry limit
  • Exponential backoff – Exponential backoff uses progressively larger wait times between retries for consecutive errors. Because the wait time can grow very quickly, a maximum delay and a maximum retry limit should be specified.
  • Jitter – Jitter adds a random amount of delay between retries to prevent large bursts by spreading out the request rate

The AWS SDKs are an example of a client-side implementation of these responsibilities. Each AWS SDK implements automatic retry logic that uses a combination of retry, exponential backoff, jitter, and a maximum retry limit.
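The following is a minimal sketch of this client-side retry logic in Python; the send_request callable is a placeholder for any HTTP call that sends the X-Api-Key header and returns a response with a status_code attribute.

import random
import time

def call_with_backoff(send_request, max_retries=5, base_delay=0.5, max_delay=30.0):
    """Retry a throttled request (HTTP 429) using exponential backoff with full jitter."""
    for attempt in range(max_retries + 1):
        response = send_request()  # placeholder: any call returning an object with status_code
        if response.status_code != 429:
            return response
        if attempt == max_retries:
            break
        # Exponential backoff capped at max_delay, with full jitter
        delay = random.uniform(0, min(max_delay, base_delay * (2 ** attempt)))
        time.sleep(delay)
    raise RuntimeError("Request still throttled after retries")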

SaaS Considerations: Tenant Isolation Strategies at Scale

While the sample code is a good start, the design has an implicit assumption that API Gateway will support as many API Keys as we have tenants. In fact, API Gateway has a quota on the number of API keys available per Region per account. If the sample code’s requirements are to support more than 10,000 tenants (or if tenants are allowed multiple keys), then the sample implementation is not going to scale, and we need to consider more scalable implementation strategies.

This is one instance of a general challenge with SaaS called “tenant isolation strategies.” We highly recommend reviewing the whitepaper ‘SaaS Tenant Isolation Strategies‘. A brief explanation here is that the one-resource-per-customer (or “siloed”) model is just one of many possible strategies to address tenant isolation. While the siloed model may be the easiest to implement and offers strong isolation, it offers no economy of scale, has high management complexity, and will quickly run into limits set by the underlying AWS services. Other models besides siloed include the pooled and bridged models. Again, we recommend the whitepaper for more details.

Figure 3. Tiered multi-tenant architectures often employ different tenant isolation strategies at different tiers. Our example is specific to API Keys, but the technique generalizes to storage, compute, and other resources.

In this example, we implement a range of tenant isolation strategies at different tiers of service. This allows us to protect against “noisy neighbors” at the highest tier, minimize the outlay of limited resources (namely, API Keys) at the lowest tier, and still provide an effective, bounded “blast radius” for noisy neighbors at the mid-tier.

A concrete development example helps illustrate how this can be implemented. Assume three tiers of service: Free, Basic, and Premium. One could create a single API Key that is a pooled resource among all tenants in the Free tier. At the other extreme, each Premium customer would get their own unique API Key, protecting Premium tier tenants from the “noisy neighbor” effect. In the middle, the Basic tenants would be evenly distributed across a fixed set of keys. This is not complete isolation for each tenant, but the impact of any one tenant is contained within a defined “blast radius.”
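
A minimal sketch of how a tenant-to-key lookup could implement these three strategies is shown below. The tier names, key values, and hashing scheme are illustrative assumptions, not part of the sample code.

import hashlib

# Illustrative key assignments; real key values would come from API Gateway
FREE_TIER_KEY = "free-pooled-key"
BASIC_TIER_KEYS = ["basic-key-1", "basic-key-2", "basic-key-3", "basic-key-4"]
PREMIUM_KEYS = {"tenant-42": "premium-key-tenant-42"}  # one key per Premium tenant

def api_key_for_tenant(tenant_id: str, tier: str) -> str:
    if tier == "Premium":
        # Siloed: each Premium tenant gets a dedicated key
        return PREMIUM_KEYS[tenant_id]
    if tier == "Basic":
        # Bridged: spread Basic tenants evenly across a fixed pool of keys,
        # bounding the blast radius of any single noisy neighbor
        index = int(hashlib.sha256(tenant_id.encode()).hexdigest(), 16) % len(BASIC_TIER_KEYS)
        return BASIC_TIER_KEYS[index]
    # Pooled: all Free tier tenants share one key
    return FREE_TIER_KEY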

In production, we recommend a more nuanced approach with additional considerations for monitoring and automation to continuously evaluate tiering strategy. We will revisit these topics in greater detail after considering the sample code.

Conclusion

In this post, we have reviewed how to effectively guard a tiered multi-tenant REST API hosted in Amazon API Gateway. We also explored how tiering and throttling strategies can influence tenant isolation models. In Part 2 of this blog series, we will dive deeper into tenant isolation models and gaining insights with metrics.

If you’d like to know more about the topic, the AWS Well-Architected SaaS Lens Performance Efficiency pillar dives deep into tenant tiers and providing differentiated levels of performance to each tier. It also provides best practices and resources to help you design for and reduce the impact of noisy neighbors in your SaaS solution.

To learn more about Serverless SaaS architectures in general, we recommend the AWS Serverless SaaS Workshop and the SaaS Factory Serverless SaaS reference solution that inspired it.

ICYMI: Serverless Q1 2022

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/icymi-serverless-q1-2022/

Welcome to the 16th edition of the AWS Serverless ICYMI (in case you missed it) quarterly recap. Every quarter, we share all the most recent product launches, feature enhancements, blog posts, webinars, Twitch live streams, and other interesting things that you might have missed!

Calendar

In case you missed our last ICYMI, check out what happened last quarter here.

AWS Lambda

Lambda now offers larger ephemeral storage for functions, up to 10 GB. Previously, the storage was set to 512 MB. There are several common use cases that can benefit from expanded temporary storage, including extract, transform, and load (ETL) jobs, machine learning inference, and data processing workloads. To see how to configure the amount of /tmp storage in AWS SAM, deploy this Serverless Land Pattern.

Ephemeral storage settings
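
Alternatively, here is a minimal sketch of raising the /tmp size on an existing function with boto3. The function name is a placeholder, and this assumes a recent boto3 version that includes the EphemeralStorage parameter on UpdateFunctionConfiguration.

import boto3

lambda_client = boto3.client("lambda")

# Raise the function's ephemeral /tmp storage to the 10 GB maximum (Size is in MB)
lambda_client.update_function_configuration(
    FunctionName="my-etl-function",   # placeholder function name
    EphemeralStorage={"Size": 10240},
)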

For Node.js developers, Lambda now supports ES Modules and top-level await for Node.js 14. This enables developers to use a wider range of JavaScript packages in functions. Top-level await, when used with Provisioned Concurrency, can also improve cold-start performance for functions with asynchronous initialization.

For .NET developers, Lambda now supports .NET 6 as both a managed runtime and container base image. You can now use new features of the runtime such as improved logging, simplified function definitions using top-level statements, and improved performance using source generators.

The Lambda console now allows you to share test events with other developers in your team, using granular IAM permissions. Previously, test events were only visible to the builder who created them. To learn about creating sharable test events, read this documentation.

Amazon EventBridge

Amazon EventBridge Schema Registry helps you create code bindings from event schemas for use directly in your preferred IDE. You can generate these code bindings for a schema by using the EventBridge console, APIs, or AWS SDK toolkits for Jetbrains (Intellij, PyCharm, Webstorm, Rider) and VS Code. This feature now supports Go, in addition to Java, Python, and TypeScript, and is available at no additional cost.

AWS Step Functions

Developers can test state machines locally using Step Functions Local, and the service recently announced mocked service integrations for local testing. This allows you to define sample output from AWS service integrations and combine them into test cases to validate workflow control. This new feature introduces a robust way to test state machines in isolation.

Amazon DynamoDB

Amazon DynamoDB now supports limiting the number of items processed in a PartiQL operation, using an optional parameter on each request. The service also increased default Service Quotas, which can help simplify the use of large numbers of tables. The per-account, per-Region quota increased from 256 to 2,500 tables.
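
A minimal sketch of the new option with boto3 is shown below. The table name and statement are placeholders, and this assumes an SDK version that exposes the Limit parameter on ExecuteStatement.

import boto3

dynamodb = boto3.client("dynamodb")

# Process at most 25 items in this PartiQL request; pagination continues via NextToken
response = dynamodb.execute_statement(
    Statement="SELECT * FROM Orders WHERE OrderStatus = 'SHIPPED'",  # placeholder table and filter
    Limit=25,
)
items = response["Items"]
next_token = response.get("NextToken")  # present if more items remain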

AWS AppSync

AWS AppSync added support for custom response headers, allowing you to define additional headers to send to clients in response to an API call. You can now use the new resolver utility $util.http.addResponseHeaders() to configure additional headers in the response for a GraphQL API operation.

Serverless blog posts

January

Jan 6 – Using Node.js ES modules and top-level await in AWS Lambda

Jan 6 – Validating addresses with AWS Lambda and the Amazon Location Service

Jan 20 – Introducing AWS Lambda batching controls for message broker services

Jan 24 – Migrating AWS Lambda functions to Arm-based AWS Graviton2 processors

Jan 31 – Using the circuit breaker pattern with AWS Step Functions and Amazon DynamoDB

Jan 31 – Mocking service integrations with AWS Step Functions Local

February

Feb 8 – Capturing client events using Amazon API Gateway and Amazon EventBridge

Feb 10 – Introducing AWS Virtual Waiting Room

Feb 14 – Building custom connectors using the Amazon AppFlow Custom Connector SDK

Feb 22 – Building TypeScript projects with AWS SAM CLI

Feb 24 – Introducing the .NET 6 runtime for AWS Lambda

March

Mar 6 – Migrating a monolithic .NET REST API to AWS Lambda

Mar 7 – Decoding protobuf messages using AWS Lambda

Mar 8 – Building a serverless image catalog with AWS Step Functions Workflow Studio

Mar 9 – Composing AWS Step Functions to abstract polling of asynchronous services

Mar 10 – Building serverless multi-Region WebSocket APIs

Mar 15 – Using organization IDs as principals in Lambda resource policies

Mar 16 – Implementing mutual TLS for Java-based AWS Lambda functions

Mar 21 – Running cross-account workflows with AWS Step Functions and Amazon API Gateway

Mar 22 – Sending events to Amazon EventBridge from AWS Organizations accounts

Mar 23 – Choosing the right solution for AWS Lambda external parameters

Mar 28 – Using larger ephemeral storage for AWS Lambda

Mar 29 – Using AWS Step Functions and Amazon DynamoDB for business rules orchestration

Mar 31 – Optimizing AWS Lambda function performance for Java

First anniversary of Serverless Land Patterns

Serverless Patterns Collection

The DA team launched the Serverless Patterns Collection in March 2021 as a repository of serverless examples that demonstrate integrating two or more AWS services. Each pattern uses an infrastructure as code (IaC) framework to automate the deployment. These can simplify the creation and configuration of the services used in your applications.

The Serverless Patterns Collection is both an educational resource to help developers understand how to join different services, and an aid for developers that are getting started with building serverless applications.

The collection has just celebrated its first anniversary. It now contains 239 patterns for CDK, AWS SAM, Serverless Framework, and Terraform, covering 30 AWS services. We have expanded example runtimes to include .NET, Java, Rust, Python, Node.js and TypeScript. We’ve served tens of thousands of developers in the first year and we’re just getting started.

Many thanks to our contributors and community. You can also contribute your own patterns.

Videos

YouTube: youtube.com/serverlessland

Serverless Office Hours – Tues 10 AM PT

Weekly live virtual office hours. In each session we talk about a specific topic or technology related to serverless and open it up to helping you with your real serverless challenges and issues. Ask us anything you want about serverless technologies and applications.

YouTube: youtube.com/serverlessland
Twitch: twitch.tv/aws

FooBar Serverless YouTube channel

The Developer Advocate team is delighted to welcome Marcia Villalba onboard. Marcia was an AWS Serverless Hero before joining AWS over two years ago, and she has created one of the most popular serverless YouTube channels. You can view all of Marcia’s videos at https://www.youtube.com/c/FooBar_codes.

AWS Summits

AWS Global Summits are free events that bring the cloud computing community together to connect, collaborate, and learn about AWS. This year, we have restarted in-person Summits at major cities around the world.

The next 4 Summits planned are Paris (April 12), San Francisco (April 20-21), London (April 27), and Madrid (May 4-5). To find and register for your nearest AWS Summit, visit the AWS Summits homepage.

Still looking for more?

The Serverless landing page has more information. The Lambda resources page contains case studies, webinars, whitepapers, customer stories, reference architectures, and even more Getting Started tutorials.

You can also follow the Serverless Developer Advocacy team on Twitter to see the latest news, follow conversations, and interact with the team.

How Net at Work built an email threat report system on AWS

Post Syndicated from Florian Mair original https://aws.amazon.com/blogs/architecture/how-net-at-work-built-an-email-threat-report-system-on-aws/

Emails are often used as an entry point for malicious software like trojan horses, rootkits, or encryption-based ransomware. The NoSpamProxy offering developed by Net at Work tackles this threat, providing secure and confidential email communication.

A subservice of NoSpamProxy called 32guards is responsible for threat reports of inbound and outbound emails. With the increasing number of NoSpamProxy customers, 32guards was found to have several limitations. 32guards was previously built on a relational database. But with the growth in traffic, this database was not able to keep up with storage demands and expected query performance. Further, the relational database schema was limiting the possibilities for complex pattern detection. The NoSpamProxy team decided to rearchitect the service based on the Lake House approach.

The goal was to move away from a one-size-fits-all approach for data analytics and integrate a data lake with purpose-built data stores, unified governance, and smooth data movement.

This post shows how Net at Work modernized their 32guards service, from a relational database to a fully serverless analytics solution. With adoption of the Well-Architected Analytics Lens best practices and the use of fully managed services, the 32guards team was able to build a production-ready application within six weeks.

Architecture for email threat reports and analytics

This section gives a walkthrough of the solution’s architecture, as illustrated in Figure 1.

Figure 1. 32guards threat reports architecture

1. The entry point is an Amazon API Gateway, which receives email metadata in JSON format from the NoSpamProxy fleet. The message contains information about the email in general, email attachments, and URLs in the email. As an example, a subset of the data is presented in JSON as follows:

{
  ...
  "Attachments": [
    {
      "Sha256Hash": "69FB43BD7CCFD79E162B638596402AD1144DD5D762DEC7433111FC88EDD483FE",
      "Classification": 0,
      "Filename": "test.ods.tar.gz",
      "DetectedMimeType": "application/tar+gzip",
      "Size": 5895
    }
  ],
  "Urls": [
    {
      "Url": "http://www.aarhhie.work/",
      "Classification": 0,
    },        {
      "Url": "http://www.netatwork.de/",
      "Classification": 0,
    },
    {
      "Url": "http://aws.amazon.com/",
      "Classification": 0,
    }
  ]
}

2. This JSON message is forwarded to an AWS Lambda function (called “frontend”), which takes care of the further downstream processing. There are two activities the Lambda function initiates:

  • Forwarding the record for real-time analysis/storage
  • Generating a threat report based on the information derived from the data stored in the indicators of compromises (IOCs) Amazon DynamoDB table

IOCs are patterns within the email metadata that are used to determine if emails are safe or not. For example, this could be for a suspicious file attachment or domain.

Threat report for suspicious emails

In the preceding JSON message, the attachments and URLs have been classified with “0” by the email service itself, which indicates that none of them look suspicious. The frontend Lambda function uses the vast number of IOCs stored in the DynamoDB table, plus heuristics, to determine any potential threats within the email. The use of DynamoDB enables fast lookup times to generate a threat report. For this example, the response to the API Gateway in step 2 looks like this:

{
  "ReportedOnUtc": "2021-10-14T14:33:34.5070945Z",
  "Reason": "realtimeSuspiciousOrganisationalDomain",
  "Identifier": "aarhhie.work",
  ...
}

This threat report shows that the domain “aarhhie.work” has been detected as suspicious. The report is used to determine further actions for the email, such as blocking.
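
As an illustration of that lookup, the following sketch shows how a frontend Lambda function might check a domain against an IOC item in DynamoDB. The table name, key schema, and report fields are assumptions for this example, not the actual 32guards implementation.

from datetime import datetime, timezone

import boto3

dynamodb = boto3.resource("dynamodb")
ioc_table = dynamodb.Table("iocs")  # assumed table name and key schema

def check_domain(domain: str):
    # Fast key lookup against the IOC table
    result = ioc_table.get_item(Key={"indicator": domain})
    item = result.get("Item")
    if item is None:
        return None  # no indicator found, nothing to report
    return {
        "ReportedOnUtc": datetime.now(timezone.utc).isoformat(),
        "Reason": item.get("reason", "realtimeSuspiciousOrganisationalDomain"),
        "Identifier": domain,
    }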

Real-time data processing

3. In the real-time analytics flow, the frontend Lambda function ingests email metadata into a data stream using Amazon Kinesis Data Streams. This is a massively scalable, serverless, and durable real-time data streaming service. Compared to a queue, streaming storage permits more than one consumer of the same data.

4. The first consumer is an Apache Flink application running in Amazon Kinesis Data Analytics. This application generates statistical metrics (for example, occurrences of the top-level domain “.work”). The output is stored in Apache Parquet format on Amazon S3. Parquet is a columnar storage format, in contrast to row-based formats such as CSV.

The second consumer of the streaming data is Amazon Kinesis Data Firehose. Kinesis Data Firehose is a fully managed solution to reliably load streaming data into data lakes, data stores, and analytics services. Within the 32guards service, Kinesis Data Firehose is used to store all email metadata into Amazon S3. The data is stored in Apache Parquet format, which makes queries more time and cost efficient.

IOC detection

Now that we have shown how data is ingested and threat reports are generated to respond quickly to requests, let’s look at how the IOCs are updated. These IOCs are used for generating the threat report within the “frontend” Lambda function. As attack vectors change over time, quickly analyzing the data for new threats is crucial to provide high-quality reports to the NoSpamProxy service.

The incoming email metadata is stored every few minutes in Amazon S3 by Kinesis Data Firehose. To query data directly in Amazon S3, Amazon Athena is used. Athena is a serverless query service that analyzes data stored in Amazon S3, by using standard SQL syntax.

5. To be able to query data in S3, Amazon Athena uses the AWS Glue Data Catalog, which contains the structure of the email metadata stored in the data lake. The data structure is derived from the data itself using AWS Glue crawlers. Other external downstream processing services, such as business intelligence applications, also use Amazon Athena to consume the data.

6. Athena queries are initiated on a predefined schedule to update or generate new IOCs. The results of these queries are stored in the DynamoDB table to enable fast lookup times for the “frontend” Lambda.
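
A simplified sketch of this refresh step with boto3 is shown below. The database, query, S3 output location, and table names are placeholders, and a production version would handle pagination and failures more carefully.

import time

import boto3

athena = boto3.client("athena")
dynamodb = boto3.resource("dynamodb")
ioc_table = dynamodb.Table("iocs")  # assumed IOC table

QUERY = "SELECT domain, count(*) AS hits FROM email_metadata GROUP BY domain HAVING count(*) > 1000"

def refresh_iocs():
    # Start the scheduled Athena query over the S3 data lake
    execution = athena.start_query_execution(
        QueryString=QUERY,
        QueryExecutionContext={"Database": "email_metadata_db"},                     # placeholder database
        ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},  # placeholder bucket
    )
    query_id = execution["QueryExecutionId"]

    # Wait for the query to finish (simplified polling)
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")

    # Write each result row to DynamoDB for fast lookups by the frontend Lambda
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"][1:]  # skip header row
    for row in rows:
        domain, hits = (col.get("VarCharValue") for col in row["Data"])
        ioc_table.put_item(Item={"indicator": domain, "reason": "highVolumeDomain", "hits": int(hits)})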

Conclusion

In this blog post, we showed how Net at Work modernized their 32guards service within their NoSpamProxy product. The previous architecture used a relational database to ingest and store email metadata. This database was running into performance and storage issues and had to be redesigned into a more performant and scalable architecture.

Amazon S3 is used as the storage layer, which can scale up to exabytes of data. With Amazon Athena as the query engine, there is no need to operate a high-performance database cluster, as compute and storage is separated. By using Amazon Kinesis Data Streams and Amazon Kinesis Data Analytics, valuable insight can be generated in real time, and acted upon more quickly.

As a serverless, fully managed solution, the 32guards service has a cost footprint that is as much as 50% lower and requires less maintenance. By moving away from a relational database model, the query runtimes decrease significantly. You can now conduct analyses that were not feasible before.

Interested in the NoSpamProxy? Read more about NoSpamProxy or sign up for a free trial.

Looking for more architecture content? AWS Architecture Center provides reference architecture diagrams, vetted architecture solutions, Well-Architected best practices, patterns, icons, and more!

Using AWS Step Functions and Amazon DynamoDB for business rules orchestration

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/using-aws-step-functions-and-amazon-dynamodb-for-business-rules-orchestration/

This post is written by Vijaykumar Pannirselvam, Cloud Consultant, Sushant Patil, Cloud Consultant, and Kishore Dhamodaran, Senior Solution Architect.

A Business Rules Engine (BRE) is used in enterprises to manage business-critical decisions. The logic or rules used to make such decisions can vary in complexity. A finance department may have a basic rule that requires director approval for any purchase over a certain dollar amount. A mortgage company may need to run complex rules based on inputs (for example, credit score, debt-to-income ratio, down payment) to make an approval decision for a loan.

Decoupling these rules from application logic provides agility to your rules management, since business rules may often change while your application may not. It can also provide standardization across your enterprise, so every department can communicate with the same taxonomy.

As part of migrating their workloads, some enterprises consider replacing their commercial rules engine with cloud-native and open-source alternatives. The motivation for such a move stems from several factors, such as simplifying the architecture, cost, security considerations, or vendor support.

Many of these commercial rules engines come as part of a BPMS offering that provides orchestration capabilities for rules execution. For a successful migration to the cloud using an open-source rules engine management system, you need an orchestration capability to manage incoming rule requests, audit the rules, and track exceptions.

This post showcases an orchestration framework that allows you to use an open-source rules engine. It uses the Drools rules engine to build a set of rules for calculating insurance premiums based on the properties of Car and Person objects. The solution uses AWS Step Functions, AWS Lambda, Amazon API Gateway, Amazon DynamoDB, and the open-source Drools rules engine. You can swap the rules engine, provided you can manage it in the AWS Cloud environment and expose it as an API.

Solution overview

The following diagram shows the solution architecture.

Solution architecture

The solution comprises:

  1. API Gateway – a fully managed service that makes it easier to create, publish, maintain, monitor, and secure APIs at any scale for API consumers. API Gateway helps you manage traffic to backend systems, in this case Step Functions, which orchestrates the execution of tasks. For the REST API use-case, you can also set up a cache with customizable keys and time-to-live in seconds for your API data to avoid hitting your backend services for each request.
  2. Step Functions – a low-code service to orchestrate the multiple steps involved in accomplishing tasks. Step Functions uses the finite-state machine (FSM) model, relying on defined states and transitions to complete the tasks. The diagram depicts three states: Audit Request, Execute Ruleset, and Audit Response, which run sequentially. You can add additional states and transitions, such as validating incoming payloads or branching out parallel execution of the states.
  3. Drools rules engine Spring Boot application – the runtime component of rule execution. You set up the Drools rules engine Spring Boot application as an Apache Maven Docker project with Drools Maven dependencies. You then deploy the Drools rules engine Docker image to an Amazon Elastic Container Registry (Amazon ECR) repository, create an AWS Fargate cluster, and create an Amazon Elastic Container Service (Amazon ECS) service. The service launches Amazon ECS tasks and maintains the desired count. An Application Load Balancer distributes the traffic evenly to all running containers.
  4. Lambda – a serverless execution environment giving you an ability to interact with the Drools Engine and a persistence layer for rule execution audit functions. The Lambda component provides the audit function required to persist the incoming requests and outgoing responses in DynamoDB. Apart from the audit function, Lambda is also used to invoke the service exposed by the Drools Spring Boot application.
  5. DynamoDB – a fully managed and highly scalable key-value store used to persist rule execution information, such as the request and response payloads. DynamoDB provides the persistence layer for the incoming request JSON payload and the outgoing response JSON payload. The audit Lambda function invokes the DynamoDB put_item() method when it receives the request or response event from Step Functions, as sketched in the example after this list. The DynamoDB table rule_execution_audit has an entry for every request and response associated with the incoming request-id originated by the application (upstream).
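
A minimal sketch of that audit function is shown below. The event shape and attribute names are assumptions for illustration; only the rule_execution_audit table name comes from the solution.

import json
import uuid
from datetime import datetime, timezone

import boto3

dynamodb = boto3.resource("dynamodb")
audit_table = dynamodb.Table("rule_execution_audit")

def lambda_handler(event, context):
    # Persist the request or response payload passed in by Step Functions
    audit_table.put_item(
        Item={
            "audit_id": str(uuid.uuid4()),
            "request_id": event["context"]["request_id"],      # assumed event shape
            "audit_type": event.get("audit_type", "REQUEST"),  # REQUEST or RESPONSE
            "payload": json.dumps(event),
            "recorded_at": datetime.now(timezone.utc).isoformat(),
        }
    )
    return event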

Drools rules engine implementation

The Drools rules engine separates the business rules from the business processes. You use DRL (Drools Rule Language) by defining business rules as .drl text files. You define model objects to build the rules.

The model objects are POJOs (Plain Old Java Objects) defined using Eclipse with the Drools plugin installed. You should have some level of knowledge about building rules and executing them using the Drools rules engine. The diagram below describes the functions of this component.

Drools process

You define the following rules in the .drl file as part of the GitHub repo. The purpose of these rules is to evaluate the driver premium based on the model objects provided as input. The inputs are Car and Driver objects, and the output is the Policy object, which has the premium calculated based on the criteria defined in the rules:

rule "High Risk"
     when     
         $car : Car(style == "SPORTS", color == "RED") 
         $policy : Policy() 
         and $driver : Driver ( age < 21 )                             
     then
         System.out.println(drools.getRule().getName() +": rule fired");          
         modify ($policy) { setPremium(increasePremiumRate($policy, 20)) };
 end
 
 rule "Med Risk"
     when     
         $car : Car(style == "SPORTS", color == "RED") 
         $policy : Policy() 
         and $driver : Driver ( age > 21 )                             
     then
         System.out.println(drools.getRule().getName() +": rule fired");          
         modify ($policy) { setPremium(increasePremiumRate($policy, 10)) };
 end
 
 
 function double increasePremiumRate(Policy pol, double percentage) {
     return (pol.getPremium() + pol.getPremium() * percentage / 100);
 }
 

Once the rules are defined, you define a RestController that takes input parameters and evaluates the above rules. The code snippet below is a POST method defined in the controller, which handles the requests and sends the response to the caller.

@PostMapping(value ="/policy/premium", consumes = {MediaType.APPLICATION_JSON_VALUE, MediaType.APPLICATION_XML_VALUE }, produces = {MediaType.APPLICATION_JSON_VALUE, MediaType.APPLICATION_XML_VALUE})
    public ResponseEntity<Policy> getPremium(@RequestBody InsuranceRequest requestObj) {
        
        System.out.println("handling request...");
        
        Car carObj = requestObj.getCar();        
        Car carObj1 = new Car(carObj.getMake(),carObj.getModel(),carObj.getYear(), carObj.getStyle(), carObj.getColor());
        System.out.println("###########CAR##########");
        System.out.println(carObj1.toString());
        
        System.out.println("###########POLICY##########");        
        Policy policyObj = requestObj.getPolicy();
        Policy policyObj1 = new Policy(policyObj.getId(), policyObj.getPremium());
        System.out.println(policyObj1.toString());
            
        System.out.println("###########DRIVER##########");    
        Driver driverObj = requestObj.getDriver();
        Driver driverObj1 = new Driver( driverObj.getAge(), driverObj.getName());
        System.out.println(driverObj1.toString());
        
        KieSession kieSession = kieContainer.newKieSession();
        kieSession.insert(carObj1);      
        kieSession.insert(policyObj1); 
        kieSession.insert(driverObj1);         
        kieSession.fireAllRules(); 
        printFactsMessage(kieSession);
        kieSession.dispose();
    
        
        return ResponseEntity.ok(policyObj1);
    }    

Prerequisites

Solution walkthrough

  1. Clone the project GitHub repository to your local machine, do a Maven build, and create a Docker image. The project contains Drools related folders needed to build the Java application.
    git clone https://github.com/aws-samples/aws-step-functions-business-rules-orchestration
    cd drools-spring-boot
    mvn clean install
    mvn docker:build
    
  2. Create an Amazon ECR private repository to host your Docker image.
    aws ecr create-repository --repository-name drools_private_repo --image-tag-mutability MUTABLE --image-scanning-configuration scanOnPush=false
  3. Tag the Docker image and push it to the Amazon ECR repository.
    aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <<INSERT ACCOUNT NUMBER>>.dkr.ecr.us-east-1.amazonaws.com
    docker tag drools-rule-app:latest <<INSERT ACCOUNT NUMBER>>.dkr.ecr.us-east-1.amazonaws.com/drools_private_repo:latest
    docker push <<INSERT ACCOUNT NUMBER>>.dkr.ecr.us-east-1.amazonaws.com/drools_private_repo:latest
    
  4. Deploy resources using AWS SAM:
    cd ..
    sam build
    sam deploy --guided

    SAM deployment output

Verifying the deployment

Verify the business rules execution and the orchestration components:

  1. Navigate to the API Gateway console, and choose the rules-stack API.
    API Gateway console
  2. Under Resources, choose POST, followed by TEST.
    Resource configuration
  3. Enter the following JSON under the Request Body section, and choose Test.

    {
      "context": {
        "request_id": "REQ-99999",
        "timestamp": "2021-03-17 03:31:51:40"
      },
      "request": {
        "driver": {
          "age": "18",
          "name": "Brian"
        },
        "car": {
          "make": "honda",
          "model": "civic",
          "year": "2015",
          "style": "SPORTS",
          "color": "RED"
        },
        "policy": {
          "id": "1231231",
          "premium": "300"
        }
      }
    }
    
  4. The response received shows results from the evaluation of the business rule “High Risk”, with the premium representing the percentage calculation in the rule definition. Try changing the request input to evaluate the “Med Risk” rule by modifying the age of the driver to 22 or higher:
    Sample response
  5. Optionally, you can verify the API using Postman. Get the endpoint information by navigating to the rule-stack API, followed by Stages in the navigation pane, then choosing either Dev or Stage.
  6. Enter the payload in the request body and choose Send:
    Postman UI
  7. The response received shows results from the evaluation of the business rule “High Risk”, with the premium representing the percentage calculation in the rule definition. Try changing the request input to evaluate the “Med Risk” rule by modifying the age of the driver to 22 or higher.
    Body JSON
  8. Observe the request and response audit logs. Navigate to the DynamoDB console. Under the navigation pane, choose Tables, then choose rule_execution_audit.
    DynamoDB console
  9. Under the Tables section in the navigation pane, choose Explore Items. Observe the individual audit logs by choosing the audit_id.
    Table audit item

Cleaning up

To avoid incurring ongoing charges, clean up the infrastructure by deleting the stack using the following command:

sam delete

SAM confirmations

Delete the Amazon ECR repository, and any other resources you created as a prerequisite for this exercise.

Conclusion

In this post, you learned how to leverage an orchestration framework using Step Functions, Lambda, DynamoDB, and API Gateway to build an API backed by an open-source Drools rules engine, running on a container. Try this solution for your cloud native business rules orchestration use-case.

For more serverless learning resources, visit Serverless Land.

Choosing the right solution for AWS Lambda external parameters

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/choosing-the-right-solution-for-aws-lambda-external-parameters/

This post is written by Thomas Moore, Solutions Architect, Serverless.

When using AWS Lambda to build serverless applications, customers often need to retrieve parameters from an external source at runtime. This allows you to share parameter values across multiple functions or microservices, providing a single source of truth for updates. A common example is retrieving database connection details from an external source and then using the retrieved hostname, user name, and password to connect to the database:

Lambda function retrieving database credentials from an external source

AWS provides a number of options to store parameter data, including AWS Systems Manager Parameter Store, AWS AppConfig, Amazon S3, and Lambda environment variables. This blog explores the different types of parameter data that you may need to store. I cover considerations for choosing the right parameter solution and how to retrieve and cache parameter data efficiently within the Lambda function execution environment.

Common use cases

Common parameter examples include:

  • Securely storing secret data, such as credentials or API keys.
  • Database connection details such as hostname, port, and credentials.
  • Schema data (for example, a structured JSON response).
  • TLS certificate for mTLS or JWT validation.
  • Email template.
  • Tenant configuration in a multitenant system.
  • Details of external AWS resources to communicate with such as an Amazon SQS queue URL, Amazon EventBridge event bus name, or AWS Step Functions ARN.

Key considerations

There are a number of key considerations when choosing the right solution for external parameter data.

  1. Cost – how much does it cost to store the data and retrieve it via an API call?
  2. Security – what encryption and fine-grained access control is required?
  3. Performance – what are the retrieval latency requirements?
  4. Data size – how much data is there to store and retrieve?
  5. Update frequency – how often does the parameter change and how does the function handle stale parameters?
  6. Access scope – do multiple functions or services access the parameter?

These considerations help to determine where to store the parameter data and how often to retrieve it.

For example, a 4KB parameter that updates hourly and is used by hundreds of functions needs to be optimized for low retrieval costs and high performance. Choosing a solution that supports low-cost API GET requests at a high transactions-per-second (TPS) rate would be better than one optimized for storing large amounts of data.

AWS service options

There are a number of AWS services available to store external parameter data.

Amazon S3

S3 is an object storage service offering 99.999999999% (11 9s) of data durability and virtually unlimited scalability at low cost. Objects can be up to 5 TB in size in any format, making S3 a good solution to store larger parameter data.

Amazon DynamoDB

Amazon DynamoDB is a fully managed, serverless, key-value NoSQL database designed for single-digit millisecond performance at any scale. Due to the high performance of this service, it’s a great place to store parameters when low retrieval latency is important.

AWS Secrets Manager

AWS Secrets Manager makes it easier to rotate, manage, and retrieve secret data. This makes it the ideal place to store sensitive parameters such as passwords and API keys.

AWS Systems Manager Parameter Store

Parameter Store provides a centralized store to manage configuration data. This data can be plaintext or encrypted using AWS Key Management Service (KMS). Parameters can be tagged and organized into hierarchies for simpler management. Parameter Store is a good default choice for general-purpose parameters in AWS. The standard version (no additional charge) can store parameters up to 4 KB in size and the advanced version (additional charges apply) up to 8 KB.

For a code example using Parameter Store for Lambda parameters, see the Serverless Land pattern.

AWS AppConfig

AppConfig is a capability of AWS Systems Manager to create, manage, and quickly deploy application configurations. AppConfig allows you to validate changes during roll-outs and automatically roll back, if there is an error. AppConfig deployment strategies help to manage configuration changes safely.

AppConfig also provides a Lambda extension to retrieve and locally cache configuration data. This results in fewer API calls and reduced function duration, reducing costs.

AWS Lambda environment variables

You can store parameter data as Lambda environment variables as part of the function’s version-specific configuration. Lambda environment variables are stored during function creation or updates. You can access these variables directly from your code without needing to contact an external source. Environment variables are ideal for parameter values that don’t need updating regularly and help make function code reusable across different environments. However, unlike the other options, values cannot be accessed centrally by multiple functions or services.
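
For example, a function could read connection details from its environment like this; the variable names are arbitrary placeholders.

import os

# Values are set on the function's configuration, not retrieved at runtime
DB_HOST = os.environ["DB_HOST"]
DB_PORT = int(os.environ.get("DB_PORT", "5432"))

def lambda_handler(event, context):
    # Use DB_HOST and DB_PORT to connect to the database...
    return {"host": DB_HOST, "port": DB_PORT}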

Lambda execution lifecycle

It is worth understanding the Lambda execution lifecycle, which has a number of stages. This helps to decide when to handle parameter retrieval within your Lambda code, including cache management.

Lambda execution lifecycle

When a Lambda function is invoked for the first time, or when Lambda is scaling to handle additional requests, an execution environment is created. The first phase in the execution environment’s lifecycle is initialization (Init), during which the code outside the main handler function runs. This is known as a cold start.

The execution environment can then be re-used for subsequent invocations. This means that the Init phase does not need to run again and only the main handler function code runs. This is known as a warm start.

An execution environment can only run a single invocation at a time. Concurrent invocations require additional execution environments. When a new execution environment is required, this starts a new Init phase, which runs the cold start process.

Caching and updates

Retrieving the parameter during Init

As Lambda execution environments are re-used, you can improve the performance and reduce the cost of retrieving an external parameter by caching the value. Writing the value to memory or the Lambda /tmp file system allows it to be available during subsequent invokes in the same execution environment.

This approach reduces API calls, as they are not made during every invocation. However, it can result in out-of-date parameters and potentially different values across concurrent execution environments.

The following Python example shows how to retrieve a Parameter Store value outside the Lambda handler function during the Init phase.

import boto3
ssm = boto3.client('ssm', region_name='eu-west-1')
parameter = ssm.get_parameter(Name='/my/parameter')
def lambda_handler(event, context):
    # My function code...
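
A middle ground, sketched below, is to cache the value in memory with a time-to-live and refresh it only when the cached copy is older than a threshold; the 300-second TTL is an arbitrary assumption.

import time

import boto3

ssm = boto3.client('ssm', region_name='eu-west-1')
CACHE_TTL_SECONDS = 300  # assumed refresh interval
_cached_value = None
_cached_at = 0.0

def get_cached_parameter(name):
    global _cached_value, _cached_at
    # Refresh from Parameter Store only when the cached value has expired
    if _cached_value is None or time.time() - _cached_at > CACHE_TTL_SECONDS:
        _cached_value = ssm.get_parameter(Name=name)['Parameter']['Value']
        _cached_at = time.time()
    return _cached_value

def lambda_handler(event, context):
    parameter = get_cached_parameter('/my/parameter')
    # My function code...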

Retrieving the parameter on every invocation

Another option is to retrieve the parameter during every invocation by making the API call inside the handler code. This keeps the value up to date, but can lead to higher retrieval costs and longer function durations due to the added API call during every invocation.

The following Python example shows this approach:

import boto3
ssm = boto3.client('ssm', region_name='eu-west-1')
def lambda_handler(event, context):
    parameter = ssm.get_parameter(Name='/my/parameter')
    # My function code...

Using AWS AppConfig Lambda extension

AppConfig allows you to retrieve and cache values from the service using a Lambda extension. The extension retrieves the values and makes them available via a local HTTP server. The Lambda function then queries the local HTTP server for the value. The AppConfig extension refreshes the values at a configurable poll interval, which defaults to 45 seconds. This improves performance and reduces costs, as the function only needs to make a local HTTP call.

The following Python code example shows how to access the cached parameters.

import urllib.request
def lambda_handler(event, context):
    url = f'http://localhost:2772/applications/application_name/environments/environment_name/configurations/configuration_name'
    config = urllib.request.urlopen(url).read()
    # My function code...

For caching secret values using a Lambda extension local HTTP cache and AWS Secrets Manager, see the AWS Prescriptive Guidance documentation.

Using Lambda Powertools for Python or Java

Lambda Powertools for Python or Lambda Powertools for Java contains utilities to manage parameter caching. You can configure the cache interval, which defaults to 5 seconds. Supported parameter stores include Secrets Manager, AWS Systems Manager Parameter Store, AppConfig, and DynamoDB. You also have the option to bring your own provider. The following example shows the Powertools for Python parameters utility retrieving a single value from Systems Manager Parameter Store.

from aws_lambda_powertools.utilities import parameters
def handler(event, context):
    value = parameters.get_parameter("/my/parameter")
    # My function code…

Security

Parameter security is a key consideration. You should evaluate encryption at rest, in-transit, private network access, and fine-grained permissions for each external parameter solution based on the use case.

All services highlighted in this post support server-side encryption at rest, and you can choose to use AWS KMS to manage your own keys. When accessing parameters using the AWS SDK and CLI tools, connections are encrypted in transit using TLS by default. You can also configure most SDK clients to enforce a minimum of TLS 1.2.

To access parameters from inside an Amazon Virtual Private Cloud (Amazon VPC) without internet access, you can use AWS PrivateLink and create a VPC endpoint for each service. All the services mentioned in this post support AWS PrivateLink connections.

Use AWS Identity and Access Management (IAM) policies to manage which users or roles can access specific parameters.

General guidance

This blog explores a number of considerations to make when using an external source for Lambda parameters. The correct solution is use-case dependent. There are some general guidelines when selecting an AWS service.

  • For general-purpose low-cost parameters, use AWS Systems Manager Parameter Store.
  • For single function, small parameters, use Lambda environment variables.
  • For secret values that require automatic rotation, use AWS Secrets Manager.
  • When you need a managed cache, use the AWS AppConfig Lambda extension or Lambda Powertools for Python/Java.
  • For items larger than 400 KB, use Amazon S3.
  • When access frequency is high, and low latency is required, use Amazon DynamoDB.

Conclusion

External parameters provide a central source of truth across distributed systems, allowing for efficient updates and code reuse. This blog post highlights a number of considerations when using external parameters with Lambda to help you choose the most appropriate solution for your use case.

Consider how you cache and reuse parameters inside the Lambda execution environment. Doing this correctly can help you reduce costs and improve the performance of your Lambda functions.

There are a number of services to choose from to store parameter data. These include DynamoDB, S3, Parameter Store, Secrets Manager, AppConfig, and Lambda environment variables. Each comes with a number of advantages, depending on the use case. This blog guidance, along with the AWS documentation and Service Quotas, can help you select the most appropriate service for your workload.

For more serverless learning resources, visit Serverless Land.

Mainframe offloading and modernization: Using mainframe data to build cloud native services with AWS

Post Syndicated from Malathi Pinnamaneni original https://aws.amazon.com/blogs/architecture/mainframe-offloading-and-modernization-using-mainframe-data-to-build-cloud-native-services-with-aws/

Many companies in the financial services and insurance industries rely on mainframes for their most business-critical applications and data. But mainframe workloads typically lack agility. This is one reason that organizations struggle to innovate, iterate, and pivot quickly to develop new applications or release new capabilities. Unlocking this mainframe data can be the first step in your modernization journey.

In this blog post, we will discuss some typical offloading patterns. Whether your goal is developing new applications using mainframe data or modernizing with the Strangler Fig Application pattern, you might want some guidance on how to begin.

Refactoring mainframe applications to the cloud

Refactoring mainframe applications to cloud-native services on AWS is a common industry pattern and a long-term goal for many companies to remain competitive. But this takes an investment of time, money, and organizational change management to realize the full benefits. We see customers start their modernization journey by offloading data from the mainframe to AWS to reduce risks and create new capabilities.

The mainframe data offloading patterns that we will discuss in this post use software services that facilitate data replication to Amazon Web Services (AWS):

  • File-based data synchronization
  • Change data capture
  • Event-sourced replication

Once data is liberated from the mainframe, you can develop new agile applications for deeper insights using analytics and machine learning (ML). You could create a microservices-based application or a voice-based mobile application. For example, if a bank could access their historical mainframe data to analyze customer behavior, they could develop a new solution based on profiles to use for loan recommendations.

The patterns we illustrate can be used as a reference to begin your modernization efforts with reduced risk. The long-term goal is to rewrite the mainframe applications and modernize them workload by workload.

Solution overview: Mainframe offloading and modernization

This figure shows the flow of data being replicated from mainframe using integration services and consumed in AWS

Figure 1. Mainframe offloading and modernization conceptual flow

Mainframe modernization: Architecture reference patterns

File-based batch integration

Modernization scenarios often require replicating files to AWS, or synchronizing between on-premises and AWS. Use cases include:

  • Analyzing current and historical data to enhance business analytics
  • Providing data for further processing on downstream or upstream dependent systems. This is necessary for exchanging data between applications running on the mainframe and applications running on AWS
This diagram shows a file-based integration pattern on how data can be replicated to AWS for interactive data analytics

Figure 2. File-based batch ingestion pattern for interactive data analytics

File-based batch integration – Batch ingestion for interactive data analytics (Figure 2)

  1. Data ingestion. In this example, we show how data can be ingested to Amazon S3 using AWS Transfer Family Services or AWS DataSync. Mainframe data is typically encoded in extended binary-coded decimal interchange code (EBCDIC) format. Prescriptive guidance exists to convert EBCDIC to ASCII format.
  2. Data transformation. Before moving data to AWS data stores, transformation of the data may be necessary to use it for analytics. AWS analytics services like AWS Glue and AWS Lambda can be used to transform the data. For large volume processing, use Apache Spark on AWS Elastic Map Reduce (Amazon EMR), or a custom Spring Boot application running on Amazon EC2 to perform these transformations. This process can be orchestrated using AWS Step Functions or AWS Data Pipeline.
  3. Data store. Data is transformed into a consumable format that can be stored in Amazon S3.
  4. Data consumption. You can use AWS analytics services like Amazon Athena for interactive ad-hoc query access, Amazon QuickSight for analytics, and Amazon Redshift for complex reporting and aggregations.
This diagram shows a file-based integration pattern on how data can be replicated to AWS for further processing by downstream systems

Figure 3. File upload to operational data stores for further processing

File-based batch integration – File upload to operational data stores for further processing (Figure 3)

  1. Using AWS Transfer Family, upload CSV files to Amazon S3.
  2. Once the files are uploaded, an S3 event notification can invoke an AWS Lambda function to load the data into Amazon Aurora. For low-latency data access requirements, you can use a scalable serverless import pattern with AWS Lambda and Amazon SQS to load into Amazon DynamoDB (see the sketch after this list).
  3. Once the data is in data stores, it can be consumed for further processing.
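
As a simplified sketch of step 2, the following Lambda function loads a small CSV object directly into DynamoDB when the S3 event notification fires. It omits the SQS buffering stage, and the table name and CSV column layout are assumptions for illustration.

import csv
import io

import boto3

s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("customer_records")  # assumed target table

def lambda_handler(event, context):
    # The S3 event notification carries the bucket and key of the uploaded CSV
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        # Batch-write each CSV row as a DynamoDB item (columns must include the table's key)
        with table.batch_writer() as batch:
            for row in csv.DictReader(io.StringIO(body)):
                batch.put_item(Item=row)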

Transactional replication-based integration (Figure 4)

Several modernization scenarios require continuous near-real-time replication of relational data to keep a copy of the data in the cloud. Change Data Capture (CDC) for near-real-time transactional replication works by capturing change log activity to drive changes in the target dataset. Use cases include:

  • Command Query Responsibility Segregation (CQRS) architectures that use AWS to service all read-only and retrieve functions
  • On-premises systems with tightly coupled applications that require a phased modernization
  • Real-time operational analytics
This diagram shows a transaction-based replication (CDC) integration pattern on how data can be replicated to AWS for building reporting and read-only functions

Figure 4. Transactional replication (CDC) pattern

  1. Partner CDC tools in the AWS Marketplace can be used to manage real-time data movement between the mainframe and AWS.
  2. You can use a fan-out pattern to read once from the mainframe to reduce processing requirements and replicate data to multiple data stores based on your requirements:
    • For low latency requirements, replicate to Amazon Kinesis Data Streams and use AWS Lambda to store in Amazon DynamoDB.
    • For critical business functionality with complex logic, use Amazon Aurora or Amazon Relational Database Service (RDS) as targets.
    • To build a data lake or to use the data as an intermediary for ETL processing, customers can replicate to S3 as the target.
  3. Once the data is in AWS, customers can build agile microservices for read-only functions.

Message-oriented middleware (event sourcing) integration (Figure 5)

With message-oriented middleware (MOM) systems like IBM MQ on mainframe, several modernization scenarios require integrating with cloud-based streaming and messaging services. These act as a buffer to keep your data in sync. Use cases include:

  • Consume data from AWS data stores to enable new communication channels. Examples of new channels are mobile or voice-based applications, which can include innovations based on ML
  • Migrate the producer (sender) and consumer (receiver) applications communicating with on-premises MOM platforms to AWS, with the end goal of retiring the on-premises MOM platform
This diagram shows an event-sourcing integration reference pattern for customers using middleware systems like IBM MQ on-premises with AWS services

Figure 5. Event-sourcing integration pattern

  1. Mainframe transactions from IBM MQ can be read using a connector or a bridge solution. They can then be published to Amazon MQ queues or Amazon Managed Streaming for Apache Kafka (Amazon MSK) topics.
  2. Once the data is published to the queue or topic, consumers encoded in AWS Lambda functions or Amazon compute services can process, map, transform, or filter the messages. They can store the data in Amazon RDS, Amazon ElastiCache, S3, or DynamoDB.
  3. Now that the data resides in AWS, you can build new cloud-native applications and do the following:

Conclusion

Mainframe offloading and modernization using AWS services enables you to reduce cost, modernize your architectures, and integrate your mainframe and cloud-native technologies. You’ll be able to inform your business decisions with improved analytics, and create new opportunities for innovation and the development of modern applications.

AWS Week in Review – March 14, 2022

Post Syndicated from Steve Roberts original https://aws.amazon.com/blogs/aws/aws-week-in-review-march-14-2022/

This post is part of our Week in Review series. Check back each week for a quick round up of interesting news and announcements from AWS!

Welcome to the March 14 AWS Week in Review post, and Happy Pi Day! I hope you managed to catch some of our livestreamed Pi day celebration of the 16th birthday of Amazon Simple Storage Service (Amazon S3). I certainly had a lot of fun in the event, along with my co-hosts – check out the end of this post for some interesting facts and fun from the day.

First, let’s dive right into the news items and launches from the last week that caught my attention.

Last Week’s Launches
New X2idn and X2iedn EC2 Instance Types – Customers with memory-intensive workloads and a requirement for high networking bandwidth may be interested in the newly announced X2idn and X2iedn instance types, which are built on the AWS Nitro system. Featuring third-generation Intel Xeon Scalable (Ice Lake) processors, these instance types can yield up to 50 percent higher compute price performance and up to 45 percent higher SAP Application Performance Standard (SAPS) performance than comparable X1 instances. If you’re curious about the suffixes on those instance type names, they specify processor and other information. In this case, the i suffix indicates that the instances are using an Intel processor, e means it’s a memory-optimized instance family, d indicates local NVMe-based SSDs physically connected to the host server, and n means the instance types support higher network bandwidth up to 100 Gbps. You can find out more about the new instance types in this news blog post.

Amazon DynamoDB released two updates – First, an increase in the default service quotas raises the number of tables allowed by default from 256 to 2,500 tables. This will help customers working with large numbers of tables. At the same time, the service also increased the allowed number of concurrent table management operations, from 50 to 500. Table management operations are those that create, update, or delete tables. The second update relates to PartiQL, a SQL-compatible query language you can use to query, insert, update, or delete DynamoDB table data. You can now specify a limit on the number of items processed. You’ll find this useful when you know you only need to process a certain number of items, helping reduce the cost and duration of requests.

If you’re coding against Amazon ECS‘s API, you may want to take a look at the change to UpdateService that now enables you to update load balancers, service registries, tag propagation, and ECS managed tags for a service. Previously, you would have had to delete and recreate the service to make changes to these resources. Now you can do it all with one call, making the experience less disruptive and more efficient. Take a look at the What’s New post for more details.
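
A hedged sketch of such an in-place update with boto3 is shown below. The cluster, service, container, and target group ARN values are placeholders, and this assumes an SDK version that includes the newer UpdateService parameters.

import boto3

ecs = boto3.client("ecs")

# Update the service's load balancer and tagging behavior without recreating the service
ecs.update_service(
    cluster="my-cluster",   # placeholder
    service="my-service",   # placeholder
    loadBalancers=[
        {
            "targetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/example/abc123",  # placeholder ARN
            "containerName": "web",
            "containerPort": 80,
        }
    ],
    propagateTags="SERVICE",
    enableECSManagedTags=True,
)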

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
If you’re analyzing time series data, take a look at this new book on building forecasting models and detecting anomalies in your data. It’s authored by Michael Hoarau, an AI/ML Specialist Solutions Architect at AWS.

March 8 was International Women’s Day and we published a post featuring several women, including fellow news blogger and published author Antje Barth, chatting about their experiences working in Developer Relations at AWS.

Upcoming AWS Events
Check your calendars and sign up for these AWS events:

.NET Application Modernization Webinar (March 23) – Sign up today to learn about .NET modernization, what it is, and why you might want to modernize. The webinar will include a deep dive focusing on the AWS Microservice Extractor for .NET.

AWS Summit Brussels is fast approaching on March 31st. Register here.

Pi Day Fun & Facts
As this post is published, we’re coming to the end of our livestreamed Pi Day event celebrating the 16th birthday of S3 – how time flies! Here are some interesting facts & fun snippets from the event:

  • In the keynote, we learned S3 currently stores over 200 trillion objects, and serves over 100 million requests per second!
  • S3‘s Intelligent Tiering has saved customers over $250 million to date.
  • Did you know that S3, having reached 16 years of age, is now eligible for a Washington State driver’s license? Or that it can now buy a lottery ticket, get a passport, or – check this – it can pilot a hang glider!
  • We asked each of our guests on the livestream, and the team of AWS news bloggers, to nominate their favorite pie. The winner? It’s a tie between apple and pecan pie!

That’s all for this week. Check back next Monday for another Week in Review!

— Steve

Happy 10th Birthday, DynamoDB! 🎉🎂🎁

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/happy-birthday-dynamodb/

On January 18th 2012, Jeff and Werner announced the general availability of Amazon DynamoDB, a fully managed flexible NoSQL database service for single-digit millisecond performance at any scale.

During the last 10 years, hundreds of thousands of customers have adopted DynamoDB. It regularly reaches new peaks of performance and scalability. For example, during the last Prime Day sales in June 2021, it handled trillions of requests over 66 hours while maintaining single-digit millisecond performance, peaking at 89.2 million requests per second. Disney+ uses DynamoDB to ingest content, metadata, and billions of viewer actions each day. Even during the unprecedented demand caused by the pandemic, DynamoDB helped customers as many organizations around the world had to change the way they worked and needed to meet and conduct business virtually. For example, Zoom was able to scale from 10 million to 300 million daily meeting participants when we all started to make video calls in early 2020.

A decade of innovation with Amazon DynamoDB

On this special anniversary, join us for a unique online event on Twitch on March 1st. I’ll tell you more about this at the end of this post. But before talking about this event, let’s take this opportunity to reflect on the genesis of this service and the main capabilities we have added since the original launch 10 years ago.

The History Behind DynamoDB
The story of DynamoDB started long before the launch 10 years ago. It started with a series of outages on Amazon’s e-commerce platform during the holiday shopping season in 2004. At that time, Amazon was transitioning from a monolithic architecture to microservices. The design principle was (and still is) that each stateful microservice uses its own data store, and other services are required to access a microservice’s data through a publicly exposed API. Direct database access was not an option anymore. At that time, most microservices were using a relational database provided by a third-party vendor. Given the volume of traffic during the holiday season in 2004, the database system experienced some hard-to-debug and hard-to-reproduce deadlocks. The e-commerce platform was pushing the relational databases to their limits, despite the fact that we were using simple usage patterns, such as query by primary keys only. These usage patterns do not require the complexity of a relational database.

At Amazon and AWS, after an outage happens, we start a process called Correction of Error (COE) to document the root cause of the issue, to describe how we fixed it, and to detail the changes we’re making to avoid recurrence. During the COE for this database issue, a young, naïve, 20-year-old intern named Swaminathan (Swami) Sivasubramanian (now VP of the database, analytics, and ML organization at AWS) asked the question, “Why are we using a relational database for this? These workloads don’t need the SQL level of complexity and transactional guarantees.”

This led Amazon to rethink the architecture of its data stores and to build the original Dynamo database. The objective was to address the demanding scalability and reliability requirements of the Amazon e-commerce platform. This non-relational, key-value database was initially targeted at use cases that were the core of the Amazon e-commerce operations, such as the shopping basket and the session service.

AWS published the Dynamo paper in 2007, three years later, to describe our design principles and provide the lessons learned from running this database to support Amazon’s core e-commerce operations. Over the years, we saw several Dynamo clones appear, proving other companies were searching for scalable solutions, just like Amazon.

After a couple of years, Dynamo was adopted by several core service teams at Amazon. Their engineers were very satisfied with the performance and scalability. However, we started to interview engineers to understand why it was not more broadly adopted within Amazon. We learned that Dynamo gave teams the reliability, performance, and scalability they needed, but it did not simplify the operational complexity of running the system. Teams still needed to install, configure, and operate the system in Amazon’s data centers.

At the time, AWS was offering Amazon SimpleDB as a NoSQL service. Many teams preferred the operational simplicity of SimpleDB despite the difficulty of scaling a domain beyond 10 GB, its unpredictable latency (which was affected by the size of the database and its indexes), and its eventual consistency model.

We concluded the ideal solution would combine the strengths of Dynamo—the scalability and the predictable low latency to retrieve data—with the operational simplicity of SimpleDB—just having a table to declare and let the system handle the low-level complexity transparently.

DynamoDB was born.

DynamoDB frees developers from the complexity of managing hardware and software. It handles the work of scaling and re-partitioning your data to meet your throughput requirements, without the need to manually re-partition tables, and it provides predictable, low-latency access to your data (single-digit milliseconds).

At AWS, the moment we launch a new service is not the end of the project. It is actually the beginning. Over the last 10 years, we have continuously listened to your feedback, and we have brought new capabilities to DynamoDB. In addition to hundreds of incremental improvements, we added:

… and many more.

Lastly, during the last AWS re:Invent conference, we announced Amazon DynamoDB Standard-Infrequent Access (DynamoDB Standard-IA). This new DynamoDB table class allows you to lower the cost of data storage for infrequently accessed data by 60%. The ideal use case is for data that you need to keep for the long term and that your application needs to occasionally access, without compromising on access latency. In the past, to lower storage costs for such data, you were writing code to move infrequently accessed data to lower-cost storage alternatives, such as Amazon Simple Storage Service (Amazon S3). Now you can switch to the DynamoDB Standard-IA table class to store infrequently accessed data while preserving the high availability and performance of DynamoDB.
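Switching table classes is a configuration change rather than a data migration. As an illustrative sketch with Boto3 (the table name is hypothetical), you can move an existing table to Standard-IA with a single UpdateTable call:

import boto3

dynamodb = boto3.client("dynamodb")

# Hypothetical table holding rarely accessed historical data.
dynamodb.update_table(
    TableName="ArchivedOrders",
    TableClass="STANDARD_INFREQUENT_ACCESS",
)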

How To Get Started
To get started with DynamoDB, as a developer, you can refer to the Getting Started Guide in our documentation or read the excellent DynamoDB, Explained, written by Alex DeBrie, one of our AWS Heroes, and author of The DynamoDB Book. To dive deep into DynamoDB data modeling, AWS Hero Jeremy Daly is preparing a video course “DynamoDB Modeling for the rest of us“.

Customers now use DynamoDB across virtually every industry vertical, geographic area, and company size. You continually surprise us with how you innovate on DynamoDB, and you keep pushing us to evolve it to make it easier to build the next generation of applications. We are going to continue to work backwards from your feedback to meet your ever-evolving needs and to enable you to innovate and scale for decades to come.

A Decade of Innovation with DynamoDB – A Virtual Event
As I mentioned at the beginning, we also would love to celebrate this anniversary with you. We prepared a live Twitch event for you to learn best practices, see technical demos, and attend a live Q&A. You will hear stories from two of our long-time customers: SmugMug CEO Don MacAskill and engineering leaders from Dropbox. In addition, you’ll get a chance to ask questions and chat with AWS blog legend and Chief Evangelist Jeff Barr, and with DynamoDB‘s product managers and engineers. Finally, AWS Heroes Alex DeBrie and Jeremy Daly will host two deep-dive technical sessions. Have a look at the full agenda here.

The event will be live on Twitch on March 1st; you can register today. The first 1,000 registrants from the US will receive a free digital copy of The DynamoDB Book (a $79 retail value).

To DynamoDB’s next 10 years. Cheers 🥂.

— seb

Doing more with less: Moving from transactional to stateful batch processing

Post Syndicated from Tom Jin original https://aws.amazon.com/blogs/big-data/doing-more-with-less-moving-from-transactional-to-stateful-batch-processing/

Amazon processes hundreds of millions of financial transactions each day, including accounts receivable, accounts payable, royalties, amortizations, and remittances, from over a hundred different business entities. All of this data is sent to the eCommerce Financial Integration (eCFI) systems, where it is recorded in the subledger.

Ensuring complete financial reconciliation at this scale is critical to day-to-day accounting operations. With transaction volumes exhibiting double-digit percentage growth each year, we found that our legacy transactional-based financial reconciliation architecture proved too expensive to scale and lacked the right level of visibility for our operational needs.

In this post, we show you how we migrated to a batch processing system, built on AWS, that consumes time-bounded batches of events. This not only reduced costs by almost 90%, but also improved visibility into our end-to-end processing flow. The code used for this post is available on GitHub.

Legacy architecture

Our legacy architecture primarily used Amazon Elastic Compute Cloud (Amazon EC2) to group related financial events into stateful artifacts. In this context, a stateful artifact could be any persistent artifact, such as a database entry or an Amazon Simple Storage Service (Amazon S3) object.

We found this approach resulted in deficiencies in the following areas:

  • Cost – Individually storing hundreds of millions of financial events per day in Amazon S3 resulted in high I/O and Amazon EC2 compute resource costs.
  • Data completeness – Different events flowed through the system at different speeds. For instance, while a small stateful artifact for a single customer order could be recorded in a couple of seconds, the stateful artifact for a bulk shipment containing a million lines might require several hours to update fully. This made it difficult to know whether all the data had been processed for a given time range.
  • Complex retry mechanisms – Financial events were passed between legacy systems using individual network calls, wrapped in a backoff retry strategy. Still, network timeouts, throttling, or traffic spikes could result in some events erroring out. This required us to build a separate service to sideline, manage, and retry problematic events at a later date.
  • Scalability – Bottlenecks occurred when different events competed to update the same stateful artifact. This resulted in excessive retries or redundant updates, making it less cost-effective as the system grew.
  • Operational support – Using dedicated EC2 instances meant that we needed to take valuable development time to manage OS patching, handle host failures, and schedule deployments.

The following diagram illustrates our legacy architecture.

Transactional-based legacy architecture

Evolution is key

Our new architecture needed to address the deficiencies while preserving the core goal of our service: update stateful artifacts based on incoming financial events. In our case, a stateful artifact refers to a group of related financial transactions used for reconciliation. We considered the following as part of the evolution of our stack:

  • Stateless and stateful separation
  • Minimized end-to-end latency
  • Scalability

Stateless and stateful separation

In our transactional system, each ingested event results in an update to a stateful artifact. This became a problem when thousands of events came in all at once for the same stateful artifact.

However, by ingesting batches of data, we had the opportunity to create separate stateless and stateful processing components. The stateless component performs an initial reduce operation on the input batch to group together related events. This meant that the rest of our system could operate on these smaller stateless artifacts and perform fewer write operations (fewer operations means lower costs).

The stateful component would then join these stateless artifacts with existing stateful artifacts to produce an updated stateful artifact.

As an example, imagine an online retailer suddenly receives thousands of purchases for a popular item. Instead of updating an item database entry thousands of times, we can first produce a single stateless artifact that summarizes the latest purchases. The item entry can then be updated one time with the stateless artifact, reducing the update bottleneck. The following diagram illustrates this process.

Batch visualization
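To make the idea concrete, here is a simplified PySpark sketch of the stateless reduce step. The input path, schema, and aggregation columns are assumptions for illustration; the actual job in the GitHub repository is more involved:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stateless-reduce").getOrCreate()

# Assume each raw financial event carries an artifact_id, an amount, and an event_time.
events = spark.read.json("s3://example-input-bucket/input-batch/dt=2021-12-01-00/")

# Collapse thousands of raw events into one stateless artifact per artifact_id,
# so the downstream stateful component performs a single update per artifact.
stateless_artifacts = events.groupBy("artifact_id").agg(
    F.sum("amount").alias("total_amount"),
    F.count("*").alias("event_count"),
    F.max("event_time").alias("latest_event_time"),
)

stateless_artifacts.write.mode("overwrite").parquet(
    "s3://example-output-bucket/stateless-artifacts/dt=2021-12-01-00/"
)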

Minimized end-to-end latency

Unlike traditional extract, transform, and load (ETL) jobs, we didn’t want to perform daily or even hourly extracts. Our accountants need to be able to access the updated stateful artifacts within minutes of data arriving in our system. For instance, if they had manually sent a correction line, they wanted to be able to check within the same hour that their adjustment had the intended effect on the targeted stateful artifact instead of waiting until the next day. As such, we focused on parallelizing the incoming batches of data as much as possible by breaking down the individual tasks of the stateful component into subcomponents. Each subcomponent could run independently of each other, which allowed us to process multiple batches in an assembly line format.

Scalability

Both the stateless and stateful components needed to respond to shifting traffic patterns and possible input batch backlogs. We also wanted to incorporate serverless compute to better respond to scale while reducing the overhead of maintaining an instance fleet.

This meant we couldn’t simply have a one-to-one mapping between the input batch and stateless artifact. Instead, we built flexibility into our service so the stateless component could automatically detect a backlog of input batches and group multiple input batches together in one job. Similar backlog management logic was applied to the stateful component. The following diagram illustrates this process.

Batch scalability

Current architecture

To meet our needs, we combined multiple AWS products:

  • AWS Step Functions – Orchestration of our stateless and stateful workflows
  • Amazon EMR – Apache Spark operations on our stateless and stateful artifacts
  • AWS Lambda – Stateful artifact indexing and orchestration backlog management
  • Amazon ElastiCache – Optimizing Amazon S3 request latency
  • Amazon S3 – Scalable storage of our stateless and stateful artifacts
  • Amazon DynamoDB – Stateless and stateful artifact index

The following diagram illustrates our current architecture.

Current architecture

The following diagram shows our stateless and stateful workflow.

Flowchart

The AWS CloudFormation template to render this architecture and corresponding Java code is available in the following GitHub repo.

Stateless workflow

We used an Apache Spark application on a long-running Amazon EMR cluster to simultaneously ingest input batch data and perform reduce operations to produce the stateless artifacts and a corresponding index file for the stateful processing to use.

We chose Amazon EMR for its proven highly available data-processing capability in a production setting and also its ability to horizontally scale when we see increased traffic loads. Most importantly, Amazon EMR had lower cost and better operational support when compared to a self-managed cluster.

Stateful workflow

Each stateful workflow performs operations to create or update millions of stateful artifacts using the stateless artifacts. Similar to the stateless workflows, all stateful artifacts are stored in Amazon S3 across a handful of Apache Spark part-files. This alone resulted in a huge cost reduction, because we significantly reduced the number of Amazon S3 writes (while using the same amount of overall storage). For instance, storing 10 million individual artifacts using the transactional legacy architecture would cost $50 in PUT requests alone, whereas 10 Apache Spark part-files would cost only $0.00005 in PUT requests (based on $0.005 per 1,000 requests).

However, we still needed a way to retrieve individual stateful artifacts, because any stateful artifact could be updated at any point in the future. To do this, we turned to DynamoDB. DynamoDB is a fully managed and scalable key-value and document database. It’s ideal for our access pattern because we index the location of each stateful artifact within the stateful output file, using the artifact’s unique identifier as the primary key. For instance, if our artifact represented orders, we would use the order ID (which has high cardinality) as the partition key, and store the file location, byte offset, and byte length of each order as separate attributes. By passing the byte range in Amazon S3 GET requests, we can fetch individual stateful artifacts as if they were stored independently. We were less concerned about optimizing the number of Amazon S3 GET requests because GET requests are over 10 times cheaper than PUT requests.
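The following sketch shows how such an index lookup plus ranged GET could look in Python. The table name, attribute names, and key schema are assumptions for illustration; the production code differs:

import boto3

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")

# Hypothetical index table keyed by order_id, storing the part-file location,
# byte offset, and byte length of each stateful artifact.
index_table = dynamodb.Table("stateful-artifact-index")

def fetch_stateful_artifact(order_id):
    entry = index_table.get_item(Key={"order_id": order_id})["Item"]
    start = int(entry["byte_offset"])
    end = start + int(entry["byte_length"]) - 1

    # A ranged GET retrieves only this artifact's bytes from the larger part-file.
    response = s3.get_object(
        Bucket=entry["bucket"],
        Key=entry["file_key"],
        Range=f"bytes={start}-{end}",
    )
    return response["Body"].read()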

Overall, this stateful logic was split across three serial subcomponents, which meant that three separate stateful workflows could be operating at any given time.

Pre-fetcher

The following diagram illustrates our pre-fetcher subcomponent.

Prefetcher architecture

The pre-fetcher subcomponent uses the stateless index file to retrieve pre-existing stateful artifacts that should be updated. These might be previous shipments for the same customer order, or past inventory movements for the same warehouse. For this, we turn once again to Amazon EMR to perform this high-throughput fetch operation.

Each fetch required a DynamoDB lookup and an Amazon S3 GET partial byte-range request. Due to the large number of external calls, fetches were highly parallelized using a thread pool contained within an Apache Spark flatMap operation. Pre-fetched stateful artifacts were consolidated into an output file that was later used as input to the stateful processing engine.
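The pattern of parallelizing per-record lookups inside a Spark task could be sketched as follows, shown here with mapPartitions and the hypothetical fetch_stateful_artifact helper from the earlier sketch; the real implementation uses flatMap and differs in detail:

from concurrent.futures import ThreadPoolExecutor

def prefetch_partition(rows):
    # In a real Spark job, create the boto3 clients inside this function so each
    # executor builds its own; here fetch_stateful_artifact (from the earlier sketch)
    # encapsulates the DynamoDB lookup and ranged S3 GET.
    with ThreadPoolExecutor(max_workers=32) as pool:
        futures = [pool.submit(fetch_stateful_artifact, row.order_id) for row in rows]
        for future in futures:
            yield future.result()

# index_rows is a DataFrame of stateless index entries; the fetched artifacts
# feed the consolidated output file used by the stateful processing engine.
prefetched = index_rows.rdd.mapPartitions(prefetch_partition)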

Stateful processing engine

The following diagram illustrates the stateful processing engine.

Stateful processor architecture

The stateful processing engine subcomponent joins the pre-fetched stateful artifacts with the stateless artifacts to produce updated stateful artifacts after applying custom business logic. The updated stateful artifacts are written out across multiple Apache Spark part-files.

Because stateful artifacts could have been indexed at the same time that they were pre-fetched (also called in-flight updates), the stateful processor also joins recently processed Apache Spark part-files.

We again used Amazon EMR here to take advantage of the Apache Spark operations that are required to join the stateless and stateful artifacts.

State indexer

The following diagram illustrates the state indexer.

State Indexer architecture

This Lambda-based subcomponent records the location of each stateful artifact within the stateful part-file in DynamoDB. The state indexer also caches the stateful artifacts in an Amazon ElastiCache for Redis cluster to provide a performance boost in the Amazon S3 GET requests performed by the pre-fetcher.

However, even with a thread pool, a single Lambda function isn’t powerful enough to index millions of stateful artifacts within the 15-minute time limit. Instead, we employ a cluster of Lambda functions. The state indexer begins with a single coordinator Lambda function, which determines the number of worker functions that are needed. For instance, if 100 part-files are generated by the stateful processing engine, then the coordinator might assign five part-files for each of the 20 Lambda worker functions to work on. This method is highly scalable because we can dynamically assign more or fewer Lambda workers as required.

Each Lambda worker then performs the ElastiCache and DynamoDB writes for all the stateful artifacts within each assigned part-file in a multi-threaded manner. The coordinator function monitors the health of each Lambda worker and restarts workers as needed.

Distributed Lambda architecture
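A coordinator that fans out asynchronous worker invocations might look roughly like this; the event shape, worker function name, and chunk size are assumptions for illustration:

import json
import boto3

lambda_client = boto3.client("lambda")

def coordinator_handler(event, context):
    # Assume the event lists the Spark part-file keys produced by the stateful processor.
    part_files = event["part_files"]
    files_per_worker = 5

    # Fan out one asynchronous worker invocation per chunk of part-files.
    for i in range(0, len(part_files), files_per_worker):
        chunk = part_files[i : i + files_per_worker]
        lambda_client.invoke(
            FunctionName="state-indexer-worker",  # hypothetical worker function name
            InvocationType="Event",               # asynchronous invocation
            Payload=json.dumps({"part_files": chunk}),
        )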

Orchestration

We used Step Functions to coordinate each of the stateless and stateful workflows, as shown in the following diagram.

Step Function Workflow

Every time a new workflow step ran, the step was recorded in a DynamoDB table via a Lambda function. This table not only maintained the order in which stateful batches should be run, but it also formed the basis of the backlog management system, which directed the stateless ingestion engine to group more or fewer input batches together depending on the backlog.
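As a rough illustration, the Lambda function that records each step could write one item per workflow step; the table name, attribute names, and event shape here are assumptions:

import time
import boto3

# Hypothetical table that tracks batch workflow progress for backlog management.
steps_table = boto3.resource("dynamodb").Table("workflow-steps")

def record_step_handler(event, context):
    steps_table.put_item(
        Item={
            "workflow_id": event["workflow_id"],
            "step_name": event["step_name"],
            "status": event["status"],          # for example, STARTED or COMPLETED
            "recorded_at": int(time.time()),
        }
    )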

We chose Step Functions for its native integration with many AWS services (including triggering by an Amazon CloudWatch scheduled event rule and adding Amazon EMR steps) and its built-in support for backoff retries and complex state machine logic. For instance, we defined different backoff retry rates based on the type of error.

Conclusion

Our batch-based architecture helped us overcome the transactional processing limitations we originally set out to resolve:

  • Reduced cost – We have been able to scale to thousands of workflows and hundreds of millions of events per day using only three or four core nodes per EMR cluster. This reduced our Amazon EC2 usage by over 90% compared with a similar transactional system. Additionally, writing out batches instead of individual transactions reduced the number of Amazon S3 PUT requests by over 99.8%.
  • Data completeness guarantees – Because each input batch is associated with a time interval, when a batch has finished processing, we know that all events in that time interval have been completed.
  • Simplified retry mechanisms – Batch processing means that failures occur at the batch level and can be retried directly through the workflow. Because there are far fewer batches than transactions, batch retries are much more manageable. For instance, in our service, a typical batch contains about two million entries. During a service outage, only a single batch needs to be retried, as opposed to two million individual entries in the legacy architecture.
  • High scalability – We’ve been impressed with how easy it is to scale our EMR clusters on the fly if we detect an increase in traffic. Using Amazon EMR instance fleets also helps us automatically choose the most cost-effective instances across different Availability Zones. We also like the performance achieved by our Lambda-based state indexer. This subcomponent not only dynamically scales with no human intervention, but has also been surprisingly cost-efficient. A large portion of our usage has fallen within the free tier.
  • Operational excellence – Replacing traditional hosts with serverless components such as Lambda allowed us to spend less time on compliance tickets and focus more on delivering features for our customers.

We are particularly excited about the investments we have made moving from a transactional-based system to a batch processing system, especially our shift from using Amazon EC2 to using serverless Lambda and big data Amazon EMR services. This experience demonstrates that even services originally built on AWS can still achieve cost reductions and improve performance by rethinking how AWS services are used.

Inspired by our progress, our team is moving to replace many other legacy services with serverless components. Likewise, we hope that other engineering teams can learn from our experience, continue to innovate, and do more with less.

Find the code used for this post in the following GitHub repository.

Special thanks to development team: Ryan Schwartz, Abhishek Sahay, Cecilia Cho, Godot Bian, Sam Lam, Jean-Christophe Libbrecht, and Nicholas Leong.


About the Authors


Tom Jin is a Senior Software Engineer for eCommerce Financial Integration (eCFI) at Amazon. His interests include building large-scale systems and applying machine learning to healthcare applications. He is based in Vancouver, Canada and is a fan of ocean conservation.

Karthik Odapally is a Senior Solutions Architect at AWS supporting our Gaming Customers. He loves presenting at external conferences like AWS Re:Invent, and helping customers learn about AWS. His passion outside of work is to bake cookies and bread for family and friends here in the PNW. In his spare time, he plays Legend of Zelda (Link’s Awakening) with his 4 yr old daughter.

How ENGIE scales their data ingestion pipelines using Amazon MWAA

Post Syndicated from Anouar Zaaber original https://aws.amazon.com/blogs/big-data/how-engie-scales-their-data-ingestion-pipelines-using-amazon-mwaa/

ENGIE—one of the largest utility providers in France and a global player in the zero-carbon energy transition—produces, transports, and deals in electricity, gas, and energy services. With 160,000 employees worldwide, ENGIE is a decentralized organization and operates 25 business units with a high level of delegation and empowerment. ENGIE’s decentralized global customer base had accumulated lots of data, and it required a smarter, unique approach and solution to align its initiatives and provide data that is ingestible, organizable, governable, sharable, and actionable across its global business units.

In 2018, the company’s business leadership decided to accelerate its digital transformation through data and innovation by becoming a data-driven company. Yves Le Gélard, chief digital officer at ENGIE, explains the company’s purpose: “Sustainability for ENGIE is the alpha and the omega of everything. This is our raison d’être. We help large corporations and the biggest cities on earth in their attempts to transition to zero carbon as quickly as possible because it is actually the number one question for humanity today.”

ENGIE, as with any other big enterprise, uses multiple extract, transform, and load (ETL) tools to ingest data into their data lake on AWS. However, these tools often come with expensive licensing plans. “The company needed a uniform method of collecting and analyzing data to help customers manage their value chains,” says Gregory Wolowiec, the Chief Technology Officer who leads ENGIE’s data program. ENGIE wanted a license-free application, well integrated with multiple technologies and with a continuous integration, continuous delivery (CI/CD) pipeline to more easily scale all their ingestion processes.

ENGIE started using Amazon Managed Workflows for Apache Airflow (Amazon MWAA) to solve this issue and began moving data from various sources, including on-premises applications and ERPs, AWS services like Amazon Redshift, Amazon Relational Database Service (Amazon RDS), and Amazon DynamoDB, external services like Salesforce, and other cloud providers, into a centralized data lake on top of Amazon Simple Storage Service (Amazon S3).

Amazon MWAA is used in particular to collect and store harmonized operational and corporate data from different on-premises and software as a service (SaaS) data sources into a centralized data lake. The purpose of this data lake is to create a “group performance cockpit” that enables an efficient, data-driven analysis and thoughtful decision-making by the Engie Management board.

In this post, we share how ENGIE created a CI/CD pipeline for an Amazon MWAA project template using an AWS CodeCommit repository and plugged it into AWS CodePipeline to build, test, and package the code and custom plugins. In this use case, we developed a custom plugin to ingest data from Salesforce based on the Airflow Salesforce open-source plugin.

Solution overview

The following diagrams illustrate the solution architecture defining the implemented Amazon MWAA environment and its associated pipelines. It also describes the customer use case for Salesforce data ingestion into Amazon S3.

The following diagram shows the architecture of the deployed Amazon MWAA environment and the implemented pipelines.

The preceding architecture is fully deployed via infrastructure as code (IaC). The implementation includes the following:

  • Amazon MWAA environment – A customizable Amazon MWAA environment packaged with plugins and requirements and configured in a secure manner.
  • Provisioning pipeline – The admin team can manage the Amazon MWAA environment using the included CI/CD provisioning pipeline. This pipeline includes a CodeCommit repository plugged into CodePipeline to continuously update the environment and its plugins and requirements.
  • Project pipeline – This CI/CD pipeline comes with a CodeCommit repository that triggers CodePipeline to continuously build, test and deploy DAGs developed by users. Once deployed, these DAGs are made available in the Amazon MWAA environment.

The following diagram shows the data ingestion workflow, which includes the following steps:

  1. The DAG is triggered by Amazon MWAA manually or based on a schedule.
  2. Amazon MWAA initiates data collection parameters and calculates batches.
  3. Amazon MWAA distributes processing tasks among its workers.
  4. Data is retrieved from Salesforce in batches.
  5. Amazon MWAA assumes an AWS Identity and Access Management (IAM) role with the necessary permissions to store the collected data into the target S3 bucket.

This AWS Cloud Development Kit (AWS CDK) construct is implemented with the following security best practices:

  • With the principle of least privilege, you grant permissions to only the resources or actions that users need to perform tasks.
  • S3 buckets are deployed with security compliance rules: encryption, versioning, and blocking public access.
  • Authentication and authorization management is handled with AWS Single Sign-On (AWS SSO).
  • Airflow stores connections to external sources in a secure manner, either in Airflow’s default secrets backend or in an alternative secrets backend such as AWS Secrets Manager or AWS Systems Manager Parameter Store (see the configuration sketch after this list).
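If you opt for Secrets Manager, the usual approach is to set the Airflow secrets backend through the Amazon MWAA configuration options. The keys below are shown in their airflow.cfg form, and the connection and variable prefixes are illustrative assumptions:

secrets.backend = airflow.providers.amazon.aws.secrets.secrets_manager.SecretsManagerBackend
secrets.backend_kwargs = {"connections_prefix": "airflow/connections", "variables_prefix": "airflow/variables"}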

For this post, we step through a use case using the data from Salesforce to ingest it into an ENGIE data lake in order to transform it and build business reports.

Prerequisites for deployment

For this walkthrough, the following are prerequisites:

  • Basic knowledge of the Linux operating system
  • Access to an AWS account with administrator or power user (or equivalent) IAM role policies attached
  • Access to a shell environment or optionally with AWS CloudShell

Deploy the solution

To deploy and run the solution, complete the following steps:

  1. Install AWS CDK.
  2. Bootstrap your AWS account.
  3. Define your AWS CDK environment variables.
  4. Deploy the stack.

Install AWS CDK

The described solution is fully deployed with AWS CDK.

AWS CDK is an open-source software development framework to model and provision your cloud application resources using familiar programming languages. If you want to familiarize yourself with AWS CDK, the AWS CDK Workshop is a great place to start.

Install AWS CDK using the following commands:

npm install -g aws-cdk
# To check the installation
cdk --version

Bootstrap your AWS account

First, you need to make sure the environment where you’re planning to deploy the solution has been bootstrapped. You only need to do this one time per environment where you want to deploy AWS CDK applications. If you’re unsure whether your environment has been bootstrapped already, you can always run the command again:

cdk bootstrap aws://YOUR_ACCOUNT_ID/YOUR_REGION

Define your AWS CDK environment variables

On Linux or MacOS, define your environment variables with the following code:

export CDK_DEFAULT_ACCOUNT=YOUR_ACCOUNT_ID
export CDK_DEFAULT_REGION=YOUR_REGION

On Windows, use the following code:

setx CDK_DEFAULT_ACCOUNT YOUR_ACCOUNT_ID
setx CDK_DEFAULT_REGION YOUR_REGION

Deploy the stack

By default, the stack deploys a basic Amazon MWAA environment with the associated pipelines described previously. It creates a new VPC in order to host the Amazon MWAA resources.

The stack can be customized using the parameters listed in the following table.

To pass a parameter to the construct, you can use the AWS CDK runtime context. If you intend to customize your environment with multiple parameters, we recommend using the cdk.json context file with version control to avoid unexpected changes to your deployments. Throughout our example, we pass only one parameter to the construct, so for the simplicity of the tutorial, we use the --context or -c option of the cdk command, as in the following example:

cdk deploy -c paramName=paramValue -c paramName=paramValue ...
Parameter Description Default Valid values
vpcId VPC ID where the cluster is deployed. If none, creates a new one and needs the parameter cidr in that case. None VPC ID
cidr The CIDR for the VPC that is created to host Amazon MWAA resources. Used only if the vpcId is not defined. 172.31.0.0/16 IP CIDR
subnetIds Comma-separated list of subnet IDs where the cluster is deployed. If none, looks for private subnets in the same Availability Zone. None Subnet ID list (comma-separated)
envName Amazon MWAA environment name MwaaEnvironment String
envTags Amazon MWAA environment tags None See the following JSON example: '{"Environment":"MyEnv", "Application":"MyApp", "Reason":"Airflow"}'
environmentClass Amazon MWAA environment class mw1.small mw1.small, mw1.medium, mw1.large
maxWorkers Amazon MWAA maximum workers 1 int
webserverAccessMode Amazon MWAA environment access mode (private or public) PUBLIC_ONLY PUBLIC_ONLY, PRIVATE_ONLY
secretsBackend Amazon MWAA environment secrets backend Airflow SecretsManager

Clone the GitHub repository:

git clone https://github.com/aws-samples/cdk-amazon-mwaa-cicd

Deploy the stack using the following command:

cd mwaairflow && \
pip install . && \
cdk synth && \
cdk deploy -c vpcId=YOUR_VPC_ID

The following screenshot shows the stack deployment:

The following screenshot shows the deployed stack:

Create solution resources

For this walkthrough, you should have the following prerequisites:

If you don’t have a Salesforce account, you can create a Salesforce developer account:

  1. Sign up for a developer account.
  2. Copy the host from the email that you receive.
  3. Log in to your new Salesforce account.
  4. Choose the profile icon, then Settings.
  5. Choose Reset my Security Token.
  6. Check your email and copy the security token that you receive.

After you complete these prerequisites, you’re ready to create the following resources:

  • An S3 bucket for Salesforce output data
  • An IAM role and IAM policy to write the Salesforce output data on Amazon S3
  • A Salesforce connection on the Airflow UI to be able to read from Salesforce
  • An AWS connection on the Airflow UI to be able to write on Amazon S3
  • An Airflow variable on the Airflow UI to store the name of the target S3 bucket

Create an S3 bucket for Salesforce output data

To create an output S3 bucket, complete the following steps:

  1. On the Amazon S3 console, choose Create bucket.

The Create bucket wizard opens.

  1. For Bucket name, enter a DNS-compliant name for your bucket, such as airflow-blog-post.
  2. For Region, choose the Region where you deployed your Amazon MWAA environment, for example, US East (N. Virginia) us-east-1.
  3. Choose Create bucket.

For more information, see Creating a bucket.

Create an IAM role and IAM policy to write the Salesforce output data on Amazon S3

In this step, we create an IAM policy that allows Amazon MWAA to write on your S3 bucket.

  1. On the IAM console, in the navigation pane, choose Policies.
  2. Choose Create policy.
  3. Choose the JSON tab.
  4. Enter the following JSON policy document, and replace airflow-blog-post with your bucket name:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": ["s3:ListBucket"],
          "Resource": ["arn:aws:s3:::airflow-blog-post"]
        },
        {
          "Effect": "Allow",
          "Action": [
            "s3:PutObject",
            "s3:GetObject",
            "s3:DeleteObject"
          ],
          "Resource": ["arn:aws:s3:::airflow-blog-post/*"]
        }
      ]
    }

  5. Choose Next: Tags.
  6. Choose Next: Review.
  7. For Name, choose a name for your policy (for example, airflow_data_output_policy).
  8. Choose Create policy.

Let’s attach the IAM policy to a new IAM role that we use in our Airflow connections.

  1. On the IAM console, choose Roles in the navigation pane and then choose Create role.
  2. In the Or select a service to view its use cases section, choose S3.
  3. For Select your use case, choose S3.
  4. Search for the name of the IAM policy that we created in the previous step (airflow_data_output_policy) and select the policy.
  5. Choose Next: Tags.
  6. Choose Next: Review.
  7. For Role name, choose a name for your role (airflow_data_output_role).
  8. Review the role and then choose Create role.

You’re redirected to the Roles section.

  1. In the search box, enter the name of the role that you created and choose it.
  2. Copy the role ARN to use later to create the AWS connection on Airflow.

Create a Salesforce connection on the Airflow UI to be able to read from Salesforce

To read data from Salesforce, we need to create a connection using the Airflow user interface.

  1. On the Airflow UI, choose Admin.
  2. Choose Connections, and then the plus sign to create a new connection.
  3. Fill in the fields with the required information.

The following table provides more information about each value.

Field Mandatory Description Values
Conn Id Yes Connection ID to define and to be used later in the DAG For example, salesforce_connection
Conn Type Yes Connection type HTTP
Host Yes Salesforce host name host-dev-ed.my.salesforce.com or host.lightning.force.com. Replace the host with your Salesforce host and don’t add the http:// as prefix.
Login Yes The Salesforce user name. The user must have read access to the Salesforce objects. [email protected]
Password Yes The corresponding password for the defined user. MyPassword123
Port No Salesforce instance port. By default, 443. 443
Extra Yes Specify the extra parameters (as a JSON dictionary) that can be used in the Salesforce connection. security_token is the Salesforce security token for authentication. To get the Salesforce security token in your email, you must reset your security token. {"security_token":"AbCdE..."}

Create an AWS connection in the Airflow UI to be able to write on Amazon S3

An AWS connection is required to upload data into Amazon S3, so we need to create a connection using the Airflow user interface.

  1. On the Airflow UI, choose Admin.
  2. Choose Connections, and then choose the plus sign to create a new connection.
  3. Fill in the fields with the required information.

The following table provides more information about the fields.

Field Mandatory Description Value
Conn Id Yes Connection ID to define and to be used later in the DAG For example, aws_connection
Conn Type Yes Connection type Amazon Web Services
Extra Yes It is required to specify the Region. You also need to provide the role ARN that we created earlier.
{
"region":"eu-west-1",
"role_arn":"arn:aws:iam::123456789101:role/airflow_data_output_role "
}

Create an Airflow variable on the Airflow UI to store the name of the target S3 bucket

We create a variable to set the name of the target S3 bucket. This variable is used by the DAG, so we need to create it using the Airflow user interface.

  1. On the Airflow UI, choose Admin.
  2. Choose Variables, then choose the plus sign to create a new variable.
  3. For Key, enter bucket_name.
  4. For Val, enter the name of the S3 bucket that you created in a previous step (airflow-blog-post).

Create and deploy a DAG in Amazon MWAA

To be able to ingest data from Salesforce into Amazon S3, we need to create a DAG (Directed Acyclic Graph). To create and deploy the DAG, complete the following steps:

  1. Create a local Python DAG.
  2. Deploy your DAG using the project CI/CD pipeline.
  3. Run your DAG on the Airflow UI.
  4. Display your data in Amazon S3 (with S3 Select).

Create a local Python DAG

The provided SalesForceToS3Operator allows you to ingest data from Salesforce objects to an S3 bucket. Refer to standard Salesforce objects for the full list of objects you can ingest data from with this Airflow operator.

In this use case, we ingest data from the Opportunity Salesforce object. We retrieve the last 6 months’ data in monthly batches and we filter on a specific list of fields.

The DAG provided in the sample in GitHub repository imports the last 6 months of the Opportunity object (one file by month) by filtering the list of retrieved fields.

This operator takes two connections as parameters:

  • An AWS connection that is used to upload ingested data into Amazon S3.
  • A Salesforce connection to read data from Salesforce.

The following table provides more information about the parameters.

Parameter Type Mandatory Description
sf_conn_id string Yes Name of the Airflow connection that has the following information:

  • user name
  • password
  • security token
sf_obj string Yes Name of the relevant Salesforce object (Account, Lead, Opportunity)
s3_conn_id string Yes The destination S3 connection ID
s3_bucket string Yes The destination S3 bucket
s3_key string Yes The destination S3 key
sf_fields string No The (optional) list of fields that you want to get from the object (Id, Name, and so on).
If none (the default), then this gets all fields for the object.
fmt string No The (optional) format that the S3 key of the data should be in.
Possible values include CSV (default), JSON, and NDJSON.
from_date date format No A specific date-time (optional) formatted input to run queries from for incremental ingestion.
Evaluated against the SystemModStamp attribute.
Not compatible with the query parameter and should be in date-time format (for example, 2021-01-01T00:00:00Z).
Default: None
to_date date format No A specific date-time (optional) formatted input to run queries to for incremental ingestion.
Evaluated against the SystemModStamp attribute.
Not compatible with the query parameter and should be in date-time format (for example, 2021-01-01T00:00:00Z).
Default: None
query string No A specific query (optional) to run for the given object.
This overrides default query creation.
Default: None
relationship_object string No Some queries require relationship objects to work, and these are not the same names as the Salesforce object.
Specify that relationship object here (optional).
Default: None
record_time_added boolean No Set this optional value to true if you want to add a Unix timestamp field to the resulting data that marks when the data was fetched from Salesforce.
Default: False
coerce_to_timestamp boolean No Set this optional value to true if you want to convert all fields with dates and datetimes into Unix timestamp (UTC).
Default: False

The first step is to import the operator in your DAG:

from operators.salesforce_to_s3_operator import SalesforceToS3Operator

Then define your DAG default ARGs, which you can use for your common task parameters:

# These args will get passed on to each operator
# You can override them on a per-task basis during operator initialization
default_args = {
    'owner': '[email protected]',
    'depends_on_past': False,
    'start_date': days_ago(2),
    'retries': 0,
    'retry_delay': timedelta(minutes=1),
    'sf_conn_id': 'salesforce_connection',
    's3_conn_id': 'aws_connection',
    's3_bucket': 'salesforce-to-s3',
}
...

Finally, you define the tasks to use the operator.

The following examples illustrate some use cases.

Salesforce object full ingestion

This task ingests all the content of the Salesforce object defined in sf_obj. This selects all the object’s available fields and writes them into the defined format in fmt. See the following code:

...
salesforce_to_s3 = SalesforceToS3Operator(
    task_id="Opportunity_to_S3",
    sf_conn_id=default_args["sf_conn_id"],
    sf_obj="Opportunity",
    fmt="ndjson",
    s3_conn_id=default_args["s3_conn_id"],
    s3_bucket=default_args["s3_bucket"],
    s3_key=f"salesforce/raw/dt={s3_prefix}/{table.lower()}.json",
    dag=salesforce_to_s3_dag,
)
...

Salesforce object partial ingestion based on fields

This task ingests specific fields of the Salesforce object defined in sf_obj. The selected fields are defined in the optional sf_fields parameter. See the following code:

...
salesforce_to_s3 = SalesforceToS3Operator(
    task_id="Opportunity_to_S3",
    sf_conn_id=default_args["sf_conn_id"],
    sf_obj="Opportunity",
    sf_fields=["Id","Name","Amount"],
    fmt="ndjson",
    s3_conn_id=default_args["s3_conn_id"],
    s3_bucket=default_args["s3_bucket"],
    s3_key=f"salesforce/raw/dt={s3_prefix}/{table.lower()}.json",
    dag=salesforce_to_s3_dag,
)
...

Salesforce object partial ingestion based on time period

This task ingests all the fields of the Salesforce object defined in sf_obj. The time period can be relative using from_date or to_date parameters or absolute by using both parameters.

The following example illustrates relative ingestion from the defined date:

...
salesforce_to_s3 = SalesforceToS3Operator(
    task_id="Opportunity_to_S3",
    sf_conn_id=default_args["sf_conn_id"],
    sf_obj="Opportunity",
    from_date="YESTERDAY",
    fmt="ndjson",
    s3_conn_id=default_args["s3_conn_id"],
    s3_bucket=default_args["s3_bucket"],
    s3_key=f"salesforce/raw/dt={s3_prefix}/{table.lower()}.json",
    dag=salesforce_to_s3_dag,
)
...

The from_date and to_date parameters support Salesforce date-time format. It can be either a specific date or literal (for example TODAY, LAST_WEEK, LAST_N_DAYS:5). For more information about date formats, see Date Formats and Date Literals.

For the full DAG, refer to the sample in GitHub repository.

This code dynamically generates tasks that run queries to retrieve the data of the Opportunity object in the form of 1-month batches.

The sf_fields parameter allows us to extract only the selected fields from the object.
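A simplified sketch of that dynamic task generation might look like the following; the month arithmetic, task IDs, and S3 key layout are assumptions, and the actual DAG in the repository may differ:

from datetime import datetime
from dateutil.relativedelta import relativedelta

now = datetime.utcnow()
for month_offset in range(6):
    # Build one-month windows in Salesforce date-time format, newest month first.
    start = (now - relativedelta(months=month_offset + 1)).strftime("%Y-%m-%dT00:00:00Z")
    end = (now - relativedelta(months=month_offset)).strftime("%Y-%m-%dT00:00:00Z")

    SalesforceToS3Operator(
        task_id=f"Opportunity_to_S3_month_{month_offset}",
        sf_conn_id=default_args["sf_conn_id"],
        sf_obj="Opportunity",
        sf_fields=["Id", "Name", "Amount"],
        from_date=start,
        to_date=end,
        fmt="ndjson",
        s3_conn_id=default_args["s3_conn_id"],
        s3_bucket=default_args["s3_bucket"],
        s3_key=f"salesforce/raw/dt={start[:10]}/opportunity.json",
        dag=salesforce_to_s3_dag,
    )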

Save the DAG locally as salesforce_to_s3.py.

Deploy your DAG using the project CI/CD pipeline

As part of the CDK deployment, a CodeCommit repository and CodePipeline pipeline were created in order to continuously build, test, and deploy DAGs into your Amazon MWAA environment.

To deploy the new DAG, the source code should be committed to the CodeCommit repository. This triggers a CodePipeline run that builds, tests, and deploys your new DAG and makes it available in your Amazon MWAA environment.

  1. Sign in to the CodeCommit console in your deployment Region.
  2. Under Source, choose Repositories.

You should see a new repository mwaaproject.

  1. Push your new DAG to the mwaaproject repository under dags. You can either use the CodeCommit console or the Git command line to do so:
    1. CodeCommit console:
      1. Choose the project CodeCommit repository name mwaaproject and navigate under dags.
      2. Choose Add file and then Upload file and upload your new DAG.
    2. Git command line:
      1. To be able to clone and access your CodeCommit project with the Git command line, make sure Git client is properly configured. Refer to Setting up for AWS CodeCommit.
      2. Clone the repository with the following command after replacing <region> with your project Region:
        git clone https://git-codecommit.<region>.amazonaws.com/v1/repos/mwaaproject

      3. Copy the DAG file under dags and add it with the command:
        git add dags/salesforce_to_s3.py

      4. Commit your new file with a message:
        git commit -m "add salesforce DAG"

      5. Push the local file to the CodeCommit repository:
        git push

The new commit triggers a new pipeline that builds, tests, and deploys the new DAG. You can monitor the pipeline on the CodePipeline console.

  1. On the CodePipeline console, choose Pipeline in the navigation pane.
  2. On the Pipelines page, you should see mwaaproject-pipeline.
  3. Choose the pipeline to display its details.

After checking that the pipeline run is successful, you can verify that the DAG is deployed to the S3 bucket and therefore available on the Amazon MWAA console.

  1. On the Amazon S3 console, look for a bucket starting with mwaairflowstack-mwaaenvstackne and go under dags.

You should see the new DAG.

  1. On the Amazon MWAA console, choose DAGs.

You should be able to see the new DAG.

Run your DAG on the Airflow UI

Go to the Airflow UI and toggle on the DAG.

This triggers your DAG automatically.

Later, you can continue manually triggering it by choosing the run icon.

Choose the DAG and Graph View to see the run of your DAG.

If you have any issue, you can check the logs of the failed tasks from the task instance context menu.

Display your data in Amazon S3 (with S3 Select)

To display your data, complete the following steps:

  1. On the Amazon S3 console, in the Buckets list, choose the name of the bucket that contains the output of the Salesforce data (airflow-blog-post).
  2. In the Objects list, choose the name of the folder that has the object that you copied from Salesforce (opportunity).
  3. Choose the raw folder and the dt folder with the latest timestamp.
  4. Select any file.
  5. On the Actions menu, choose Query with S3 Select.
  6. Choose Run SQL query to preview the data.

Clean up

To avoid incurring future charges, delete the AWS CloudFormation stack and the resources that you deployed as part of this post.

  1. On the AWS CloudFormation console, delete the stack MWAAirflowStack.

To clean up the deployed resources using the AWS Command Line Interface (AWS CLI), you can simply run the following command:

cdk destroy MWAAirflowStack

Make sure you are in the root path of the project when you run the command.

After confirming that you want to destroy the CloudFormation stack, the solution’s resources are deleted from your AWS account.

The following screenshot shows the process of destroying the stack:

The following screenshot confirms the stack is undeployed.

  1. Navigate to the Amazon S3 console and locate the two buckets containing mwaairflowstack-mwaaenvstack and mwaairflowstack-mwaaproj that were created during the deployment.
  2. Select each bucket, delete its contents, and then delete the bucket.
  3. Delete the IAM role created to write on the S3 buckets.

Conclusion

ENGIE discovered significant value by using Amazon MWAA, enabling its global business units to ingest data in more productive ways. This post presented how ENGIE scaled their data ingestion pipelines using Amazon MWAA. The first part of the post described the architecture components and how to successfully deploy a CI/CD pipeline for an Amazon MWAA project template using a CodeCommit repository and plug it into CodePipeline to build, test, and package the code and custom plugins. The second part walked you through the steps to automate the ingestion process from Salesforce using Airflow with an example. For the Airflow configuration, you used Airflow variables, but you can also use Secrets Manager with Amazon MWAA using the secretsBackend parameter when deploying the stack.

The use case discussed in this post is just one example of how you can use Amazon MWAA to make it easier to set up and operate end-to-end data pipelines in the cloud at scale. For more information about Amazon MWAA, check out the User Guide.


About the Authors

Anouar Zaaber is a Senior Engagement Manager in AWS Professional Services. He leads internal AWS, external partner, and customer teams to deliver AWS cloud services that enable the customers to realize their business outcomes.

Amine El Mallem is a Data/ML Ops Engineer in AWS Professional Services. He works with customers to design, automate, and build solutions on AWS for their business needs.

Armando Segnini is a Data Architect with AWS Professional Services. He spends his time building scalable big data and analytics solutions for AWS Enterprise and Strategic customers. Armando also loves to travel with his family all around the world and take pictures of the places he visits.

Mohamed-Ali Elouaer is a DevOps Consultant with AWS Professional Services. He is part of the AWS ProServe team, helping enterprise customers solve complex problems related to automation, security, and monitoring using AWS services. In his free time, he likes to travel and watch movies.

Julien Grinsztajn is an Architect at ENGIE. He is part of the Digital & IT Consulting ENGIE IT team working on the definition of the architecture for complex projects related to data integration and network security. In his free time, he likes to travel the oceans to meet sharks and other marine creatures.

Enhance Your Contact Center Solution with Automated Voice Authentication and Visual IVR

Post Syndicated from Soonam Jose original https://aws.amazon.com/blogs/architecture/enhance-your-contact-center-solution-with-automated-voice-authentication-and-visual-ivr/

Recently, the Accenture AWS Business Group (AABG) assisted a customer in developing a secure and personalized Interactive Voice Response (IVR) contact center experience that receives and processes payments and responds to customer inquiries.

Our solution uses Amazon Connect at its core to help customers efficiently engage with customer service agents. To ensure transactions are completed securely and to prevent fraud, the architecture provides voice authentication using Amazon Connect Voice ID and a visual portal to submit payments. The visual IVR feature allows customers to easily provide the required information online while the IVR is on standby. The solution also provides agents the information they need to effectively and efficiently understand and resolve callers’ inquiries, which helps improve the quality of their service.

Overview of solution

Our IVR is designed using Contact Flows on Amazon Connect and uses the following services:

  • Amazon Lex provides the voice-based intent analysis. Intent analysis is the process of determining the underlying intention behind customer interactions.
  • Amazon Connect integrates with other AWS services using AWS Lambda.
  • Amazon DynamoDB stores customer data.
  • Amazon Pinpoint notifies customers via text and email.
  • AWS Amplify provides the customized agent dashboard and generates the visual IVR portal.

Figure 1 shows how this architecture routes customer calls:

  1. Callers dial the main line to interact with the IVR in Amazon Connect.
  2. Amazon Connect Voice ID sets up a voiceprint for first time callers or performs voice authentication for repeat callers for added security.
  3. Upon successful voice authentication, callers can proceed to IVR self-service functions, such as checking their account balance or making a payment. Amazon Lex handles the voice intent analysis.
  4. When callers make a payment request, they are given the option to be handed off securely to a visual IVR portal to process their payment.
  5. If a caller requests to be connected to an agent, the agent will be presented with the customer’s information and IVR interaction details on their agent dashboard.
Architecture diagram

Figure 1. Architecture diagram

Customer IVR experience

Figure 2 describes how callers navigate through the IVR:

  1. The IVR asks the caller the purpose of the call.
  2. The caller’s answer is sent for voice intent analysis. The IVR also attempts to authenticate the caller’s voice using Amazon Connect Voice ID. If authenticated, the caller is automatically routed to the correct flow based on the analyzed intent.
    • For the “Account Balance” flow, the caller is provided the account balance information.
    • For the “Make a Payment” flow, the caller can use the IVR or a visual IVR portal to process the payment. Upon payment completion, the caller is immediately notified their transaction has completed via SMS or email. Both flows allow the caller to be transferred to an agent. The caller also has the option to be called back when an agent becomes available or choose a specific date and time for the callback.
Customer IVR experience diagram

Figure 2. Customer IVR experience diagram

The intelligent self-service IVR solution includes the following features:

  • The IVR can redirect callers to a payment portal for scenarios like making a payment while the IVR remains on standby.
  • IVR transaction tracking helps agents understand the current status of the caller’s transaction and quickly determines the caller’s situation.
  • Callers have the option to receive a call as soon as the next agent becomes available or they can schedule a time that works for them to receive a callback.
  • IVR activity logging gives agents a detailed summary of the caller’s actions within the IVR.
  • Transaction confirmation notifies callers of successful transactions via SMS or email.

Solution walkthrough

Amazon Connect Voice ID authenticates a caller’s voice as an added level of security. It requires 30 seconds of a caller’s voice to create the initial enrollment voiceprint and 10 seconds to authenticate on subsequent calls. If there is not enough net speech to perform the voice authentication, the IVR asks the caller more questions, such as their first name and last name, until it has collected enough net speech.

The IVR falls back to dual-tone multi-frequency (DTMF) input for the caller’s credentials if the system cannot successfully authenticate the caller’s voice. This can include information like the last four digits of their national identification number or postal code.

In your contact flows, enable voice authentication by adding the “Set security behavior” contact block and specifying the authentication threshold, as shown in Figure 3.

Figure 3. Set security behavior contact block

Figure 4 shows the “Check security status” contact block, which determines whether the user has been successfully authenticated. It also shows the results it may return if the caller is not successfully authenticated, including “Not authenticated,” “Inconclusive,” “Not enrolled,” “Opted out,” and “Error.”

Figure 4. Check security status contact block

Providing a personalized experience for callers

To provide a personalized experience for callers, sample customer data is stored in a DynamoDB table. A Lambda function queries this table when callers call the contact center. The query returns information about the caller, such as their name, so the IVR can offer a customized greeting.
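As a rough illustration, the lookup Lambda function could look like the following Python sketch. The table name (CustomerData), the partition key (phoneNumber), and the returned attributes are assumptions for illustration only; the actual schema may differ.

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("CustomerData")  # hypothetical table name

def lambda_handler(event, context):
    # Amazon Connect passes the caller's number in the contact data
    phone_number = event["Details"]["ContactData"]["CustomerEndpoint"]["Address"]

    # Look up the caller's profile by phone number (key name is assumed)
    item = table.get_item(Key={"phoneNumber": phone_number}).get("Item", {})

    # Values returned here become contact attributes used for the greeting
    return {
        "customerFound": "true" if item else "false",
        "firstName": item.get("firstName", ""),
    }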

Transaction tracking

The table can also be queried to determine whether a customer previously called and attempted to make a payment but didn’t complete it successfully. This feature is called “transaction tracking.” Here’s how it works (a minimal sketch follows the list):

  • When the caller progresses through the “make a payment” flow, a field in the table is updated to reflect their transaction’s status.
  • If the payment is abandoned, the status in the table remains open, and the IVR prompts the caller to pick up where they left off the next time they call.
  • Once they have successfully completed their payment, we update the status in the table to “complete.”
  • When the IVR confirms that the caller’s payment has gone through, they will receive a confirmation via SMS and email. A Lambda function in the contact flow receives the caller’s phone number and email address. Then it distributes the confirmation messages via Amazon Pinpoint.
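The status update itself can be a small DynamoDB write. The following Python sketch assumes the same illustrative CustomerData table keyed by phoneNumber, with a paymentStatus attribute; the attribute names are not from the original solution.

import boto3

table = boto3.resource("dynamodb").Table("CustomerData")  # hypothetical table

def set_payment_status(phone_number, status):
    # status is "open" while a payment is in progress and "complete" once it succeeds
    table.update_item(
        Key={"phoneNumber": phone_number},
        UpdateExpression="SET paymentStatus = :s",
        ExpressionAttributeValues={":s": status},
    )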

If a call is escalated to an agent, the “Check contact attributes” contact block in Figure 5 helps to check the caller’s intent and provide the agent with a customized whisper.

Figure 5. Agent whisper sample contact flow

Making payments via the payment portal

To make a payment, an Amazon Lex bot presents the caller with the option to provide payment details over the phone or through a visual IVR portal.

If they choose to use the visual IVR portal (Figure 6), they can enter their payment details while maintaining an open phone connection with the contact center, in case they need additional assistance. Here’s how it works:

  • When callers choose the payment portal, the contact flow invokes a Lambda function that generates a universally unique identifier (UUID) and provides the caller with a unique PIN (a minimal sketch follows this list).
  • The UUID and PIN are stored in the DynamoDB table along with the caller’s information.
  • Another Lambda function generates a secure link using the UUID. It then uses Amazon Pinpoint to send the link to the caller over text message to their phone number on record. When they open the link, they are prompted to enter their unique PIN.
  • Then, the webpage makes an API call that validates the payment request by comparing the entered PIN to the PIN stored in the DynamoDB table.
  • Once validated, the caller can enter their payment information.
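The post describes separate Lambda functions for generating the session and for sending the link; the following Python sketch combines both steps for brevity. The table, the attribute names, the portal URL, and the Pinpoint project ID are placeholders, not the actual implementation.

import secrets
import uuid
import boto3

table = boto3.resource("dynamodb").Table("CustomerData")  # hypothetical table
pinpoint = boto3.client("pinpoint")

def create_portal_session(phone_number):
    # Generate a unique session identifier and a short PIN for the caller
    session_id = str(uuid.uuid4())
    pin = f"{secrets.randbelow(10000):04d}"

    # Persist both values so the portal API can validate the request later
    table.update_item(
        Key={"phoneNumber": phone_number},
        UpdateExpression="SET sessionId = :u, portalPin = :p",
        ExpressionAttributeValues={":u": session_id, ":p": pin},
    )

    # Text the caller a link built from the UUID
    link = f"https://example.com/pay/{session_id}"
    pinpoint.send_messages(
        ApplicationId="PINPOINT_PROJECT_ID",  # placeholder
        MessageRequest={
            "Addresses": {phone_number: {"ChannelType": "SMS"}},
            "MessageConfiguration": {
                "SMSMessage": {
                    "Body": f"Complete your payment here: {link}",
                    "MessageType": "TRANSACTIONAL",
                }
            },
        },
    )
    return pin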

Figure 6. Visual IVR portal

Figure 7 illustrates the visual IVR portal contact flow:

  • Every 10 seconds, a Lambda function checks the caller’s payment status. It provides the caller the option to escalate to an agent if they have questions.
  • If the caller does not fill out all the information when they hit “Submit Payment,” an IVR prompt will ask them to provide all payment details before proceeding.
  • The IVR phone call stays active until the user’s payment status is updated to “complete” in the DynamoDB table. This generates an IVR prompt stating that their payment was successful.

Figure 7. Visual IVR portal sample contact flow

Generating a chat transcript for agents

When the customer’s call is escalated to an agent, the agent receives a chat transcript. Here’s how it works:

  • After the caller’s intent is captured at the start of the call, the IVR logs activity using a “Set contact attribute” contact block, which captures the $.Lex.SessionAttributes.transcript attribute.
  • This transcript is used in a Lambda function to build a chat interface.
  • This transcript is shown on the agent’s dashboard, along with the Contact Control Panel (CCP) and a few key pieces of caller information.

Figure 8. IVR transcript

The agent’s customized dashboard and the visual IVR portal are deployed and hosted on Amplify. This allows us to seamlessly connect to our code repository and automate deployments after changes are committed. It also removes the need to configure Amazon Simple Storage Service (Amazon S3) buckets, an Amazon CloudFront distribution, and Amazon Route 53 DNS to host our front-end components.

This solution also offers callers the ability to opt in for a callback or to schedule a callback. A “Check queue status” contact block checks the current time in queue, and if it reaches a certain threshold, the IVR will offer a callback. The caller has the option to receive a call as soon as the next agent becomes available or to schedule a time to receive a callback. A Lex bot gathers the date and time slots, which are then passed to a Lambda function that validates the proposed callback option.

Once confirmed, the scheduled callback is placed into a DynamoDB table along with the caller’s phone number. Another Lambda function scans the table every 5 minutes to see if there are any callbacks scheduled within that 5-minute time period. You’ll add an Amazon EventBridge rule that triggers the Lambda function on a schedule expression like cron(0/5 8-17 ? * MON-FRI *), which runs the function every 5 minutes, Monday through Friday, between 8:00 AM and 5:55 PM.
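A scanning Lambda function along these lines could implement the check. The table name, the attribute names, and the use of the StartOutboundVoiceContact API to place the callback are assumptions for illustration.

import datetime
import boto3

table = boto3.resource("dynamodb").Table("ScheduledCallbacks")  # hypothetical table
connect = boto3.client("connect")

def lambda_handler(event, context):
    # Find callbacks scheduled within the next 5 minutes
    now = datetime.datetime.utcnow()
    window_end = now + datetime.timedelta(minutes=5)

    response = table.scan(
        FilterExpression="callbackTime BETWEEN :start AND :end",
        ExpressionAttributeValues={
            ":start": now.isoformat(),
            ":end": window_end.isoformat(),
        },
    )

    for item in response.get("Items", []):
        # Place an outbound call for each scheduled callback (IDs are placeholders)
        connect.start_outbound_voice_contact(
            DestinationPhoneNumber=item["phoneNumber"],
            ContactFlowId="CONTACT_FLOW_ID",
            InstanceId="CONNECT_INSTANCE_ID",
            SourcePhoneNumber="+15555550100",
        )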

Conclusion

This solution helps you increase customer satisfaction by making it easier for callers to complete transactions over the phone. The visual IVR provides an added web-based experience for submitting payments. The solution also improves the quality of service of your customer service agents by making relevant information available to them during the call.

This solution also allows you to scale out the resources to handle increasing demand. Custom features can easily be added using serverless technology, such as Lambda functions or other cloud-native services on AWS.

Ready to get started? The AABG helps customers accelerate their pace of digital innovation and realize incremental business value from cloud adoption and transformation. Connect with our team at [email protected] to learn how to use machine learning in your products and services.

Looking for more architecture content? AWS Architecture Center provides reference architecture diagrams, vetted architecture solutions, Well-Architected best practices, patterns, icons, and more!

Using the circuit breaker pattern with AWS Step Functions and Amazon DynamoDB

Post Syndicated from Eric Johnson original https://aws.amazon.com/blogs/compute/using-the-circuit-breaker-pattern-with-aws-step-functions-and-amazon-dynamodb/

This post is written by Anitha Deenadayalan, Developer Specialist SA, DevAx

Modern applications use microservices as an architectural and organizational approach to software development, where the application comprises small independent services that communicate over well-defined APIs.

When multiple microservices collaborate to handle requests, one or more services may become unavailable or exhibit high latency. Microservices communicate through remote procedure calls, and it is always possible that transient errors in network connectivity could cause failures.

During synchronous execution, this can degrade performance across the entire application as timeouts or failures cascade, resulting in a poor user experience. When complex applications use microservices, an outage in one microservice can lead to application failure. This post shows how to use the circuit breaker design pattern to help with graceful service degradation.

Introducing circuit breakers

Michael Nygard popularized the circuit breaker pattern in his book, Release It. This design pattern can prevent a caller service from retrying another callee service call that has previously caused repeated timeouts or failures. It can also detect when the callee service is functional again.

The fallacies of distributed computing are a set of assertions made by Peter Deutsch and others at Sun Microsystems. They observe that programmers new to distributed applications invariably make false assumptions, such as assuming that the network is reliable, latency is zero, and bandwidth is unlimited. These assumptions result in software applications written with minimal error handling for network errors.

During a network outage, applications may wait indefinitely for a reply and continually consume application resources. Failure to retry the operations when the network becomes available can also lead to application degradation. If API calls to a database or an external service time out due to network issues, repeated calls with no circuit breaker can affect cost and performance.

The circuit breaker pattern

In the circuit breaker pattern, a circuit breaker object routes the calls from the caller to the callee. For example, in an ecommerce application, the order service can call the payment service to collect payments. When there are no failures, the circuit breaker routes all calls from the order service to the payment service:

Circuit breaker with no failures

If the payment service times out, the circuit breaker can detect the timeout and track the failure. If the timeouts exceed a specified threshold, the application opens the circuit:

Circuit breaker with payment service failure

Once the circuit is open, the circuit breaker object does not route the calls to the payment service. It returns an immediate failure when the order service calls the payment service:

Circuit breaker stops routing to payment service

The circuit breaker object periodically tries to see if the calls to the payment service are successful:

Circuit breaker retries payment service

When the call to payment service succeeds, the circuit is closed, and all further calls are routed to the payment service again:

Circuit breaker with working payment service again

Architecture overview

This example uses AWS Step Functions, AWS Lambda, and Amazon DynamoDB to implement the circuit breaker pattern:

Circuit breaker architecture

The Step Functions workflow provides circuit breaker capabilities. When a service wants to call another service, it starts the workflow with the name of the callee service.

The workflow gets the circuit status from the CircuitStatus DynamoDB table, which stores the currently degraded services. If the CircuitStatus table contains a record for the called service, then the circuit is open. The Step Functions workflow returns an immediate failure and exits with a FAIL state.

If the CircuitStatus table does not contain an item for the called service, then the service is operational. The ExecuteLambda step in the state machine definition invokes the Lambda function passed in through a parameter value. If the call succeeds, the Step Functions workflow exits with a SUCCESS state.

The items in the DynamoDB table have the following attributes:

DynamoDB items list

If the service call fails or a timeout occurs, the application retries with exponential backoff for a defined number of times. If the service call fails after the retries, the workflow inserts a record in the CircuitStatus table for the service with the CircuitStatus as OPEN, and the workflow exits with a FAIL state. Subsequent calls to the same service return an immediate failure as long as the circuit is open.

I write the item with an associated time-to-live (TTL) value to ensure eventual connection retries; the item expires at the defined TTL time. DynamoDB’s time to live (TTL) feature allows you to define a per-item timestamp to determine when an item is no longer needed. Shortly after the date and time of the specified timestamp, DynamoDB deletes the item from your table without consuming write throughput.

For example, if you set the TTL value to 60 seconds to check a service status after a minute, DynamoDB deletes the item from the table after 60 seconds. The workflow invokes the service to check for availability when a new call comes in after the item has expired.
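As an aside, the record the workflow writes when it opens the circuit can be expressed in a few lines. The sample project is written in .NET Core, so the following Python sketch is illustrative only; the ServiceName key is an assumption, while the ExpireTimeStamp TTL attribute comes from the walkthrough.

import time
import boto3

table = boto3.resource("dynamodb").Table("CircuitStatus")

def open_circuit(service_name, ttl_seconds=20):
    # Record the degraded service; ExpireTimeStamp is the table's TTL attribute
    table.put_item(
        Item={
            "ServiceName": service_name,     # assumed partition key
            "CircuitStatus": "OPEN",
            "ExpireTimeStamp": int(time.time()) + ttl_seconds,
        }
    )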

Circuit breaker Step Function

Prerequisites

For this walkthrough, you need:

Setting up the environment

Use the .NET Core 3.1 code in the GitHub repository and the AWS SAM template to create the AWS resources for this walkthrough. These include IAM roles, a DynamoDB table, the Step Functions workflow, and Lambda functions.

  1. You need an AWS access key ID and secret access key to configure the AWS Command Line Interface (AWS CLI). To learn more about configuring the AWS CLI, follow these instructions.
  2. Clone the repo:
    git clone https://github.com/aws-samples/circuit-breaker-netcore-blog
  3. After cloning, this is the folder structure:

    Project file structure

Deploy using Serverless Application Model (AWS SAM)

The AWS Serverless Application Model (AWS SAM) CLI provides developers with a local tool for managing serverless applications on AWS.

  1. The sam build command processes your AWS SAM template file, application code, and applicable language-specific files and dependencies. The command copies build artifacts in the format and location expected for subsequent steps in your workflow. Run these commands to process the template file:
    cd circuit-breaker
    sam build
  2. After you build the application, deploy it using the sam deploy command. AWS SAM deploys the application to AWS and displays the output in the terminal.
    sam deploy --guided

    Output from sam deploy

  3. You can also view the output on the AWS CloudFormation console page.

    Output in CloudFormation console

  4. The Step Functions workflow provides the circuit-breaker function. Refer to the circuitbreaker.asl.json file in the statemachine folder for the state machine definition in the Amazon States Language (ASL).

To deploy with the CDK, refer to the GitHub page.

Running the service through the circuit breaker

To provide circuit breaker capabilities to the Lambda microservice, you must send the name or function ARN of the Lambda function to the Step Functions workflow:

{
  "TargetLambda": "<Name or ARN of the Lambda function>"
}
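For example, a calling service could start the workflow with the AWS SDK as in the following Python sketch (the sample project itself is .NET Core, and the state machine ARN shown is a placeholder):

import json
import boto3

sfn = boto3.client("stepfunctions")

response = sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:CircuitBreaker",
    input=json.dumps({"TargetLambda": "my-target-function"}),
)
print(response["executionArn"])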

Successful run

To simulate a successful run, use the provided HelloWorld Lambda function by passing the name or ARN of the function that the stack created. Your input appears as follows:

{
  "TargetLambda": "circuit-breaker-stack-HelloWorldFunction-pP1HNkJGugQz"
}

During the successful run, the Get Circuit Status step checks the circuit status against the DynamoDB table. If the circuit is CLOSED, indicated by zero records for that service in the DynamoDB table, the Execute Lambda step runs the Lambda function and the workflow exits successfully.

Step Function with closed circuit

Service timeout

To simulate a timeout, use the TestCircuitBreaker Lambda function by passing the name or ARN of the function that the stack created. Your input appears as:

{
  "TargetLambda": "circuit-breaker-stack-TestCircuitBreakerFunction-mKeyyJq4BjQ7"
}

Again, the Get Circuit Status step in the workflow checks the circuit status against the DynamoDB table. The circuit is CLOSED during the first pass, so the Execute Lambda step runs the Lambda function, which times out.

The workflow retries based on the retry count and the exponential backoff values, and finally returns a timeout error. It then runs the Update Circuit Status step, where a record is inserted in the DynamoDB table for that service, with a predefined time-to-live value specified by the TTL attribute ExpireTimeStamp.

Step Function with open circuit

Repeat timeout

As long as there is an item for the service in the DynamoDB table, the circuit breaker workflow returns an immediate failure to the calling service. When you re-execute the call to the Step Functions workflow for the TestCircuitBreaker Lambda function within 20 seconds, the circuit is still open. The workflow immediately fails, ensuring the stability of the overall application performance.

Step Function workflow immediately fails until retry

The item in the DynamoDB table expires after 20 seconds, and the workflow calls the service again. This time, the workflow retries with exponential backoff, and if the call succeeds, the workflow exits successfully.

Cleaning up

To avoid incurring additional charges, clean up all the created resources. Run the following command from a terminal window. This command deletes the created resources that are part of this example.

sam delete --stack-name circuit-breaker-stack --region <region name>

Conclusion

This post showed how to implement the circuit breaker pattern using Step Functions, Lambda, DynamoDB, and .NET Core 3.1. This pattern can help prevent system degradation during service failures or timeouts. Step Functions and the TTL feature of DynamoDB make it easier to implement circuit breaker capabilities.

To learn more about developing microservices on AWS, refer to the whitepaper on microservices. To learn more about serverless and AWS SAM, visit the Sessions with SAM series and find more resources at Serverless Land.

Codacy Measures Developer Productivity using AWS Serverless

Post Syndicated from Catarina Gralha original https://aws.amazon.com/blogs/architecture/codacy-measures-developer-productivity-using-aws-serverless/

Codacy is a DevOps insights company based in Lisbon, Portugal. Since its launch in 2012, Codacy has helped software development and engineering teams reduce defects, keep technical debt in check, and ship better code, faster.

Codacy’s latest product, Pulse, is a service that helps understand and improve the performance of software engineering teams. This includes measuring metrics such as deployment frequency, lead time for changes, or mean time to recover. Codacy’s main platform is built on top of AWS products like Amazon Elastic Kubernetes Service (EKS), but they have taken Pulse one step further with AWS serverless.

In this post, we will explore Pulse’s requirements, architecture, and the services it is built on, including AWS Lambda, Amazon API Gateway, and Amazon DynamoDB.

Pulse prototype requirements

Codacy had three clear requirements for their initial Pulse prototype.

  1. The solution must enable the development team to iterate quickly and have minimal time-to-market (TTM) to validate the idea.
  2. The solution must be easily scalable and match the demands of both startups and large enterprises alike. This was of special importance, as Codacy wanted to onboard Pulse with some of their existing customers. At the time, these customers already had massive amounts of information.
  3. The solution must be cost-effective, particularly during the early stages of the product development.

Enter AWS serverless

Codacy could have built Pulse on top of Amazon EC2 instances. However, this brings the undifferentiated heavy lifting of having to provision, secure, and maintain the instances themselves.

AWS serverless technologies are fully managed services that abstract the complexity of infrastructure maintenance away from developers and operators, so they can focus on building products.

Serverless applications also scale elastically and automatically behind the scenes, so customers don’t need to worry about capacity provisioning. Furthermore, these services are highly available by design and span multiple Availability Zones (AZs) within the Region in which they are deployed. This gives customers higher confidence that their systems will continue running even if one Availability Zone is impaired.

AWS serverless technologies are cost-effective too, as they are billed per unit of value, as opposed to billing per provisioned capacity. For example, billing is calculated by the amount of time a function takes to complete or the number of messages published to a queue, rather than how long an EC2 instance runs. Customers only pay when they are getting value out of the services, for example when serving an actual customer request.

Overview of Pulse’s solution architecture

An event is generated when a developer performs a specific action as part of their day-to-day tasks, such as committing code or merging a pull request. These events are the foundational data that Pulse uses to generate insights and are thus processed by multiple Pulse components called modules.

Let’s take a detailed look at a few of them.

Ingestion module

Figure 1. Pulse ingestion module architecture

Figure 1 shows the ingestion module, which is the entry point of events into the Pulse platform and is built on AWS serverless applications as follows:

  • The ingestion API is exposed to customers using Amazon API Gateway. This defines REST, HTTP, and WebSocket APIs with sophisticated functionality such as request validation, rate limiting, and more.
  • The actual business logic of the API is implemented as AWS Lambda functions. Lambda can run custom code in a fully managed way. You only pay for the time that the function takes to run, in 1-millisecond increments. Lambda natively supports multiple languages, but customers can also bring their own runtimes or container images as needed.
  • API requests are authorized with keys, which are stored in Amazon DynamoDB, a key-value NoSQL database that delivers single-digit millisecond latency at any scale. API Gateway invokes a Lambda function that validates the key against those stored in DynamoDB (this is called a Lambda authorizer); a sketch of such an authorizer follows this list.
  • While API Gateway provides a default domain name for each API, Codacy customizes it with Amazon Route 53, a service that registers domain names and configures DNS records. Route 53 offers a service level agreement (SLA) of 100% availability.
  • Events are stored in raw format in Pulse’s data lake, which is built on top of AWS’ object storage service, Amazon Simple Storage Service (S3). With Amazon S3, you can store massive amounts of information at low cost using simple HTTP requests. The data is highly available and durable.
  • Whenever a new event is ingested by the API, a message is published in Pulse’s message bus. (More information later in this post.)
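A Lambda authorizer of this kind could look like the following Python sketch. The ApiKeys table, its key attribute, and the policy scope are assumptions for illustration, not Codacy's actual implementation.

import boto3

table = boto3.resource("dynamodb").Table("ApiKeys")  # hypothetical table

def lambda_handler(event, context):
    # For a token-based authorizer, API Gateway passes the key in authorizationToken
    api_key = event.get("authorizationToken", "")

    response = table.get_item(Key={"apiKey": api_key})
    effect = "Allow" if "Item" in response else "Deny"

    # Return an IAM policy that allows or denies the requested method
    return {
        "principalId": api_key or "anonymous",
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [{
                "Action": "execute-api:Invoke",
                "Effect": effect,
                "Resource": event["methodArn"],
            }],
        },
    }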

Events module

Figure 2. Pulse events module architecture

The events module handles the aggregation and storage of events for actual consumption by customers, see Figure 2:

  • Events are consumed from the message bus and processed with a Lambda function, which stores them in Amazon Redshift.
  • Amazon Redshift is AWS’ managed data warehouse, and enables Pulse’s users to get insights and metrics by running analytical (OLAP) queries with the highest performance.
  • These metrics are exposed to customers via another API (the public API), which is also built on API Gateway.
  • The business logic for this API is implemented using Lambda functions, like the Ingestion module.

Message bus

Figure 3. Message bus architecture

We mentioned earlier that Pulse’s modules exchange messages with each other via the “message bus.” When something occurs at a specific component, a message (event) is published to the bus. At the same time, developers create subscriptions for each module that should receive these messages. This is known as the publisher/subscriber pattern (pub/sub for short), and it is a fundamental piece of event-driven architectures.

With the message bus, you can decouple all modules from each other. In this way, a publisher does not need to worry about how many or who their subscribers are, or what to do if a new one arrives. This is all handled by the message bus.

Pulse’s message bus is built like this, shown in Figure 3:

  • Events are published via Amazon Simple Notification Service (SNS), using a construct called a topic. Topics are the basic unit of message publication and consumption. Components are subscribed to this topic, and you can filter out unwanted messages.
  • Developers configure Amazon SNS subscriptions to have the events sent to a queue, which provides a buffering layer from which workers can process messages. At the same time, queues also ensure that messages are not lost if there is an error. In Pulse’s case, these queues are implemented with Amazon Simple Queue Service (SQS); a rough sketch of this wiring follows the list.
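A rough sketch of this wiring with the AWS SDK for Python is shown below; the topic, queue, and attribute names are illustrative, and the SQS queue policy that permits SNS delivery is omitted for brevity.

import json
import boto3

sns = boto3.client("sns")
sqs = boto3.client("sqs")

# Publisher side: modules publish events to a shared topic
topic_arn = sns.create_topic(Name="pulse-events")["TopicArn"]
sns.publish(
    TopicArn=topic_arn,
    Message=json.dumps({"type": "event.ingested", "id": "123"}),
    MessageAttributes={"eventType": {"DataType": "String", "StringValue": "event.ingested"}},
)

# Subscriber side: each module owns a queue subscribed to the topic,
# with an optional filter policy so it only receives relevant messages
queue_url = sqs.create_queue(QueueName="events-module-queue")["QueueUrl"]
queue_arn = sqs.get_queue_attributes(
    QueueUrl=queue_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

sns.subscribe(
    TopicArn=topic_arn,
    Protocol="sqs",
    Endpoint=queue_arn,
    Attributes={"FilterPolicy": json.dumps({"eventType": ["event.ingested"]})},
)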

Other modules

There are other parts of Pulse architecture that also use AWS serverless. For example, user authentication and sign-up are handled by Amazon Cognito, and Pulse’s frontend application is hosted on Amazon S3. This app is served to customers worldwide with low latency using Amazon CloudFront, a content delivery network.

Summary and next steps

By using AWS serverless, Codacy has been able to reduce the time required to bring Pulse to market by staying focused on developing business logic, rather than managing servers. Furthermore, Codacy is confident they can handle Pulse’s growth, as this serverless architecture will scale automatically according to demand.

How Ribbon Communications Built a Scalable, Resilient Robocall Mitigation Platform

Post Syndicated from Siva Rajamani original https://aws.amazon.com/blogs/architecture/how-ribbon-communications-built-a-scalable-resilient-robocall-mitigation-platform/

Ribbon Communications provides communications software, and IP and optical networking end-to-end solutions that deliver innovation, unparalleled scale, performance, and agility to service providers and enterprises.

Ribbon Communications is helping customers modernize their networks. In today’s data-hungry, 24/7 world, this equates to improved competitive positioning and business outcomes. Companies are migrating from on-premises equipment for telephony services and looking for equivalent as a service (aaS) offerings. But these solutions must still meet the stringent resiliency, availability, performance, and regulatory requirements of a telephony service.

The telephony world is inundated with robocalls. In the United States alone, there were an estimated 50.5 billion robocalls in 2021! In this blog post, we describe the Ribbon Identity Hub – a holistic solution for robocall mitigation. The Ribbon Identity Hub enables services that sign and verify caller identity in compliance with the ATIS standards under the STIR/SHAKEN framework. It also evaluates and scores calls for the probability of nuisance and fraud.

Ribbon Identity Hub is implemented in Amazon Web Services (AWS). It is a fully managed service for telephony service providers and enterprises. The solution is secure, multi-tenant, automatically scaling, and multi-Region, and enables Ribbon to offer managed services to a wide range of telephony customers. Ribbon ensures resiliency and performance with efficient use of resources in the telephony environment, where load ratios between busy and idle time can exceed 10:1.

Ribbon Identity Hub

The Ribbon Identity Hub services are separated into a data (call-transaction) plane, and a control plane.

Data plane (call-transaction)

The call-transaction processing is typically invoked on a per-call-setup basis where availability, resilience, and performance predictability are paramount. Additionally, due to high variability in load, automatic scaling is a prerequisite.

Figure 1. Data plane architecture

Several AWS services come together in a solution that meets all these important objectives:

  1. Amazon Elastic Container Service (ECS): The ECS services are set up for automatic scaling and span two Availability Zones. This provides the horizontal scaling capability, the self-healing capacity, and the resiliency across Availability Zones.
  2. Elastic Load Balancing – Application Load Balancer (ALB): This provides the ability to distribute incoming traffic to ECS services as the target. In addition, it also offers:
    • Seamless integration with the ECS Auto Scaling group. As the group grows, traffic is directed to the new instances only when they are ready. As traffic drops, traffic is drained from the target instances for graceful scale down.
    • Full support for canary and linear upgrades with zero downtime. This maintains full service availability without any changes being visible to the client devices.
  3. Amazon Simple Storage Service (S3): Transaction detail records associated with call-related requests must be securely and reliably maintained for over a year due to billing and other contractual obligations. Amazon S3 simplifies this task with high durability, lifecycle rules, and varied controls for retention.
  4. Amazon DynamoDB: Building resilient services is significantly easier when the compute processing can be stateless. Amazon DynamoDB facilitates such stateless architectures without compromise. Coupled with the availability of the Amazon DynamoDB Accelerator (DAX) caching layer, the solution can meet the extreme low latency operation requirements.
  5. AWS Key Management Service (KMS): Certain tenant configuration is highly confidential and requires elevated protection. Furthermore, the data is part of the state that must be recovered across Regions in disaster recovery scenarios. To meet the security requirements, KMS is used for envelope encryption with per-tenant keys (a minimal envelope-encryption sketch follows this list). Multi-Region KMS keys facilitate the secure availability of this state across Regions without the need for application-level intervention when replicating encrypted data.
  6. Amazon Route 53: For telephony services, any non-transient service failure is unacceptable. In addition to providing a high degree of resiliency through the Multi-AZ architecture, Identity Hub also provides Regional-level high availability through its multi-Region active-active architecture. Route 53 with health checks provides dynamic rerouting of requests within minutes to alternate Regions.
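The envelope-encryption approach mentioned in item 5 generally looks like the following Python sketch: KMS issues a data key under the tenant's KMS key, the data is encrypted locally, and only the encrypted data key is stored alongside it. The key ID, the use of the third-party cryptography library, and the function names are assumptions for illustration.

import base64
import boto3
from cryptography.fernet import Fernet  # third-party library used for local encryption

kms = boto3.client("kms")

def encrypt_tenant_config(plaintext: bytes, tenant_key_id: str):
    # Ask KMS for a data key under the tenant's KMS key (key ID is a placeholder)
    data_key = kms.generate_data_key(KeyId=tenant_key_id, KeySpec="AES_256")

    # Encrypt locally with the plaintext data key; store only the encrypted copy
    fernet = Fernet(base64.urlsafe_b64encode(data_key["Plaintext"]))
    return fernet.encrypt(plaintext), data_key["CiphertextBlob"]

def decrypt_tenant_config(ciphertext: bytes, encrypted_data_key: bytes) -> bytes:
    # KMS decrypts the data key; the payload is then decrypted locally
    plaintext_key = kms.decrypt(CiphertextBlob=encrypted_data_key)["Plaintext"]
    return Fernet(base64.urlsafe_b64encode(plaintext_key)).decrypt(ciphertext)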

Control plane

The Identity Hub control plane is used for customer configuration, status, and monitoring. The API is REST-based. Since this is not used on a call-by-call basis, the requirements around latency and performance are less stringent, though the requirements around high resiliency and dynamic scaling still apply. In this area, ease of implementation and maintainability are key.

Figure 2. Control plane architecture

The following AWS services implement our control plane:

  1. Amazon API Gateway: Coupled with a custom authenticator, the API Gateway handles all the REST API credential verification and routing. Implementation of an API is transformed into implementing handlers for each resource, which is the application core of the API.
  2. AWS Lambda: All the REST API handlers are written as Lambda functions. By using Lambda’s serverless and concurrency features, the application automatically gains self-healing and auto-scaling capabilities. There is also a significant cost advantage, as billing is per millisecond of actual compute time used. This is significant for a control plane where usage is typically sparse and unpredictable.
  3. Amazon DynamoDB: In a stateless architecture with Lambda and API Gateway, all persistent state must be stored in an external database. The database must match the resilience and auto-scaling characteristics of the rest of the control plane. DynamoDB easily fits the requirements here.

The customer portal, in addition to providing the user interface for control plane REST APIs, also delivers a rich set of user-customizable dashboards and reporting capability. Here again, the availability of various AWS services simplifies the implementation, and remains non-intrusive to the central call-transaction processing.

Services used here include:

  1. AWS Glue: Enables extraction and transformation of raw transaction data into a format useful for reporting and dashboarding. AWS Glue is particularly useful here as the data available is regularly expanding, and the use cases for the reporting and dashboarding increase.
  2. Amazon QuickSight: Provides all the business intelligence (BI) functionality, including the ability for Ribbon to offer separate author and reader access to their users, and implements tenant-based access separation.

Conclusion

Ribbon has successfully deployed Identity Hub to enable cloud hosted telephony services to mitigate robocalls. Telephony requirements around resiliency, performance, and capacity were not compromised. Identity Hub offers the benefits of a 24/7 fully managed service requiring no additional customer on-premises equipment.

Choosing AWS services for Identity Hub gives Ribbon the ability to scale and meet future growth. The ability to dynamically scale the service in and out also brings significant cost advantages in telephony applications where busy hour traffic is significantly higher than idle time traffic. In addition, the availability of global AWS services facilitates the deployment of services in customer-local geographic locations to meet performance requirements or local regulatory compliance.

Creating a Multi-Region Application with AWS Services – Part 2, Data and Replication

Post Syndicated from Joe Chapman original https://aws.amazon.com/blogs/architecture/creating-a-multi-region-application-with-aws-services-part-2-data-and-replication/

In Part 1 of this blog series, we looked at how to use AWS compute, networking, and security services to create a foundation for a multi-Region application.

Data is at the center of many applications. In this post, Part 2, we will look at AWS data services that offer native features to help get your data where it needs to be.

In Part 3, we’ll look at AWS application management and monitoring services to help you build, monitor, and maintain a multi-Region application.

Considerations with replicating data

Data replication across the AWS network can happen quickly, but we are still limited by the speed of light. For this reason, data consistency must be considered when building a multi-Region application. Generally speaking, the longer the physical distance, the longer it takes data to get there.

When building a distributed system, consider the consistency, availability, partition tolerance (CAP) theorem. This theorem states that a system can only guarantee two of these three properties, so tradeoffs should be considered.

  • Consistency – all clients always have the same view of data
  • Availability – all clients can always read and write data
  • Partition Tolerance – the system will continue to work despite physical partitions

CAP diagram

Achieving consistency and availability is common for single-Region applications. For example, when an application connects to a single in-Region database. However, this becomes more difficult with multi-Region applications due to the latency added by transferring data over long distances. For this reason, highly distributed systems will typically follow an eventual consistency approach, favoring availability and partition tolerance.

Replicating objects and files

To ensure objects are in multiple Regions, Amazon Simple Storage Service (Amazon S3) can be set up to replicate objects across AWS Regions automatically with one-way or two-way replication. A subset of objects in an S3 bucket can be replicated with S3 replication rules. If low replication lag is critical, S3 Replication Time Control can help meet requirements by replicating 99.99% of objects within 15 minutes, and most within seconds. To monitor the replication status of objects, Amazon S3 events and metrics will track replication and can send an alert if there’s an issue.

Traditionally, each S3 bucket has its own single, Regional endpoint. To simplify connecting to and managing multiple endpoints, S3 Multi-Region Access Points create a single global endpoint spanning multiple S3 buckets in different Regions. When applications connect to this endpoint, it will route over the AWS network using AWS Global Accelerator to the bucket with the lowest latency. Failover routing is also automatically handled if the connectivity or availability to a bucket changes.

For files stored outside of Amazon S3, AWS DataSync simplifies, automates, and accelerates moving file data across Regions and accounts. It supports homogeneous and heterogeneous file migrations across Amazon Elastic File System (Amazon EFS), Amazon FSx, AWS Snowcone, and Amazon S3. It can even be used to sync on-premises files stored on NFS, SMB, HDFS, and self-managed object storage to AWS for hybrid architectures.

File and object replication should be expected to be eventually consistent. The rate at which a given dataset can transfer is a function of the amount of data, I/O bandwidth, network bandwidth, and network conditions.

Copying backups

Scheduled backups can be set up with AWS Backup, which automates backups of your data to meet business requirements. Backup plans can automate copying backups to one or more AWS Regions or accounts. A growing number of services are supported, and this can be especially useful for services that don’t offer real-time replication to another Region such as Amazon Elastic Block Store (Amazon EBS) and Amazon Neptune.

Figure 1 shows how these data transfer services can be combined for each resource.

Figure 1. Storage replication services

Spanning non-relational databases across Regions

Amazon DynamoDB global tables provide multi-Region and multi-writer features to help you build global applications at scale. A DynamoDB global table is the only AWS managed offering that allows for multiple active writers in a multi-Region topology (active-active and multi-Region). This allows for applications to read and write in the Region closest to them, with changes automatically replicated to other Regions.
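In practice, this means an application writes to the table replica in its own Region and lets global tables replicate the change, as in this Python sketch (table and attribute names are illustrative):

import boto3

# Each application instance talks to the replica in its own Region;
# DynamoDB global tables replicate the write to the other Regions.
local_region = "eu-west-1"  # placeholder for the Region the app runs in
table = boto3.resource("dynamodb", region_name=local_region).Table("GlobalOrders")

table.put_item(Item={
    "orderId": "order-123",
    "status": "PLACED",
})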

Global reads and fast recovery for Amazon DocumentDB (with MongoDB compatibility) can be achieved with global clusters. These clusters have a primary Region that handles write operations. Dedicated storage-based replication infrastructure enables low-latency global reads with a lag of typically less than one second.

Keeping in-memory caches warm with the same data across Regions can be critical to maintain application performance. Amazon ElastiCache for Redis offers global datastore to create a fully managed, fast, reliable, and secure cross-Region replica for Redis caches and databases. With global datastore, writes occurring in one Region can be read from up to two other cross-Region replica clusters – eliminating the need to write to multiple caches to keep them warm.

Spanning relational databases across Regions

For applications that require a relational data model, Amazon Aurora global database provides for scaling of database reads across Regions in Aurora PostgreSQL-compatible and MySQL-compatible editions. Dedicated replication infrastructure utilizes physical replication to achieve consistently low replication lag that outperforms the built-in logical replication that database engines offer, as shown in Figure 2.

Figure 2. SysBench OLTP (write-only) stepped every 600 seconds on R4.16xlarge

With Aurora global database, one primary Region is designated as the writer, and secondary Regions are dedicated to reads. Aurora MySQL supports write forwarding, which forwards write requests from a secondary Region to the primary Region to simplify logic in application code. Failover testing can happen by utilizing managed planned failover, which will change the active write cluster to another Region while keeping the replication topology intact. All databases discussed in this post employ eventual consistency when used across Regions, but Aurora PostgreSQL has an option to set the maximum replica lag allowed with managed recovery point objective (managed RPO).

Logical replication, which utilizes a database engine’s built-in replication technology, can be set up for Amazon Relational Database Service (Amazon RDS) for MariaDB, MySQL, Oracle, PostgreSQL, and Aurora databases. A cross-Region read replica will receive these changes from the writer in the primary Region. For applications built on RDS for Microsoft SQL Server, cross-Region replication can be achieved by utilizing the AWS Database Migration Service. Cross-Region replicas allow for quicker local reads and can reduce data loss and recovery times in the case of a disaster by being promoted to a standalone instance.

For situations where a longer RPO and recovery time objective (RTO) are acceptable, backups can be copied across Regions. This is true for all of the relational and non-relational databases mentioned in this post, except for ElastiCache for Redis. Amazon Redshift can also automatically do this for your data warehouse. Backup copy times will vary depending on size and change rates.

A purpose-built database strategy offers many benefits; Figure 3 shows how these services combine into a purpose-built global database architecture.

Figure 3. Purpose-built global database architecture

Summary

Data is at the center of almost every application. In this post, we reviewed AWS services that offer cross-Region data replication to get your data where it needs to be quickly. Whether you need faster local reads, an active-active database, or simply need your data durably stored in a second Region, we have a solution for you. In the 3rd and final post of this series, we’ll cover application management and monitoring features.

Ready to get started? We’ve chosen some AWS Solutions, AWS Blogs, and Well-Architected labs to help you!

Related posts

ICYMI: Serverless Q4 2021

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/icymi-serverless-q4-2021/

Welcome to the 15th edition of the AWS Serverless ICYMI (in case you missed it) quarterly recap. Every quarter, we share all of the most recent product launches, feature enhancements, blog posts, webinars, Twitch live streams, and other interesting things that you might have missed!

Q4 calendar

In case you missed our last ICYMI, check out what happened last quarter here.

AWS Lambda

For developers using Amazon MSK as an event source, Lambda has expanded authentication options to include IAM, in addition to SASL/SCRAM. Lambda also now supports mutual TLS authentication for Amazon MSK and self-managed Kafka as an event source.

Lambda also launched features to make it easier to operate across AWS accounts. You can now invoke Lambda functions from Amazon SQS queues in different accounts. You must grant permission to the Lambda function’s execution role and have SQS grant cross-account permissions. For developers using container packaging for Lambda functions, Lambda also now supports pulling images from Amazon ECR in other AWS accounts. To learn about the permissions required, see this documentation.

The service now supports a partial batch response when using SQS as an event source for both standard and FIFO queues. When messages fail to process, Lambda marks the failed messages and allows reprocessing of only those messages. This helps to improve processing performance and may reduce compute costs.

Lambda launched content filtering options for functions using SQS, DynamoDB, and Kinesis as an event source. You can specify up to five filter criteria that are combined using OR logic. This uses the same content filtering language that’s used in Amazon EventBridge, and can dramatically reduce the number of downstream Lambda invocations.
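For example, an SQS event source mapping could be configured so the function is only invoked for messages whose JSON body matches a pattern; the queue ARN and function name below are placeholders.

import json
import boto3

lambda_client = boto3.client("lambda")

lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:us-east-1:123456789012:orders-queue",  # placeholder
    FunctionName="process-orders",                                     # placeholder
    # Only invoke the function for messages whose JSON body has type == "order"
    FilterCriteria={
        "Filters": [
            {"Pattern": json.dumps({"body": {"type": ["order"]}})}
        ]
    },
)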

Amazon EventBridge

Previously, you could consume Amazon S3 events in EventBridge via CloudTrail. Now, EventBridge receives events from the S3 service directly, making it easier to build serverless workflows triggered by activity in S3. You can use content filtering in rules to identify relevant events and forward these to 18 service targets, including AWS Lambda. You can also use event archive and replay, making it possible to reprocess events in testing, or in the event of an error.

AWS Step Functions

The AWS Batch console has added support for visualizing Step Functions workflows. This makes it easier to combine these services to orchestrate complex workflows over business-critical batch operations, such as data analysis or overnight processes.

Additionally, Amazon Athena has also added console support for visualizing Step Functions workflows. This can help when building distributed data processing pipelines, allowing Step Functions to orchestrate services such as AWS Glue, Amazon S3, or Amazon Kinesis Data Firehose.

Synchronous Express Workflows now supports AWS PrivateLink. This enables you to start these workflows privately from within your virtual private clouds (VPCs) without traversing the internet. To learn more about this feature, read the What’s New post.

Amazon SNS

Amazon SNS announced support for token-based authentication when sending push notifications to Apple devices. This creates a secure, stateless communication channel between SNS and the Apple Push Notification service (APNs).

SNS also launched the new PublishBatch API which enables developers to send up to 10 messages to SNS in a single request. This can reduce cost by up to 90%, since you need fewer API calls to publish the same number of messages to the service.
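For instance, with the AWS SDK for Python the batch call looks like the following sketch (topic ARN is a placeholder):

import boto3

sns = boto3.client("sns")

# Send up to 10 messages to a topic in a single request
sns.publish_batch(
    TopicArn="arn:aws:sns:us-east-1:123456789012:notifications",
    PublishBatchRequestEntries=[
        {"Id": f"msg-{i}", "Message": f"Notification {i}"} for i in range(10)
    ],
)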

Amazon SQS

Amazon SQS released an enhanced DLQ management experience for standard queues. This allows you to redrive messages from a DLQ back to the source queue. This can be configured in the AWS Management Console, as shown here.

Amazon DynamoDB

The NoSQL Workbench for DynamoDB is a tool that simplifies designing, visualizing, and querying DynamoDB tables. The tool now supports importing sample data from CSV files and exporting the results of queries.

DynamoDB announced the new Standard-Infrequent Access table class. Use this for tables that store infrequently accessed data to reduce your costs by up to 60%. You can switch to the new table class without an impact on performance or availability and without changing application code.

AWS Amplify

AWS Amplify now allows developers to override Amplify-generated IAM, Amazon Cognito, and S3 configurations. This makes it easier to customize the generated resources to best meet your application’s requirements. To learn more about the “amplify override auth” command, visit the feature’s documentation.

Similarly, you can also add custom AWS resources using the AWS Cloud Development Kit (CDK) or AWS CloudFormation. In another new feature, developers can then export Amplify backends as CDK stacks and incorporate them into their deployment pipelines.

AWS Amplify UI has launched a new Authenticator component for React, Angular, and Vue.js. Aside from the visual refresh, this provides the easiest way to incorporate social sign-in in your frontend applications with zero-configuration setup. It also includes more customization options and form capabilities.

AWS launched AWS Amplify Studio, which automatically translates designs made in Figma to React UI component code. This enables you to connect UI components visually to backend data, providing a unified interface that can accelerate development.

AWS AppSync

You can now use custom domain names for AWS AppSync GraphQL endpoints. This enables you to specify a custom domain for both GraphQL API and Realtime API, and have AWS Certificate Manager provide and manage the certificate.

To learn more, read the feature’s documentation page.

News from other services

Serverless blog posts

October

November

December

AWS re:Invent breakouts

AWS re:Invent was held in Las Vegas from November 29 to December 3, 2021. The Serverless DA team presented numerous breakouts, workshops and chalk talks. Rewatch all our breakout content:

Serverlesspresso

We also launched an interactive serverless application at re:Invent to help customers get caffeinated!

Serverlesspresso is a contactless, serverless order management system for a physical coffee bar. The architecture comprises several serverless apps that support an ordering process from a customer’s smartphone to a real espresso bar. The customer can check the virtual line, place an order, and receive a notification when their drink is ready for pickup.

Serverlesspresso booth

You can learn more about the architecture and download the code repo at https://serverlessland.com/reinvent2021/serverlesspresso. You can also see a video of the exhibit.

Videos

Serverless Land videos

Serverless Office Hours – Tues 10 AM PT

Weekly live virtual office hours. In each session we talk about a specific topic or technology related to serverless and open it up to helping you with your real serverless challenges and issues. Ask us anything you want about serverless technologies and applications.

YouTube: youtube.com/serverlessland
Twitch: twitch.tv/aws

October

November

December

Still looking for more?

The Serverless landing page has more information. The Lambda resources page contains case studies, webinars, whitepapers, customer stories, reference architectures, and even more Getting Started tutorials.

You can also follow the Serverless Developer Advocacy team on Twitter to see the latest news, follow conversations, and interact with the team.