Tag Archives: AWS Lambda

Monitoring Apache Iceberg metadata layer using AWS Lambda, AWS Glue, and AWS CloudWatch

2024-07-29 Michael Greenshtein

Post Syndicated from Michael Greenshtein original https://aws.amazon.com/blogs/big-data/monitoring-apache-iceberg-metadata-layer-using-aws-lambda-aws-glue-and-aws-cloudwatch/

In the era of big data, data lakes have emerged as a cornerstone for storing vast amounts of raw data in its native format. They support structured, semi-structured, and unstructured data, offering a flexible and scalable environment for data ingestion from multiple sources. Data lakes provide a unified repository for organizations to store and use large volumes of data. This enables more informed decision-making and innovative insights through various analytics and machine learning applications.

Despite their advantages, traditional data lake architectures often grapple with challenges such as understanding deviations from the most optimal state of the table over time, identifying issues in data pipelines, and monitoring a large number of tables. As data volumes grow, the complexity of maintaining operational excellence also increases. Monitoring and tracking issues in the data management lifecycle are essential for achieving operational excellence in data lakes.

This is where Apache Iceberg comes into play, offering a new approach to data lake management. Apache Iceberg is an open table format designed specifically to improve the performance, reliability, and scalability of data lakes. It addresses many of the shortcomings of traditional data lakes by providing features such as ACID transactions, schema evolution, row-level updates and deletes, and time travel.

In this blog post, we’ll discuss how the metadata layer of Apache Iceberg can be used to make data lakes more efficient. You will learn about an open-source solution that can collect important metrics from the Iceberg metadata layer. Based on collected metrics, we will provide recommendations on how to improve the efficiency of Iceberg tables. Additionally, you will learn how to use the Amazon CloudWatch anomaly detection feature to detect ingestion issues.

Deep dive into Iceberg’s Metadata layer

Before diving into a solution, let’s understand how the Apache Iceberg metadata layer works. The Iceberg metadata layer provides an open specification instructing integrated big data engines such as Spark or Trino how to run read and write operations and how to resolve concurrency issues. It’s crucial for maintaining interoperability between different engines. It stores detailed information about tables such as schema, partitioning, and file organization in versioned JSON and Avro files. This ensures that each change is tracked and reversible, enhancing data governance and auditability.

Apache Iceberg metadata layer architecture diagram

History and versioning: Iceberg’s versioning feature captures every change in table metadata as immutable snapshots, facilitating data integrity, historical views, and rollbacks.

File organization and snapshot management: Metadata closely manages data files, detailing file paths, formats, and partitions, supporting multiple file formats like Parquet, Avro, and ORC. This organization helps with efficient data retrieval through predicate pushdown, minimizing unnecessary data scans. Snapshot management allows concurrent data operations without interference, maintaining data consistency across transactions.

In addition to its core metadata management capabilities, Apache Iceberg also provides specialized metadata tables—snapshots, files, and partitions—that provide deeper insights and control over data management processes. These tables are dynamically generated and provide a live view of the metadata for query purposes, facilitating advanced data operations:

Snapshots table: This table lists all snapshots of a table, including snapshot IDs, timestamps, and operation types. It enables users to track changes over time and manage version history effectively.
Files table: The files table provides detailed information on each file in the table, including file paths, sizes, and partition values. It is essential for optimizing read and write performance.
Partitions table: This table shows how data is partitioned across different files and provides statistics for each partition, which is crucial for understanding and optimizing data distribution.

Metadata tables enhance Iceberg’s functionality by making metadata queries straightforward and efficient. Using these tables, data teams can gain precise control over data snapshots, file management, and partition strategies, further improving data system reliability and performance.

Before you get started

The next section describes a packaged open source solution using Apache Iceberg’s metadata layer and AWS services to enhance monitoring across your Iceberg tables.

Before we deep dive into the suggested solution, let’s mention Iceberg MetricsReporter, which is a native way to emit metrics for Apache Iceberg. It supports two types of reports: one for commits and one for scans. The default output is log based. It produces log files as a result of commit or scan operations. To submit metrics to CloudWatch or any other monitoring tool, users need to create and configure a custom MetricsReporter implementation. MetricsReporter is supported in Apache Iceberg v1.1.0 and later versions, and customers who want to use it must enable it through Spark configuration on their existing pipelines.

The following solution is deployed independently and doesn’t require any configuration changes to existing data pipelines. It can immediately start monitoring all the tables within the AWS account and AWS Region where it’s deployed. Compared to MetricsReporter, metrics from this solution arrive with an additional latency of 20 to 80 seconds, but the solution integrates seamlessly without custom configurations or changes to current workflows.

Solution overview

This solution is specifically designed for customers who run Apache Iceberg on Amazon Simple Storage Service (Amazon S3) and use AWS Glue as their data catalog.

Key features

This solution uses an AWS Lambda deployment package to collect metrics from Apache Iceberg tables. The metrics are then submitted to CloudWatch where you can create metrics visualizations to help recognize trends and anomalies over time.

The solution is designed to be lightweight, focusing on collecting metrics directly from the Iceberg metadata layer without scanning the actual data layer. This approach significantly reduces the compute capacity required, making it efficient and cost-effective. Key features of the solution include:

Time-series metrics collection: The solution monitors Iceberg tables continuously to identify trends and detect anomalies in data ingestion rates, partition skewness, and more.
Event-driven architecture: The solution uses Amazon EventBridge to launch a Lambda function when the state of an AWS Glue Data Catalog table changes. This ensures real-time metrics collection every time a transaction is committed to an Iceberg table.
Efficient data retrieval: The solution uses minimal compute resources by relying on AWS Glue interactive sessions and the pyiceberg library to directly access Iceberg metadata tables such as snapshots, partitions, and files.
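The event-driven trigger described above can be sketched as an EventBridge event pattern. This is a minimal illustration and not the solution's actual rule: the `source` and `detail-type` values follow the AWS Glue Data Catalog table state change event, while the `matchesPattern` helper and the sample database and table names are hypothetical.

```typescript
// Sketch only: the solution's real rule may filter on more fields.
type EventPattern = { source: string[]; "detail-type": string[] };

interface GlueTableEvent {
  source: string;
  "detail-type": string;
  detail: { databaseName: string; tableName: string; typeOfChange: string };
}

// Pattern that selects Glue Data Catalog table changes, such as the
// table update that an Iceberg commit performs.
const glueTableChangePattern: EventPattern = {
  source: ["aws.glue"],
  "detail-type": ["Glue Data Catalog Table State Change"],
};

// Hypothetical helper that mimics exact-match filtering on the two fields.
function matchesPattern(event: GlueTableEvent, pattern: EventPattern): boolean {
  return (
    pattern.source.indexOf(event.source) !== -1 &&
    pattern["detail-type"].indexOf(event["detail-type"]) !== -1
  );
}

// Hypothetical event (database and table names are made up).
const sampleEvent: GlueTableEvent = {
  source: "aws.glue",
  "detail-type": "Glue Data Catalog Table State Change",
  detail: { databaseName: "sales_db", tableName: "orders", typeOfChange: "UpdateTable" },
};

console.log(matchesPattern(sampleEvent, glueTableChangePattern));
```

Matching on the catalog event rather than polling is what lets metrics collection run once per committed transaction.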

Metrics tracked

As of the blog release date, the solution collects over 25 metrics. These metrics are categorized into several groups:

Snapshot metrics: Include total and changes in data files, delete files, records added or removed, and size changes.
Partition and file metrics: Aggregated and per-partition metrics such as average, maximum, and minimum record counts and file sizes, which help you understand data distribution and optimize storage.

To see the complete list of metrics, go to the GitHub repository.
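To make the snapshot metrics more concrete, the following sketch converts an Iceberg snapshot summary (a map of string values) into metric data points. The summary keys shown are standard Iceberg summary fields. The naming rule (prefix `snapshot.` and replace hyphens with underscores) is an assumption chosen to match the metric names used in this post, and the solution's own mapping may differ.

```typescript
// One data point, shaped like a CloudWatch metric datum.
interface MetricDatum {
  MetricName: string;
  Value: number;
  Unit: "Count" | "Bytes";
}

// Convert selected snapshot summary entries into metric data points.
// Keys that are missing or not numeric are skipped.
function summaryToMetrics(summary: { [key: string]: string }, keys: string[]): MetricDatum[] {
  const metrics: MetricDatum[] = [];
  for (const key of keys) {
    const raw = summary[key];
    if (raw === undefined) continue;
    const value = Number(raw);
    if (isNaN(value)) continue;
    metrics.push({
      // Assumed naming rule: "added-records" becomes "snapshot.added_records"
      MetricName: "snapshot." + key.replace(/-/g, "_"),
      Value: value,
      Unit: key.slice(-5) === "-size" ? "Bytes" : "Count",
    });
  }
  return metrics;
}

// Hypothetical summary of an append commit.
const sampleSummary: { [key: string]: string } = {
  operation: "append",
  "added-data-files": "4",
  "added-records": "1000",
  "added-files-size": "52428800",
};

const sampleMetrics = summaryToMetrics(sampleSummary, [
  "added-data-files",
  "added-records",
  "deleted-records",
  "added-files-size",
]);
console.log(sampleMetrics);
```

Because the values come from the snapshot summary, no data files are read, which is consistent with the lightweight design described above.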

Visualizing data with CloudWatch dashboards

The solution also provides a sample CloudWatch dashboard to visualize the collected metrics. Metrics visualization is important for real-time monitoring and detecting operational issues. The provided helper script simplifies the setup and deployment of the dashboard.

Amazon CloudWatch dashboard

You can go to the GitHub repository to learn more about how to deploy the solution in your AWS account.

What are the vital metrics for Apache Iceberg tables?

This section discusses specific metrics from Iceberg’s metadata and explains why they’re important for monitoring data quality and system performance. The metrics are broken down into three parts: insight, challenge, and action. This provides a clear path for practical application. This section covers only a subset of the metrics that the solution can collect. For a complete list, see the solution GitHub page.

1. snapshot.added_data_files, snapshot.added_records

Metric insight: The number of data files and number of records added to the table during the last transaction. The ingestion rate measures the speed at which new data is added to the data lake. This metric helps identify bottlenecks or inefficiencies in data pipelines, guiding capacity planning and scalability decisions.
Challenge: A sudden drop in the ingestion rate can indicate failures in data ingestion pipelines, source system outages, configuration errors or traffic spikes.
Action: Teams need to establish real-time monitoring and alert systems to detect drops in ingestion rates promptly, allowing quick investigations and resolutions.

2. files.avg_record_count, files.avg_file_size

Metric insight: These metrics provide insights into the distribution and storage efficiency of the table. Small file sizes might suggest excessive fragmentation.
Challenge: Excessively small file sizes can indicate inefficient data storage leading to increased read operations and higher I/O costs.
Action: Implementing regular data compaction processes helps consolidate small files, which optimizes storage and improves read performance. The AWS Glue Data Catalog offers automatic compaction of Apache Iceberg tables. To learn more about compacting Apache Iceberg tables, see Enable compaction in Working with tables on the AWS Glue console.
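As a rough illustration of how these file metrics could be derived, the sketch below averages the record counts and file sizes reported by the Iceberg files metadata table (its `record_count` and `file_size_in_bytes` columns). The `suggestsCompaction` helper and the 128 MiB target are assumptions for the example, not values taken from the solution.

```typescript
// One row from the files metadata table, reduced to the two fields we need.
interface DataFileEntry {
  recordCount: number;
  fileSizeInBytes: number;
}

interface FileMetrics {
  avgRecordCount: number;
  avgFileSize: number;
  minFileSize: number;
  maxFileSize: number;
}

// Aggregate per-file values into table-level metrics.
function computeFileMetrics(files: DataFileEntry[]): FileMetrics | undefined {
  if (files.length === 0) return undefined;
  let totalRecords = 0;
  let totalBytes = 0;
  let minFileSize = Infinity;
  let maxFileSize = 0;
  for (const file of files) {
    totalRecords += file.recordCount;
    totalBytes += file.fileSizeInBytes;
    minFileSize = Math.min(minFileSize, file.fileSizeInBytes);
    maxFileSize = Math.max(maxFileSize, file.fileSizeInBytes);
  }
  return {
    avgRecordCount: totalRecords / files.length,
    avgFileSize: totalBytes / files.length,
    minFileSize,
    maxFileSize,
  };
}

// Hypothetical rule: flag the table when the average file is below a target size.
function suggestsCompaction(metrics: FileMetrics, targetFileSizeBytes: number): boolean {
  return metrics.avgFileSize < targetFileSizeBytes;
}

const MIB = 1024 * 1024;
const sampleFiles: DataFileEntry[] = [
  { recordCount: 100, fileSizeInBytes: 10 * MIB },
  { recordCount: 200, fileSizeInBytes: 20 * MIB },
  { recordCount: 300, fileSizeInBytes: 30 * MIB },
];
const sampleFileMetrics = computeFileMetrics(sampleFiles);
if (sampleFileMetrics) {
  console.log(sampleFileMetrics, suggestsCompaction(sampleFileMetrics, 128 * MIB));
}
```

The right target size depends on the engine and workload, so treat the threshold as a tuning parameter.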

3. partitions.skew_record_count, partitions.skew_file_count

Metric insight: The metrics indicate the asymmetry of the data distribution across the available table partitions. A skewness value of zero, or very close to zero, suggests that the data is balanced. Positive or negative skewness values might indicate a problem.
Challenge: Imbalances in data distribution across partitions can lead to inefficiencies and slow query responses.
Action: Regularly analyze data distribution metrics to adjust partitioning configuration. Apache Iceberg allows you to transform partitions dynamically, which enables optimization of table partitioning as query patterns or data volumes change, without impacting your existing data.
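To show what a skewness value measures, here is a minimal sketch that computes the moment-based (Fisher-Pearson) skewness coefficient over per-partition record counts. This estimator is an assumption for illustration. The solution may compute skewness with a different formula or library.

```typescript
// Moment-based skewness: g1 = m3 / m2^(3/2), where m2 and m3 are the
// second and third central moments of the values.
function skewness(values: number[]): number {
  const n = values.length;
  if (n === 0) return 0;
  const mean = values.reduce((sum, v) => sum + v, 0) / n;
  let m2 = 0;
  let m3 = 0;
  for (const v of values) {
    const d = v - mean;
    m2 += d * d;
    m3 += d * d * d;
  }
  m2 /= n;
  m3 /= n;
  // All partitions are identical: treat the distribution as balanced.
  if (m2 === 0) return 0;
  return m3 / Math.pow(m2, 1.5);
}

// Hypothetical record counts per partition.
const balancedPartitions = [10, 20, 30];
const hotPartition = [1, 1, 1, 1, 100];

console.log(skewness(balancedPartitions), skewness(hotPartition));
```

A symmetric distribution yields zero, while a single oversized partition pushes the value positive, which is the kind of imbalance the metric is meant to surface.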

4. snapshot.deleted_records, snapshot.total_delete_files, snapshot.added_position_deletes

Metric insight: Deletion metrics in Apache Iceberg provide important information on the volume and nature of data deletions within a table. These metrics help track how often data is removed or updated, which is essential for managing data lifecycle and compliance with data retention policies.
Challenge: High values in these metrics can indicate excessive deletions or updates, which might lead to fragmentation and decreased query performance.
Action: To address these challenges, run compaction periodically to ensure deleted rows do not persist in new files. Regularly review and adjust data retention policies and consider expiring old snapshots to keep only the necessary data files. You can run a compaction operation on specific partitions using the Amazon Athena OPTIMIZE statement.
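As an illustration of partition-scoped compaction, the sketch below builds an Athena `OPTIMIZE ... REWRITE DATA USING BIN_PACK` statement with an optional `WHERE` predicate. The helper name and the database, table, and column names are hypothetical, and submitting the statement to Athena is left out.

```typescript
// Build an Athena OPTIMIZE statement for an Iceberg table.
// The optional predicate limits compaction to matching partitions.
function buildOptimizeStatement(
  database: string,
  table: string,
  partitionPredicate?: string
): string {
  // Accept only simple identifiers so the names cannot alter the statement.
  const identifier = /^[A-Za-z0-9_]+$/;
  if (!identifier.test(database) || !identifier.test(table)) {
    throw new Error("Unexpected characters in database or table name");
  }
  const base = "OPTIMIZE " + database + "." + table + " REWRITE DATA USING BIN_PACK";
  // The predicate is passed through as written; callers must supply trusted input.
  return partitionPredicate ? base + " WHERE " + partitionPredicate : base;
}

// Hypothetical table partitioned by order_date.
const optimizeOnePartition = buildOptimizeStatement(
  "sales_db",
  "orders",
  "order_date = DATE '2024-07-01'"
);
console.log(optimizeOnePartition);
```

Restricting the rewrite to recently modified partitions keeps the compaction job small when deletes are concentrated in a few partitions.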

Effective monitoring is essential for making informed decisions about necessary maintenance actions for Apache Iceberg tables. Determining the right timing for these actions is crucial. Implementing timely preventative maintenance ensures high operational efficiency of the data lake and helps to address potential issues before they become significant problems.

Using Amazon CloudWatch for anomaly detection and alerts

This section assumes that you have completed the solution setup and collected operational metrics from your Apache Iceberg tables into Amazon CloudWatch.

Now you can start setting up some alerts and detect anomalies.

This section walks you through setting up anomaly detection and configuring alerts in CloudWatch to monitor the snapshot.added_records metric, which indicates the ingestion rate of data written into an Apache Iceberg table.

Set up anomaly detection

CloudWatch anomaly detection applies machine learning algorithms to continuously analyze system metrics, determine normal baselines, and identify items that are outside of the established patterns. Here is how you configure it:

Amazon CloudWatch anomaly detection screenshot

Select Metrics: In the AWS Management Console for CloudWatch, go to the Metrics tab and search for and select snapshot.added_records.
Create anomaly detection models: Choose the Graphed metrics tab and click the Pulse icon to enable anomaly detection.
Set Sensitivity: The second parameter of ANOMALY_DETECTION_BAND(m1, 5) adjusts the sensitivity of the anomaly detection by setting the width of the band in standard deviations. A larger value produces a wider band and fewer alerts. The goal is to balance detecting real issues and reducing false positives.

Configure alerts

After the anomaly detection model is set up, set up an alert to notify operations teams about potential issues:

Create alarm: Choose the bell icon under Actions on the same Graphed metrics tab.
Alarm settings: Set the alarm to notify the operations team when the snapshot.added_records metric is outside the anomaly detection band for two consecutive periods. This helps reduce the risk of false alerts.
Alarm actions: Configure CloudWatch to send an alarm email to the operations team. In addition to sending emails, CloudWatch alarm actions can automatically launch remediation processes, such as scaling operations or initiating data compaction.
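The console steps above can also be expressed as a request to the CloudWatch PutMetricAlarm API. The sketch below only builds the request object and does not call AWS. The namespace, dimension name, alarm name, and SNS topic ARN are hypothetical placeholders, so check the solution's repository for the values it actually publishes.

```typescript
// Subset of the PutMetricAlarm request shape used for anomaly detection alarms.
interface MetricDataQuery {
  Id: string;
  ReturnData: boolean;
  Expression?: string;
  Label?: string;
  MetricStat?: {
    Metric: {
      Namespace: string;
      MetricName: string;
      Dimensions: { Name: string; Value: string }[];
    };
    Period: number;
    Stat: string;
  };
}

interface AnomalyAlarmRequest {
  AlarmName: string;
  ComparisonOperator: string;
  EvaluationPeriods: number;
  DatapointsToAlarm: number;
  ThresholdMetricId: string;
  AlarmActions: string[];
  Metrics: MetricDataQuery[];
}

function buildAnomalyAlarm(
  tableName: string,
  bandWidth: number,
  snsTopicArn: string
): AnomalyAlarmRequest {
  return {
    AlarmName: "iceberg-" + tableName + "-added-records-anomaly", // hypothetical name
    // Alarm when the metric leaves the band in either direction.
    ComparisonOperator: "LessThanLowerOrGreaterThanUpperThreshold",
    // Two consecutive breaching periods, as described in the alarm settings.
    EvaluationPeriods: 2,
    DatapointsToAlarm: 2,
    ThresholdMetricId: "ad1",
    AlarmActions: [snsTopicArn],
    Metrics: [
      {
        Id: "m1",
        ReturnData: true,
        MetricStat: {
          Metric: {
            Namespace: "IcebergMonitoring", // assumed namespace
            MetricName: "snapshot.added_records",
            Dimensions: [{ Name: "table_name", Value: tableName }], // assumed dimension
          },
          Period: 300,
          Stat: "Sum",
        },
      },
      {
        Id: "ad1",
        ReturnData: true,
        Expression: "ANOMALY_DETECTION_BAND(m1, " + bandWidth + ")",
        Label: "Expected range for added records",
      },
    ],
  };
}

const sampleAlarm = buildAnomalyAlarm(
  "orders",
  5,
  "arn:aws:sns:us-east-1:123456789012:iceberg-alerts" // placeholder ARN
);
console.log(JSON.stringify(sampleAlarm, null, 2));
```

Keeping the band width as a parameter makes it easy to tune sensitivity per table without changing the rest of the alarm definition.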

Best practices

Regularly review and adjust models: As data patterns evolve, periodically review and adjust anomaly detection models and alarm settings to remain effective.
Comprehensive coverage: Ensure that all critical aspects of the data pipeline are monitored, not just a few metrics.
Documentation and communication: Maintain clear documentation of what each metric and alarm represents and ensure that your operations team understands the monitoring setup and response procedures. Set up the alerting mechanisms to send notifications through appropriate channels such as email, corporate messenger, or telephone to ensure your operations team stays informed and can quickly address the issues.
Create playbooks and automate remediation tasks: Establish detailed playbooks that describe step-by-step responses for common scenarios identified by alerts. Additionally, automate remediation tasks where possible to speed up response times and reduce the manual burden on teams. This ensures consistent and effective responses to all incidents.

CloudWatch anomaly detection and alerting features help organizations proactively manage their data lakes. This ensures data integrity, reduces downtime, and maintains high data quality. As a result, it enhances operational efficiency and supports robust data governance.

Conclusion

In this blog post, we explored Apache Iceberg’s transformative impact on data lake management. Apache Iceberg addresses the challenges of big data with features like ACID transactions, schema evolution, and snapshot isolation, enhancing data reliability, query performance, and scalability.

We delved into Iceberg’s metadata layer and related metadata tables such as snapshots, files, and partitions that allow easy access to crucial information about the current state of the table. These metadata tables facilitate the extraction of performance-related data, enabling teams to monitor and optimize the data lake’s efficiency.

Finally, we showed you a practical solution for monitoring Apache Iceberg tables using Lambda, AWS Glue, and CloudWatch. This solution uses Iceberg’s metadata layer and CloudWatch monitoring capabilities to provide a proactive operational framework. This framework detects trends and anomalies, ensuring robust data lake management.

About the Author

Michael Greenshtein is a Senior Analytics Specialist at Amazon Web Services. He is an experienced data professional with over 8 years in cloud computing and data management. Michael is passionate about open-source technology and Apache Iceberg.

AWS Weekly Roundup: Global AWS Heroes Summit, AWS Lambda, Amazon Redshift, and more (July 22, 2024)

2024-07-22 Donnie Prakoso

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-global-aws-heroes-summit-aws-lambda-amazon-redshift-and-more-july-22-2024/

Last week, AWS Heroes from around the world gathered to celebrate the 10th anniversary of the AWS Heroes program at Global AWS Heroes Summit. This program recognizes a select group of AWS experts worldwide who go above and beyond in sharing their knowledge and making an impact within developer communities.

Matt Garman, CEO of AWS and a long-time supporter of developer communities, made a special appearance for a Q&A session with the Heroes to listen to their feedback and respond to their questions.

Here’s an epic photo from the AWS Heroes Summit:

As Matt mentioned in his LinkedIn post, “The developer community has been core to everything we have done since the beginning of AWS.” Thank you, Heroes, for all you do. Wishing you all a safe flight home.

Last week’s launches
Here are some launches that caught my attention last week:

Announcing the July 2024 updates to Amazon Corretto — The latest updates for the Corretto distribution of OpenJDK are now available. This includes security and critical updates for the Long-Term Supported (LTS) and Feature (FR) versions.

New open-source Advanced MySQL ODBC Driver now available for Amazon Aurora and RDS — The new AWS ODBC Driver for MySQL provides faster switchover and failover times, and authentication support for AWS Secrets Manager and AWS Identity and Access Management (IAM), making it a more efficient and secure option for connecting to Amazon RDS and Amazon Aurora MySQL-Compatible Edition databases.

Productionize Fine-tuned Foundation Models from SageMaker Canvas — Amazon SageMaker Canvas now allows you to deploy fine-tuned Foundation Models (FMs) to SageMaker real-time inference endpoints, making it easier to integrate generative AI capabilities into your applications outside the SageMaker Canvas workspace.

AWS Lambda now supports SnapStart for Java functions that use the ARM64 architecture — Lambda SnapStart for Java functions on ARM64 architecture delivers up to 10x faster function startup performance and up to 34% better price performance compared to x86, enabling the building of highly responsive and scalable Java applications using AWS Lambda.

Amazon QuickSight improves controls performance — Amazon QuickSight has improved the performance of controls, allowing readers to interact with them immediately without having to wait for all relevant controls to reload. This enhancement reduces the loading time experienced by readers.

Amazon OpenSearch Serverless levels up speed and efficiency with smart caching — The new smart caching feature for indexing in Amazon OpenSearch Serverless automatically fetches and manages data, leading to faster data retrieval, efficient storage usage, and cost savings.

Amazon Redshift Serverless with lower base capacity available in the Europe (London) Region — Amazon Redshift Serverless now allows you to start with a lower data warehouse base capacity of 8 Redshift Processing Units (RPUs) in the Europe (London) region, providing more flexibility and cost-effective options for small to large workloads.

AWS Lambda now supports Amazon MQ for ActiveMQ and RabbitMQ in five new regions — AWS Lambda now supports Amazon MQ for ActiveMQ and RabbitMQ in five new regions, enabling you to build serverless applications with Lambda functions that are invoked based on messages posted to Amazon MQ message brokers.

From community.aws
Here are my top five personal favorite posts from community.aws:

Upcoming AWS events
Check your calendars and sign up for upcoming AWS events:

AWS Summits — Join free online and in-person events that bring the cloud computing community together to connect, collaborate, and learn about AWS. To learn more about future AWS Summit events, visit the AWS Summit page. Register in your nearest city: AWS Summit Taipei (July 23–24), AWS Summit Mexico City (Aug. 7), and AWS Summit Sao Paulo (Aug. 15).

AWS Community Days — Join community-led conferences that feature technical discussions, workshops, and hands-on labs led by expert AWS users and industry leaders from around the world. Upcoming AWS Community Days are in Aotearoa (Aug. 15), Nigeria (Aug. 24), New York (Aug. 28), and Belfast (Sept. 6).

You can browse all upcoming in-person and virtual events.

That’s all for this week. Check back next Monday for another Weekly Roundup!

— Donnie

This post is part of our Weekly Roundup series. Check back each week for a quick roundup of interesting news and announcements from AWS!

AWS Weekly Roundup: Amazon S3 Access Grants, AWS Lambda, European Sovereign Cloud Region, and more (July 8, 2024)

2024-07-08 Sébastien Stormacq

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-amazon-s3-access-grants-aws-lambda-european-sovereign-cloud-region-and-more-july-8-2024/

I counted only 21 AWS news items since last Monday, most of them being Regional expansions of existing services and capabilities. I hope you enjoyed a relatively quiet week, because this one will be busier.

This week, we’re welcoming our customers and partners at the Jacob Javits Convention Center for the AWS Summit New York on Wednesday, July 10. I can tell you there is a stream of announcements coming, if I judge by the number of AWS News Blog posts ready to be published.

I am writing these lines just before packing my bag to attend the AWS Community Day in Douala, Cameroon next Saturday. I can’t wait to meet our customers and partners, students, and the whole AWS community there.

But for now, let’s look at last week’s new announcements.

Last week’s launches
Here are the launches that got my attention.

Amazon Simple Storage Service (Amazon S3) Access Grants now integrate with Amazon SageMaker and open source Python frameworks – Amazon S3 Access Grants maps identities in directories such as Active Directory, or AWS Identity and Access Management (IAM) principals, to datasets in S3. The integration with Amazon SageMaker Studio for machine learning (ML) helps you map identities to your ML datasets in S3. The integration with the AWS SDK for Python (Boto3) plugin replaces any custom code required to manage data permissions, so you can use S3 Access Grants in open source Python frameworks such as Django, TensorFlow, NumPy, Pandas, and more.

AWS Lambda introduces new controls to make it easier to search, filter, and aggregate Lambda function logs – You can now capture your Lambda logs in JSON structured format without bringing your own logging libraries. You can also control the log level (for example, ERROR, DEBUG, or INFO) of your Lambda logs without making any code changes. Lastly, you can choose the Amazon CloudWatch log group to which Lambda sends your logs.

Amazon DataZone introduces fine-grained access control – Amazon DataZone has introduced fine-grained access control, providing data owners granular control over their data at row and column levels. You use Amazon DataZone to catalog, discover, analyze, share, and govern data at scale across organizational boundaries with governance and access controls. Data owners can now restrict access to specific records of data instead of granting access to an entire dataset.

AWS Direct Connect offers native 400 Gbps dedicated connections at select locations – AWS Direct Connect provides private, high-bandwidth connectivity between AWS and your data center, office, or colocation facility. Native 400 Gbps connections provide higher bandwidth without the operational overhead of managing multiple 100 Gbps connections in a link aggregation group. The increased capacity delivered by 400 Gbps connections is particularly beneficial to applications that transfer large-scale datasets, such as for ML and large language model (LLM) training or advanced driver assistance systems for autonomous vehicles.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS news
Here are some additional news items that you might find interesting:

The list of services available at launch in the upcoming AWS European Sovereign Cloud Region is available – We shared the list of AWS services that will be initially available at launch in the new AWS European Sovereign Cloud Region. The list has no surprises. Services for security, networking, storage, computing, containers, artificial intelligence (AI), and serverless will be available at launch. We are building the AWS European Sovereign Cloud to offer public sector organizations and customers in highly regulated industries further choice to help them meet their unique digital sovereignty requirements, as well as stringent data residency, operational autonomy, and resiliency requirements. This is an investment of 7.8 billion euros (approximately $8.46 billion). The new Region will be available by the end of 2025.

Upcoming AWS events
Check your calendars and sign up for upcoming AWS events:

AWS Summits – Join free online and in-person events that bring the cloud computing community together to connect, collaborate, and learn about AWS. To learn more about future AWS Summit events, visit the AWS Summit page. Register in your nearest city: New York (July 10), Bogotá (July 18), and Taipei (July 23–24).

AWS Community Days – Join community-led conferences that feature technical discussions, workshops, and hands-on labs led by expert AWS users and industry leaders from around the world. Upcoming AWS Community Days are in Cameroon (July 13), Aotearoa (August 15), and Nigeria (August 24).

Browse all upcoming AWS-led in-person and virtual events and developer-focused events.

That’s all for this week. Check back next Monday for another Weekly Roundup!

— seb

Refactoring to Serverless: From Application to Automation

2024-07-03 Sindhu Pillai

Post Syndicated from Sindhu Pillai original https://aws.amazon.com/blogs/devops/refactoring-to-serverless-from-application-to-automation/

Serverless technologies not only minimize the time that builders spend managing infrastructure, they also help builders reduce the amount of application code they need to write. Replacing application code with fully managed cloud services improves both the operational characteristics and the maintainability of your applications thanks to a cleaner separation between business logic and application topology. This blog post shows you how.

Serverless isn’t a runtime; it’s an architecture

Since the launch of AWS Lambda in 2014, serverless has evolved to be more than just a cloud runtime. The ability to easily deploy and scale individual functions, coupled with per-millisecond billing, has led to the evolution of modern application architectures from monoliths towards loosely-coupled applications. Functions typically communicate through events, an interaction model that’s supported by a combination of serverless integration services, such as Amazon EventBridge and Amazon SNS, and Lambda’s asynchronous invocation model.

Modern distributed architectures with independent runtime elements (like Lambda functions or containers) have a distinct topology graph that represents which elements talk to others. In the diagram below, Amazon API Gateway, Lambda, EventBridge, and Amazon SQS interact to process an order in a typical Order Processing System. The topology has a major influence on the application’s runtime characteristics like latency, throughput, or resilience.

The role of cloud automation evolves

Cloud automation languages, commonly referred to as IaC (Infrastructure as Code), date back to 2011 with the launch of CloudFormation, which allowed users to declare a set of cloud resources in configuration files instead of issuing a series of API calls or CLI commands. Initial document-oriented automation languages like AWS CloudFormation and Terraform were soon complemented by frameworks like AWS Cloud Development Kit (CDK), CDK for Terraform, and Pulumi that introduced the ability to write cloud automation code in popular general-purpose languages like TypeScript, Python, or Java.

The role of cloud automation evolved alongside serverless application architectures. Because serverless technologies free builders from having to manage infrastructure, there really isn’t any “I” in serverless IaC anymore. Instead, serverless cloud automation primarily defines the application’s topology by connecting Lambda functions with event sources or targets, which can be other Lambda functions. This approach more closely resembles “AaC” – Architecture as Code – as the automation now defines the application’s architecture instead of provisioning infrastructure elements.

Improving serverless applications with automation code

By utilizing AWS serverless runtime features, automation code can frequently achieve the same functionality as your application code.

For example, the Lambda function below, written in TypeScript, sends a message to EventBridge:

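import { APIGatewayProxyEvent, APIGatewayProxyResult } from 'aws-lambda';
import { EventBridgeClient, PutEventsCommand } from '@aws-sdk/client-eventbridge';

const eventBridgeClient = new EventBridgeClient({});

export const handler = async (event: APIGatewayProxyEvent): Promise<APIGatewayProxyResult> => {
    const result = {}; // placeholder: replace with the business logic
    // Build the event that announces the new order
    const eventParam = new PutEventsCommand({
        Entries: [
            {
                Detail: JSON.stringify(result),
                DetailType: 'OrderCreated',
                EventBusName: process.env.EVENTBUS_NAME,
            }
        ]
    });
    // Publish the event explicitly from application code
    await eventBridgeClient.send(eventParam);
    return {
        statusCode: 200,
        body: JSON.stringify({ message: 'Order created', result }),
    };
};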

You can achieve the same behavior using AWS Lambda Destinations, which instructs the Lambda runtime to publish an event after the completion of the function. You can configure Lambda destinations with the following AWS CDK code, also written in TypeScript:

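import { Code, Function, Runtime } from "aws-cdk-lib/aws-lambda";
import { EventBridgeDestination } from "aws-cdk-lib/aws-lambda-destinations";

// `eventBus` is an EventBus construct defined elsewhere in the stack
const createOrderLambda = new Function(this, 'createOrderLambda', {
    functionName: 'OrderService',
    runtime: Runtime.NODEJS_20_X,
    code: Code.fromAsset('lambda-fns/send-message-using-destination'),
    handler: 'OrderService.handler',
    // The Lambda runtime publishes the function result to the event bus on success
    onSuccess: new EventBridgeDestination(eventBus),
});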

With the AWS CDK, you can use the same programming languages for both application and automation code, allowing you to switch easily between the two.

The Lambda function can now focus on the business logic and doesn’t contain any reference to message sending or EventBridge. This separation of concerns is a best practice because changes to the business logic do not run the risk of breaking the architecture and vice versa.

export const handler = async (event: APIGatewayProxyEvent): Promise<APIGatewayProxyResult> => {
    const result = {}; // placeholder: replace with the business logic
    return {
        statusCode: 200,
        body: JSON.stringify({ message: 'Order created', result }),
    };
};

Instructing the serverless Lambda runtime to send the event has several advantages over hand-coding it inside the application code:

It decouples application logic from topology. The message destination, consisting of the type of the service (e.g., EventBridge vs. another Lambda Function) and the destination’s ARN, define the application’s architecture (or topology). Embedding message sending in the application code mixes architecture with business logic. Handling the sending of the message in the runtime separates concerns and avoids having to touch the application code for a topology change.
It makes the composition explicit. If application code sends a message, it will likely read the destination from an environment variable, which is passed to the Lambda function. The name of the variable that is used for this purpose is buried in the application code, forcing you to rely on naming conventions. Defining all dependencies between service instances in automation code keeps them in a central location, and allows you to use code analysis and refactoring tools to reason about your architecture or make changes to it.
It avoids simple mistakes. Redundant code can lead to mistakes. For example, debugging a Lambda function that accidentally swapped day and month in the message’s date field took hours. Letting the runtime send messages avoids such errors.
Higher-level constructs simplify permission grants. Cloud automation libraries like CDK allow the creation of higher-level constructs, which can combine multiple resources and include necessary IAM permissions. You’ll write less code and avoid debugging cycles.
The runtime is more robust. Delegating message sending to the serverless runtime takes care of any required retries, helping ensure that the message is sent and freeing builders from having to write extra code for such undifferentiated heavy lifting.

In summary, letting the managed service handle message passing makes your serverless application cleaner and more robust. We also like to say that it becomes “serverless-native” because it fully utilizes the native services available to the application.

Refactoring to serverless-native

Shifting code from application to automation is what we call “Refactoring to Serverless”. Refactoring is a term popularized by Martin Fowler in the late 90s to describe the restructuring of source code to alter its structure without changing its external behavior. Code refactoring can be as simple as extracting code into a separate method or more sophisticated like replacing conditional expressions with polymorphism.

Developers refactor their code to improve its readability and maintainability. A common approach in Test-Driven Development (TDD) is the so-called red-green-refactor cycle: write a test, which will be red because the functionality isn’t implemented, then write the code to make the test green, and finally refactor to counteract the growing entropy in the codebase.

Serverless refactoring takes inspiration from this concept but adapts it to the context of serverless automation:

Serverless refactoring: A controlled technique for improving the design of serverless applications by replacing application code with equivalent automation code.

Let’s explore how serverless refactoring can enhance the design and runtime characteristics of a serverless application. The diagram below shows an AWS Step Functions workflow that performs a quality check through image recognition. An early implementation, shown on the left, would use an intermediate AWS Lambda function to call the Amazon Rekognition service. Thanks to the launch of Step Functions’ AWS SDK service integrations in 2021, you can refactor the workflow to directly call the Rekognition API. This refactored design, seen on the right, eliminates the Lambda function (assuming it didn’t perform any additional tasks), thereby reducing costs and runtime complexity.

Replacing Lambda with Service Integration in Step Function workflow

See the AWS CDK implementation for this refactoring, in TypeScript, on GitHub.
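As a rough sketch of what the refactored state might look like, the following TypeScript builds an Amazon States Language (ASL) definition with a single Task state that calls Rekognition through the AWS SDK service integration. The state name, the input field names (`bucket`, `key`), and the `MaxLabels` value are hypothetical placeholders, and the `TaskState` type covers only the fields used here. The resource ARN follows the `arn:aws:states:::aws-sdk:serviceName:apiAction` pattern used by SDK integrations.

```typescript
// Minimal subset of an ASL Task state, typed for illustration only.
interface TaskState {
  Type: "Task";
  Resource: string;
  Parameters: Record<string, unknown>;
  ResultPath?: string;
  End?: boolean;
}

// Builds a Task state that calls Rekognition DetectLabels directly,
// so no intermediate Lambda function is needed.
// The "$.bucket" and "$.key" input paths are assumptions about the workflow input.
function detectLabelsState(): TaskState {
  return {
    Type: "Task",
    Resource: "arn:aws:states:::aws-sdk:rekognition:detectLabels",
    Parameters: {
      Image: {
        S3Object: {
          "Bucket.$": "$.bucket",
          "Name.$": "$.key",
        },
      },
      MaxLabels: 10,
    },
    ResultPath: "$.labels",
    End: true,
  };
}

// A one-state workflow definition that could be serialized with JSON.stringify.
const definition = {
  StartAt: "DetectLabels",
  States: { DetectLabels: detectLabelsState() },
};
```

Because the call is now part of the workflow definition, the state machine's execution role (rather than a Lambda function's role) needs permission to call the Rekognition API.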

Refactoring Limitations

The initial example of replacing application code that sends a message to EventBridge with Lambda Destinations reveals that refactoring from application to automation code isn’t 100% behavior-preserving.

First, Lambda Destinations are only triggered when the function is invoked asynchronously. For synchronous invocations, the function passes the results back to the caller, and does not invoke the destination. Second, the serverless runtime wraps the data returned from the function inside a message envelope, affecting how the message recipient parses the JSON object. The message data is placed inside the responsePayload field if sending to another Lambda function or the detail field if sending to an EventBridge destination. Last, Lambda Destinations sends a message after the function completes, whereas application code could send the message at any point during the execution.
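To illustrate the envelope difference, here is a minimal sketch of how a message recipient might unwrap the function result. The type and function names are hypothetical, and the interfaces list only a subset of fields. It assumes that Lambda delivers an invocation record containing `responsePayload`, and that an EventBridge destination carries that same record inside the event's `detail` field.

```typescript
// Subset of the invocation record that Lambda Destinations delivers.
interface InvocationRecord<T> {
  version: string;
  timestamp: string;
  requestPayload: unknown; // the original event passed to the function
  responsePayload: T;      // the value the function returned
}

// Subset of an EventBridge event; the invocation record sits in "detail".
interface EventBridgeEnvelope<T> {
  "detail-type": string;
  source: string;
  detail: InvocationRecord<T>;
}

// Returns the function's original return value regardless of destination type.
function unwrapResult<T>(message: InvocationRecord<T> | EventBridgeEnvelope<T>): T {
  const record = "detail" in message ? message.detail : message;
  return record.responsePayload;
}
```

A recipient that previously parsed the message body directly would call `unwrapResult` first and then continue with its existing parsing logic.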

Lambda Destination Execution

The last change in behavior will be transparent to well-architected asynchronous applications because they won’t depend on the timing of message delivery. If a Lambda function continues processing after sending a message (for example, to EventBridge), that code can’t assume that the message has been processed because delivery is asynchronous. A rare exception could be a loop waiting for the results from the downstream message processing, but such loops violate the principles of asynchronous integration and also waste compute resources (AWS Step Functions is a great choice for asynchronous callbacks). If such behavior is required, it can be achieved by splitting the Lambda function into two parts.

Can Serverless Refactoring be Automated?

Traditional code refactoring like “Extract Method” is automated thanks to built-in support by many code editors. Serverless refactoring isn’t (yet) a fully automatic, 100%-equivalent code transformation because it translates application code into automation code (or vice versa). While AI-powered tools like Amazon Q Developer are getting us closer to that vision, we consider serverless refactoring primarily as a design technique for developers to better utilize the AWS runtime. Improved code design and runtime characteristics outweigh behavior differences, especially if your application includes automated tests.

Incorporating refactoring into your team structures

If a single team owns both the application and the automation code, refactoring takes place inside the team. However, serverless refactoring can cross team boundaries when separate teams develop business logic versus managing the underlying infrastructure, configuration, and deployment.

In such a model, AWS recommends that the development team be responsible for both the application code and the application-specific automation, such as the CDK code to configure Lambda Destinations, Step Functions workflows, or EventBridge routing. Splitting application and application-specific automation across teams would make the development team dependent on the platform team for each refactoring and introduce unnecessary friction.

If both teams use the same Infrastructure-as-Code (IaC) tool, say AWS CDK, the platform team can build reusable templates and constructs that encapsulate organizational requirements and guardrails, such as CDK constructs for S3 buckets with encryption enabled. Development teams can easily consume those resources across CDK stacks.

However, teams could use different IaC tools, for example, the infrastructure team prefers CloudFormation but the development team prefers AWS CDK. In this setup, development teams can build their automation on top of the CFN Modules provided by the infrastructure team. However, they won’t benefit from the same high-level programming abstractions as they do with CDK.

Collaboration in a split-team model

Continuous Refactoring

Just like traditional code refactoring, refactoring to serverless isn’t a one-time activity but an essential aspect of your software delivery. Because adding functionality increases your application’s complexity, regular refactoring can help keep complexity at bay and maintain your development velocity. Like with Continuous Delivery, you can improve your software delivery with Continuous Refactoring.

Teams who encounter difficulties with serverless refactoring might be lacking automated test coverage or cloud automation. So, refactoring can become a useful forcing function for teams to exercise software delivery hygiene, for example by implementing automated tests.

Getting Started

The refactoring samples discussed here are a subset of an extensive catalog of open source code examples, which you can find along with AWS CDK implementation examples at refactoringserverless.com. You can also dive deeper into how serverless refactoring can make your application architecture more loosely coupled in a separate blog post.

Use the examples to accelerate your own refactoring effort. Now Go Refactor!

Serverless ICYMI Q2 2024

2024-07-02 Julian Wood

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/serverless-icymi-q2-2024/

Welcome to the 26th edition of the AWS Serverless ICYMI (in case you missed it) quarterly recap. Every quarter, we share all the most recent product launches, feature enhancements, blog posts, webinars, live streams, and other interesting things that you might have missed!

In case you missed our last ICYMI, check out what happened last quarter here.

Calendar

EDA Day – London 2024

The AWS Serverless DA team hosted the third Event-Driven Architecture (EDA) Day in London on May 14th. This event brought together prominent figures in the event-driven architecture community, AWS, and customer speakers.

EDA Day covered 13 sessions, 2 workshops, and a Q&A panel. David Boyne was the keynote speaker with a talk “Complexity is the Gotcha of Event-Driven Architecture”. There were AWS speakers including Matthew Meckes, Natasha Wright, Julian Wood, Gillian Armstrong, Josh Kahn, Veda Raman, and Uma Ramadoss. There was also an impressive lineup of guest speakers: Daniele Frasca, David Anderson, Ryan Cormack, Sarah Hamilton, Sheen Brisals, Marcin Sodkiewicz, and Ben Ellerby.

Videos are available on YouTube.

EDA Day London

The future of Serverless

There has been a lot of talk about the future of serverless, with this year being the 10th anniversary of AWS Lambda. Eric Johnson addresses the topic in his ServerlessDays Milan keynote, “Now serverless is all grown up, what’s next”.

AWS Lambda

AWS launched support for Ruby 3.3, the latest release of Ruby. The Ruby 3.3 runtime is based on the new Amazon Linux 2023 runtime and also provides access to the latest Ruby language features.

There is a new guide on how to retrieve data about Lambda functions that use a deprecated runtime.

Learn how to run code after returning a response from an AWS Lambda function. This post shows how to return a synchronous function response as soon as possible, yet also perform additional asynchronous work after you send the response. For example, you may store data in a database or send information to a logging system.

See how you can use the circuit-breaker pattern with Lambda extensions and Amazon DynamoDB. The circuit breaker pattern can help prevent cascading failures and improve overall system stability.

Circuit-breaker pattern

Lambda functions now scale up to 12X faster in the AWS GovCloud (US) Regions.

Powertools for AWS Lambda (Python) adds support for Agents for Amazon Bedrock.

The AWS SDK for JavaScript v2 enters maintenance mode on September 8, 2024 and reaches end-of-support on September 8, 2025.

Amazon CloudWatch Logs introduced Live Tail streaming CLI support.

Amazon ECS and AWS Fargate

You can now secure Amazon Elastic Container Service (Amazon ECS) workloads on AWS Fargate with customer managed keys (CMKs). Once you add your keys to AWS Key Management Service (AWS KMS), you can use these to encrypt the underlying ephemeral storage of an Amazon ECS task on AWS Fargate.

Windows containers on AWS Fargate now start faster, up to 42% for Windows Server 2022 Core. AWS has optimized the Windows Server AMIs, introduced EC2 fast launch with pre-provisioned snapshots, and reduced network latency.

Amazon ECS Service Connect is a networking capability to simplify service discovery, connectivity, and traffic observability for Amazon ECS. You can now proactively scale Amazon ECS services by using custom metrics.

ECS Service Connect custom metrics

AWS Step Functions

The AWS Step Functions TestState API allows you to test individual states independently and to integrate testing into your preferred development workflows. Learn how to accelerate workflow development to iterate faster.

Step Functions TestState API

Amazon EventBridge

Amazon EventBridge Pipes now supports event delivery through AWS PrivateLink. You can send events from an event source located in an Amazon Virtual Private Cloud (VPC) to a Pipes target without traversing the public internet.

Amazon Timestream for LiveAnalytics is now an EventBridge Pipes target. Timestream for LiveAnalytics is a fast, scalable, purpose-built time series database that makes it easy to store and analyze trillions of time series data points per day.

EventBridge has a new console dashboard which provides a centralized view of your resources, metrics, and quotas. The console has an improved Learn page and other console enhancements. When using the CloudFormation template export for Pipes, you can also generate the IAM role. There is a new Rules tab in the Event Bus detail page, and the monitoring tab in the Rule detail page now includes additional metrics.

EventBridge Scheduler has some new API request metrics for improved observability.

Generative AI

Amazon Bedrock is a fully managed Generative AI service that offers a choice of high-performing foundation models (FMs) from leading AI companies through a single API. Bedrock now supports new models, including Anthropic’s Claude 3.5, AI21 Labs’ Jamba-Instruct, and Amazon Titan Text Premier.

The new Bedrock Converse API provides a consistent way to invoke Amazon Bedrock models and simplifies multi-turn conversations. There is also a JavaScript tutorial to walk you through sending requests to the Converse API using the JavaScript SDK.

Amazon Q Developer is now generally available. Amazon Q Developer, part of the Amazon Q family, is a generative AI–powered assistant for software development. Amazon Q is available in the AWS Management Console and as an integrated development environment (IDE) extension for Visual Studio Code, Visual Studio, and JetBrains IDEs. Amazon Q Developer has knowledge of your AWS account resources and can help understand your costs.

Amazon Q list Lambda functions

You can use Amazon Q Developer to develop code features and transform code to upgrade Java applications. Amazon Q Developer also offers inline completions in the command line. For more information, see Reimagining software development with the Amazon Q Developer Agent.

Amazon Q code features

Knowledge Bases for Amazon Bedrock now lets you configure Guardrails and inference parameters, and offers observability logs.

Storage and data

Amazon S3 no longer charges for several HTTP error codes if initiated from outside your individual AWS account or AWS Organization.

You can automatically detect malware in new object uploads to S3 with Amazon GuardDuty.

Amazon Elastic File System (Amazon EFS) now supports up to 1.5 GiB/s of throughput per client, a 3x increase over the previous limit of 500 MiB/s.

Discover architectural patterns for real-time analytics using Amazon Kinesis Data Streams in part 1 and part 2 and see how to optimize write throughput.

Amazon API Gateway

Amazon API Gateway now allows you to increase the integration timeout beyond the prior limit of 29 seconds. You can raise the integration timeout for Regional and private REST APIs, but this might require a reduction in your account-level throttle quota limit. This launch can help with workloads that require longer timeouts, such as Generative AI use cases with Large Language Models (LLMs).

You can also now use Amazon Verified Permissions to secure API Gateway REST APIs when using an OpenID Connect (OIDC) compliant identity provider. You can now control access based on user attributes and group memberships, without writing code.

AWS AppSync

You can now invoke your AWS AppSync data sources in an event-driven manner. Previously, you could only invoke Lambda functions synchronously from AWS AppSync. AWS AppSync can now trigger Lambda functions in Event mode, asynchronously decoupling the API response from the Lambda invocation, which helps with long-running operations.

AWS AppSync now passes application request headers to Lambda custom authorizer functions. You can make authorization decisions based on the value of the authorization header, and the value of other headers that were sent with the request from the application client.

Learn best practices for AWS AppSync GraphQL APIs. See how to optimize the security, performance, coding standards, and deployment of your AWS AppSync API. AWS AppSync also has increased quotas and new metrics.

AWS Amplify

AWS Amplify Gen 2 is now generally available. It provides a code-first developer experience for building full-stack apps using TypeScript. Amplify Gen 2 allows you to express app requirements like the data models, business logic, and authorization rules in TypeScript.

AWS Amplify Gen2

Amplify has a new experience for file storage. This post explores using Lambda to create serverless functions for Amplify using TypeScript. There are also new team environment workflows.

Serverless blog posts

April

May

June

Securing Amazon ECS workloads on AWS Fargate with customer managed keys

Serverless container blog posts

April

May

Windows Containers on AWS Fargate: Launch time improvements

June

Proactive scaling of Amazon ECS services using Amazon ECS Service Connect Metrics

Serverless Office Hours

April

Apr 2 – Building Serverless Applications with Terraform
Apr 9 – Developing with Wing Cloud
Apr 16 – Combining serverless messaging services
Apr 23 – Real-time web and mobile backends
Apr 30 – Connecting Confluent to AWS

May

May 7 – Develop and test locally with LocalStack
May 14 – Building a personalized GenAI webapp
May 21 – Serverless GenAI using Bedrock Claude 3
May 28 – Serverless Platform Engineering

June

June 4 – Simplifying serverless with the CDK
June 11 – Learn Serverless with Educloud Academy
June 18 – Integrating time-series databases
June 25 – Deploy frontends with the CloudFront Hosting Toolkit

Containers from the Couch

April

Apr 11 – Using Amazon Q to build and operate your ECS workloads
Apr 25 – Containers in AWS Lambda

May

May 9 – OPA on AWS

FooBar Serverless

Still looking for more?

The Serverless landing page has more information. The Lambda resources page contains case studies, webinars, whitepapers, customer stories, reference architectures, and even more Getting Started tutorials.

You can also follow the Serverless Developer Advocacy team on X (formerly Twitter) to see the latest news, follow conversations, and interact with the team.

Eric Johnson: @edjgeek
Julian Wood: @julian_wood
Marcia Villalba: @mavi888uy
Olly Pomeroy: @oliver-p
Romain Jourdan: @rjourdan_net

And finally, visit the Serverless Land and Containers on AWS websites for all your serverless and serverless container needs.

Best practices working with self-hosted GitHub Action runners at scale on AWS

2024-06-25 Shilpa Sharma

Post Syndicated from Shilpa Sharma original https://aws.amazon.com/blogs/devops/best-practices-working-with-self-hosted-github-action-runners-at-scale-on-aws/

Overview

GitHub Actions is a continuous integration and continuous deployment platform that enables the automation of build, test and deployment activities for your workload. GitHub Self-Hosted Runners provide a flexible and customizable option to run your GitHub Action pipelines. These runners allow you to run your builds on your own infrastructure, giving you control over the environment in which your code is built, tested, and deployed. This reduces your security risk and costs, and gives you the ability to use specific tools and technologies that may not be available in GitHub hosted runners. In this blog, I explore security, performance and cost best practices to take into consideration when deploying GitHub Self-Hosted Runners to your AWS environment.

Best Practices

Understand your security responsibilities

GitHub Self-hosted runners, by design, execute code defined within a GitHub repository, either through the workflow scripts or through the repository build process. You must understand that the security of your AWS runner execution environments is dependent upon the security of your GitHub implementation. Whilst a complete overview of GitHub security is outside the scope of this blog, I recommend that before you begin integrating your GitHub environment with your AWS environment, you review and understand at least the following GitHub security configurations.

Federate your GitHub users, and manage the lifecycle of identities through a directory.
Limit administrative privileges of GitHub repositories, and restrict who is able to administer permissions, write to repositories, modify repository configurations or install GitHub Apps.
Limit control over GitHub Actions runner registration and group settings.
Limit control over GitHub workflows, and follow GitHub’s recommendations on using third-party actions.
Do not allow public repositories access to self-hosted runners.

Reduce security risk with short-lived AWS credentials

Make use of short-lived credentials wherever you can. They expire by default within 1 hour, and you do not need to rotate or explicitly revoke them. Short-lived credentials are created by the AWS Security Token Service (STS). If you use federation to access your AWS account, assume roles, or use Amazon EC2 instance profiles and Amazon ECS task roles, you are using STS already!

In almost all cases, you do not need long-lived AWS Identity and Access Management (IAM) credentials (access keys) even for services that do not “run” on AWS – you can extend IAM roles to workloads outside of AWS without requiring you to manage long-term credentials. With GitHub Actions, we suggest you use OpenID Connect (OIDC). OIDC is a decentralized authentication protocol that is natively supported by STS using sts:AssumeRoleWithWebIdentity, GitHub and many other providers. With OIDC, you can create least-privilege IAM roles tied to individual GitHub repositories and their respective actions. GitHub Actions exposes an OIDC provider to each action run that you can utilize for this purpose.
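As an illustration of tying a role to an individual repository, the following sketch builds an IAM trust policy document for GitHub's OIDC provider. The function name is hypothetical, and the account ID, repository, and branch passed to it are placeholders. The condition keys assume the commonly documented `aud` and `sub` claims of the GitHub token, with the subject in the form `repo:<org>/<repo>:ref:refs/heads/<branch>`.

```typescript
// Builds a trust policy that lets workflows from one repository branch
// assume the role via sts:AssumeRoleWithWebIdentity.
function gitHubOidcTrustPolicy(accountId: string, repository: string, branch: string) {
  const provider = "token.actions.githubusercontent.com";
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Effect: "Allow",
        // The OIDC provider must already be registered in the AWS account.
        Principal: { Federated: `arn:aws:iam::${accountId}:oidc-provider/${provider}` },
        Action: "sts:AssumeRoleWithWebIdentity",
        Condition: {
          StringEquals: {
            // Audience used when requesting AWS credentials.
            [`${provider}:aud`]: "sts.amazonaws.com",
            // Restrict the role to a single repository and branch.
            [`${provider}:sub`]: `repo:${repository}:ref:refs/heads/${branch}`,
          },
        },
      },
    ],
  };
}

// Example with placeholder values.
const policy = gitHubOidcTrustPolicy("123456789012", "testorg-poc/github-actions-test-repo", "main");
console.log(JSON.stringify(policy, null, 2));
```

The permissions policy attached to the role, not shown here, is where least-privilege access to the resources a repository needs would be defined.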

Short lived AWS credentials with GitHub self-hosted runners

If you have many repositories that you wish to grant an individual role to, you may run into a hard limit of the number of IAM roles in a single account. While I advocate solving this problem with a multi-account strategy, you can alternatively scale this approach by:

using attribute-based access control (ABAC) to match claims in the GitHub token (such as repository name, branch, or team) to the AWS resource tags.
using role-based access control (RBAC) by logically grouping the repositories in GitHub into Teams or applications to create a smaller set of roles.
using an identity broker pattern to vend credentials dynamically based on the identity provided to the GitHub workflow.

Use Ephemeral Runners

Configure your GitHub Action runners to run in “ephemeral” mode, which creates (and destroys) individual short-lived compute environments per job on demand. The short environment lifespan and per-build isolation reduce the risk of data leakage, even in multi-tenanted continuous integration environments, as each build job remains isolated from others on the underlying host.

As each job runs on a new environment created on demand, there is no need for a job to wait for an idle runner, simplifying auto-scaling. With the ability to scale runners on demand, you do not need to worry about turning build infrastructure off when it is not needed (for example out of office hours), giving you a cost-efficient setup. To optimize the setup further, consider allowing developers to tag workflows with instance type tags and launch specific instance types that are optimal for respective workflows.

There are a few considerations to take into account when using ephemeral runners:

A job will remain queued until the runner EC2 instance has launched and is ready. This can take up to 2 minutes to complete. To speed up this process, consider using an optimized AMI with all prerequisites installed.
Since each job is launched on a fresh runner, utilizing caching on the runner is not possible. For example, Docker images and libraries will always be pulled from source.

Use Runner Groups to isolate your runners based on security requirements

By using ephemeral runners in a single GitHub runner group, you are creating a pool of resources in the same AWS account that are used by all repositories sharing this runner group. Your organizational security requirements may dictate that your execution environments must be isolated further, such as by repository or by environment (such as dev, test, prod).

Runner groups allow you to define the runners that will execute your workflows on a repository-by-repository basis. Creating multiple runner groups not only allows you to provide different types of compute environments, but also allows you to place your workflow executions in locations within AWS that are isolated from each other. For example, you may choose to locate your development workflows in one runner group and test workflows in another, with each ephemeral runner group being deployed to a different AWS account.

Runners by definition execute code on behalf of your GitHub users. At a minimum, I recommend that your ephemeral runner groups are contained within their own AWS account and that this AWS account has minimal access to other organizational resources. When access to organizational resources is required, this can be given on a repository-by-repository basis through IAM role assumption with OIDC, and these roles can be given least-privilege access to the resources they require.

Optimize runner start up time using Amazon EC2 warm-pools

Ephemeral runners provide strong build isolation, simplicity and security. Since the runners are launched on demand, the job will be required to wait for the runner to launch and register itself with GitHub. While this usually happens in under 2 minutes, this wait time might not be acceptable in some scenarios.

We can use a warm pool of pre-registered ephemeral runners to reduce the wait time. These runners actively listen for incoming GitHub workflow events, and as soon as a workflow event is queued, it is picked up by one of the registered EC2 runners in the warm pool.

While there can be multiple strategies to manage the warm pool, I recommend the following strategy which uses AWS Lambda for scaling up and scaling down the ephemeral runners:

GitHub self-hosted runners warm pool flow

A GitHub workflow event is created on a trigger, such as a code push to a repository or the merge of a pull request.
This event triggers a Lambda function via a webhook and an Amazon API Gateway endpoint.
The Lambda function validates the GitHub workflow event payload and logs events for observability and building metrics. It can optionally be used to replenish the warm pool.
Separate backend Lambda functions launch, scale up, and scale down the warm pool of EC2 instances.
The EC2 instances, or runners, are registered with GitHub at the time of launch.
The registered runners listen for incoming GitHub workflow events using GitHub’s internal job queue. As soon as a workflow event is triggered, GitHub assigns it to one of the runners in the warm pool for job execution.
The runner is automatically de-registered once the job completes. A job can be a build or deploy request as defined in your GitHub workflow.
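One way the Lambda function could validate an incoming webhook payload is by checking the HMAC signature that GitHub attaches to webhook deliveries when a secret is configured. The sketch below assumes the signature arrives in the `X-Hub-Signature-256` header as `sha256=<hex digest>` and that the raw request body is available unmodified. The function name and the way the secret is supplied are hypothetical.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Returns true when the signature header matches the HMAC-SHA256 of the raw body.
// rawBody must be the exact string received, before any JSON parsing.
function isValidGitHubSignature(
  rawBody: string,
  signatureHeader: string | undefined,
  secret: string
): boolean {
  if (!signatureHeader || !signatureHeader.startsWith("sha256=")) {
    return false;
  }
  const expected =
    "sha256=" + createHmac("sha256", secret).update(rawBody, "utf8").digest("hex");
  const expectedBytes = Buffer.from(expected, "utf8");
  const receivedBytes = Buffer.from(signatureHeader, "utf8");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return (
    expectedBytes.length === receivedBytes.length &&
    timingSafeEqual(expectedBytes, receivedBytes)
  );
}
```

In a handler behind API Gateway, a request that fails this check would typically be rejected before any event is logged or any runner is launched.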

With a warm pool in place, wait time is expected to drop by 70-80%.

Considerations

Increased complexity as there is a possibility of over-provisioning runners. This will depend on how long a runner EC2 instance requires to launch and reach a ready state and how frequently the scale-up Lambda is configured to run. For example, if the scale-up Lambda runs every 1 minute and the EC2 runner requires 2 minutes to launch, then the scale-up Lambda will launch 2 instances. The mitigation is to use Auto Scaling groups to manage the EC2 warm pool and desired capacity with predictive scaling policies tying back to incoming GitHub workflow events, i.e., build job requests.
This strategy may have to be revised when supporting Windows or Mac based runners given the spin up times can vary.
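The over-provisioning scenario above arises when a scale-up run ignores instances that are still launching. As a simple illustration (a different, simpler guard than the Auto Scaling group mitigation mentioned above), a scale-up function could count pending instances toward the pool size. The `PoolState` shape and function name are hypothetical.

```typescript
// Snapshot of the warm pool at the time the scale-up function runs.
interface PoolState {
  targetSize: number; // desired number of runners in the warm pool
  ready: number;      // runners registered and idle
  pending: number;    // instances launched but not yet ready
}

// Number of new instances to launch, counting in-flight launches
// so that a slow-starting instance is not launched twice.
function instancesToLaunch(pool: PoolState): number {
  const shortfall = pool.targetSize - pool.ready - pool.pending;
  return Math.max(0, shortfall);
}
```

With a target of one runner and one instance still launching, the function returns 0, whereas a check based only on ready runners would launch a second instance.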

Use an optimized AMI to speed up the launch of GitHub self-hosted runners

Amazon Machine Images (AMI) provide a pre-configured, optimized image that can be used to launch the runner EC2 instance. By using AMIs, you will be able to reduce the launch time of a new runner since dependencies and tools are already installed. Consistency across builds is guaranteed due to all instances running the same version of dependencies and tools. Runners will benefit from increased stability and security compliance as images are tested and approved before being distributed for use as runner instances.

When building an AMI for use as a GitHub self-hosted runner the following considerations need to be made:

Choosing the right OS base image for the builds. This will depend on your tech stack and toolset.
Install the GitHub runner app as part of the image. Ensure automatic runner updates are enabled to reduce the overhead of managing runner versions. In case a specific runner version must be used, you can disable automatic runner updates to avoid untested changes. Keep in mind that, if disabled, a runner will need to be updated manually within 30 days of a new version becoming available.
Install build tools and dependencies from trusted sources.
Ensure runner logs are captured and forwarded to your security information and event management (SIEM) of choice.
The runner requires internet connectivity to access GitHub. This may require configuring proxy settings on the instance depending on your networking setup.
Configure any artifact repositories the runner requires. This includes sources and authentication.
Automate the creation of the AMI using tools such as EC2 Image Builder to achieve consistency.

Use Spot instances to save costs

The cost associated with scaling up the runners as well as maintaining a warm pool can be minimized using Spot Instances, which can result in savings of up to 90% compared to On-Demand prices. However, some longer-running builds or batch jobs cannot tolerate a Spot Instance terminating on two minutes’ notice. A mixed pool of instances is a good option here: route such jobs to On-Demand EC2 instances and the rest to Spot Instances to cater for diverse build needs. This can be done by assigning labels to the runner during launch or registration. For the On-Demand instances, we can put a Savings Plan in place to get cost benefits.

Record runner metrics using Amazon CloudWatch for Observability

It is vital for the observability of the overall platform to generate metrics for the EC2 based GitHub self-hosted runners. Examples of the GitHub runners metrics can be: the number of GitHub workflow events queued or completed in a minute, or number of EC2 runners up and available in the warm pool etc.

We can log the triggered workflow events and runner logs in Amazon CloudWatch and then use CloudWatch embedded metrics to collect metrics such as number of workflow events queued, in progress and completed. Using elements like “started_at” and “completed_at” timings which are part of workflow event payload we can calculate build wait time.
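As a small sketch of the timing calculation, the function below computes the elapsed seconds between the `started_at` and `completed_at` fields of a workflow event payload. The interface and function names are hypothetical. Which pair of events you compare (for example, the queued event against the in-progress event) determines whether the result represents wait time or run time, so treat this as an assumption to adapt.

```typescript
// Timing fields as they appear in the logged workflow event payload.
interface WorkflowJobTimings {
  started_at: string | null;
  completed_at: string | null;
}

// Returns the elapsed time in seconds, or null while the job has not completed.
function elapsedSeconds(job: WorkflowJobTimings): number | null {
  if (!job.started_at || !job.completed_at) {
    return null;
  }
  return (Date.parse(job.completed_at) - Date.parse(job.started_at)) / 1000;
}
```

The resulting value could then be emitted as an additional metric alongside the event counts.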

As an example, below is a sample incoming GitHub workflow event logged in Amazon CloudWatch Logs:

{
    "hostname": "xxx.xxx.xxx.xxx",
    "requestId": "aafddsd55-fgcf555",
    "date": "2022-10-11T05:50:35.816Z",
    "logLevel": "info",
    "logLevelId": 3,
    "filePath": "index.js",
    "fullFilePath": "/var/task/index.js",
    "fileName": "index.js",
    "lineNumber": 83889,
    "columnNumber": 12,
    "isConstructor": false,
    "functionName": "handle",
    "argumentsArray": [
        "Processing Github event",
        "{\"event\":\"workflow_job\",\"repository\":\"testorg-poc/github-actions-test-repo\",\"action\":\"queued\",\"name\":\"jobname-buildanddeploy\",\"status\":\"queued\",\"started_at\":\"2022-10-11T05:50:33Z\",\"completed_at\":null,\"conclusion\":null}"
    ]
}

To turn the logged elements of the above log into metrics by capturing "status":"queued", "repository":"testorg-poc/github-actions-test-repo", "name":"jobname-buildanddeploy", and the workflow "event", you can build embedded metrics in the application code or in a metrics Lambda function. Use any of the CloudWatch client libraries listed in Creating logs in embedded metric format using the client libraries – Amazon CloudWatch, based on the language of your choice.

Essentially, what one of those libraries does under the hood is map elements from the log event into dimension fields so that CloudWatch can read them and generate a metric.

console.log(
  JSON.stringify({
    message: '[Embedded Metric]', // Identifier for metric logs in CW logs
    build_event_metric: 1, // Metric Name and value
    status: `${status}`, // Dimension name and value
    eventName: `${eventName}`,
    repository: `${repository}`,
    name: `${name}`,
    _aws: {
      Timestamp: Date.now(),
      CloudWatchMetrics: [
        {
          Namespace: `demo_2`,
          Dimensions: [['status','eventName','repository','name']],
          Metrics: [
            {
              Name: 'build_event_metric',
              Unit: 'Count',
            },
          ],
        },
      ],
    },
  })
);

A sample architecture:

Consumption of GitHub webhook events

The CloudWatch metrics can be published to your dashboards or forwarded to any external tool based on requirements. Once the metrics are available, CloudWatch alarms and notifications can be configured to manage pool exhaustion.

Conclusion

In this blog post, we outlined several best practices covering security, scalability, and cost efficiency when using GitHub Actions with EC2 self-hosted runners on AWS. We covered how using short-lived credentials combined with ephemeral runners reduces security and build contamination risks. We also showed how runners can be optimized for faster startup and job execution using AMIs and warm EC2 pools. Last but not least, cost efficiencies can be maximized by using Spot Instances for runners in the right scenarios.


Disaster recovery strategies for Amazon MWAA – Part 2

2024-06-17 Chandan Rupakheti

Post Syndicated from Chandan Rupakheti original https://aws.amazon.com/blogs/big-data/disaster-recovery-strategies-for-amazon-mwaa-part-2/

Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a fully managed orchestration service that makes it straightforward to run data processing workflows at scale. Amazon MWAA takes care of operating and scaling Apache Airflow so you can focus on developing workflows. However, although Amazon MWAA provides high availability within an AWS Region through features like Multi-AZ deployment of Airflow components, recovering from a Regional outage requires a multi-Region deployment.

In Part 1 of this series, we highlighted challenges for Amazon MWAA disaster recovery and discussed best practices to improve resiliency. In particular, we discussed two key strategies: backup and restore and warm standby. In this post, we dive deep into the implementation for both strategies and provide a deployable solution to realize the architectures in your own AWS account.

The solution for this post is hosted on GitHub. The README in the repository offers tutorials as well as further workflow details for both backup and restore and warm standby strategies.

Backup and restore architecture

The backup and restore strategy involves periodically backing up Amazon MWAA metadata to Amazon Simple Storage Service (Amazon S3) buckets in the primary Region. The backups are replicated to an S3 bucket in the secondary Region. In case of a failure in the primary Region, a new Amazon MWAA environment is created in the secondary Region and hydrated with the backed-up metadata to restore the workflows.

The project uses the AWS Cloud Development Kit (AWS CDK) and is set up like a standard Python project. Refer to the detailed deployment steps in the README file to deploy it in your own accounts.

The following diagram shows the architecture of the backup and restore strategy and its key components:

Primary Amazon MWAA environment – The environment in the primary Region hosts the workflows
Metadata backup bucket – The bucket in the primary Region stores periodic backups of Airflow metadata tables
Replicated backup bucket – The bucket in the secondary Region syncs metadata backups through Amazon S3 cross-Region replication
Secondary Amazon MWAA environment – This environment is created on-demand during recovery in the secondary Region
Backup workflow – This workflow periodically backs up Airflow metadata to the S3 buckets in the primary Region
Recovery workflow – This workflow monitors the primary Amazon MWAA environment and initiates failover when needed in the secondary Region

Figure 1: The backup restore architecture

There are essentially two workflows that work in conjunction to achieve the backup and restore functionality in this architecture. Let’s explore both workflows in detail and the steps as outlined in Figure 1.

Backup workflow

The backup workflow is responsible for periodically taking a backup of your Airflow metadata tables and storing them in the backup S3 bucket. The steps are as follows:

[1.a] You can deploy the provided solution from your continuous integration and delivery (CI/CD) pipeline. The pipeline includes a DAG deployed to the DAGs S3 bucket, which performs backup of your Airflow metadata. This is the bucket where you host all of your DAGs for your environment.
[1.b] The solution enables cross-Region replication of the DAGs bucket. Any new changes to the primary Region bucket, including DAG files, plugins, and requirements.txt files, are replicated to the secondary Region DAGs bucket. However, for existing objects, a one-time replication needs to be performed using S3 Batch Replication.
[1.c] The DAG deployed to take metadata backup runs periodically. The metadata backup doesn’t include some of the auto-generated tables, and the list of tables to be backed up is configurable. By default, the solution backs up variable, connection, slot pool, log, job, DAG run, trigger, task instance, and task fail tables. The backup interval is also configurable and should be based on the Recovery Point Objective (RPO), which is the maximum period of data loss during a failure that your business can sustain.
[1.d] Similar to the DAGs bucket, the backup bucket is also synced using cross-Region replication, through which the metadata backup becomes available in the secondary Region.

Recovery workflow

The recovery workflow runs periodically in the secondary Region monitoring the primary Amazon MWAA environment. It has two functions:

Store the environment configuration of the primary Amazon MWAA environment in the secondary backup bucket, which is used to recreate an identical Amazon MWAA environment in the secondary Region during failure
Perform the failover when a failure is detected

The following are the steps for when the primary Amazon MWAA environment is healthy (see Figure 1):

[2.a] The Amazon EventBridge scheduler starts the AWS Step Functions workflow on a provided schedule.
[2.b] The workflow, using AWS Lambda, checks Amazon CloudWatch in the primary Region for the SchedulerHeartbeat metrics of the primary Amazon MWAA environment. The environment in the primary Region sends heartbeats to CloudWatch every 5 seconds by default. However, to not invoke a recovery workflow spuriously, we use a default aggregation period of 5 minutes to check the heartbeat metrics. Therefore, it can take up to 5 minutes to detect a primary environment failure.
[2.c] Assuming that the heartbeat was detected in 2.b, the workflow makes the cross-Region GetEnvironment call to the primary Amazon MWAA environment.
[2.d] The response from the GetEnvironment call is stored in the secondary backup S3 bucket to be used in case of a failure in the subsequent iterations of the workflow. This makes sure the latest configuration of your primary environment is used to recreate a new environment in the secondary Region. The workflow completes successfully after storing the configuration.
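The heartbeat check in step 2.b can be sketched as follows. This is a minimal illustration, not the solution's actual Lambda code: the function name is hypothetical, the `cloudwatch` argument is assumed to be a boto3 CloudWatch client created for the primary Region, and the `AmazonMWAA` namespace and the `Function` and `Environment` dimensions are assumptions you should verify against the metrics your environment publishes.

```python
from datetime import datetime, timedelta, timezone


def primary_is_healthy(cloudwatch, environment_name, period_minutes=5):
    """Return True if any scheduler heartbeat was recorded in the last period.

    `cloudwatch` is expected to behave like a boto3 CloudWatch client for the
    primary Region, for example boto3.client("cloudwatch", region_name=...).
    """
    end = datetime.now(timezone.utc)
    start = end - timedelta(minutes=period_minutes)
    response = cloudwatch.get_metric_statistics(
        Namespace="AmazonMWAA",  # assumed namespace for MWAA environment metrics
        MetricName="SchedulerHeartbeat",
        Dimensions=[
            {"Name": "Function", "Value": "Scheduler"},  # assumed dimension
            {"Name": "Environment", "Value": environment_name},  # assumed dimension
        ],
        StartTime=start,
        EndTime=end,
        Period=period_minutes * 60,  # one aggregated datapoint for the whole window
        Statistics=["Sum"],
    )
    # No datapoints, or a zero sum, means no heartbeat was seen in the window.
    return any(point.get("Sum", 0) > 0 for point in response.get("Datapoints", []))
```

Aggregating over the full 5-minute window, rather than reacting to a single missed 5-second heartbeat, is what keeps the recovery workflow from being invoked spuriously.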

The following are the steps for the case when the primary environment is unhealthy (see Figure 1):

[2.a] The EventBridge scheduler starts the Step Functions workflow on a provided schedule.
[2.b] The workflow, using Lambda, checks CloudWatch in the primary Region for the scheduler heartbeat metrics and detects failure. The scheduler heartbeat check using the CloudWatch API is the recommended approach to detect failure. However, you can implement a custom strategy for failure detection in the Lambda function, such as deploying a DAG to periodically send custom metrics to CloudWatch or other data stores as heartbeats and using the function to check those metrics. With the current CloudWatch-based strategy, the unavailability of the CloudWatch API may spuriously invoke the recovery flow.
[2.c] Skipped
[2.d] The workflow reads the previously stored environment details from the backup S3 bucket.
[2.e] The environment details read in the previous step are used to recreate an identical environment in the secondary Region using the CreateEnvironment API call. The API also needs other secondary Region-specific configurations such as VPC, subnets, and security groups, which are read from the user-supplied configuration file or environment variables during the solution deployment. The workflow waits in a polling loop until the environment becomes available and then invokes the DAG to restore metadata from the backup S3 bucket. This DAG is deployed to the DAGs S3 bucket as a part of the solution deployment.
[2.f] The DAG for restoring metadata completes hydrating the newly created environment and notifies the Step Functions workflow of completion using the task token integration. The new environment now starts running the active workflows and the recovery completes successfully.
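To illustrate step 2.e, the following sketch combines a stored environment configuration with secondary Region settings to form a request for CreateEnvironment. It is a simplified assumption of how this could be done, not the solution's implementation: the function name and the allow-list of copied fields are hypothetical, and a real request needs additional Region-specific values such as the execution role, source bucket, and network configuration.

```python
# Hypothetical allow-list: fields copied unchanged from the stored configuration.
# Read-only fields such as Arn, Status, and CreatedAt are deliberately left out.
COPIED_FIELDS = (
    "Name",
    "AirflowVersion",
    "EnvironmentClass",
    "DagS3Path",
    "MaxWorkers",
    "MinWorkers",
    "Schedulers",
    "AirflowConfigurationOptions",
)


def build_create_request(stored_environment, secondary_overrides):
    """Merge the stored primary configuration with secondary Region settings.

    `stored_environment` is the environment description saved in step 2.d.
    `secondary_overrides` carries user-supplied, Region-specific values such as
    NetworkConfiguration, SourceBucketArn, and ExecutionRoleArn.
    """
    request = {
        key: stored_environment[key]
        for key in COPIED_FIELDS
        if key in stored_environment
    }
    # Region-specific values always win over anything copied from the primary.
    request.update(secondary_overrides)
    return request
```

Using an allow-list rather than removing known read-only fields means that a field added to the stored description later is not passed to the create call by accident.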

Considerations

Consider the following when using the backup and restore method:

Recovery Time Objective – From failure detection to workflows running in the secondary Region, failover can take over 30 minutes. This includes new environment creation, Airflow startup, and metadata restore.
Cost – This strategy avoids the overhead of running a passive environment in the secondary Region. Costs are limited to periodic backup storage, cross-Region data transfer charges, and minimal compute for the recovery workflow.
Data loss – The RPO depends on the backup frequency. There is a design trade-off to consider here. Although shorter intervals between backups can minimize potential data loss, too frequent backups can adversely affect the performance of the metadata database and consequently the primary Airflow environment. Also, the solution can’t recover an actively running workflow midway. All active workflows are started fresh in the secondary Region based on the provided schedule.
Ongoing management – The Amazon MWAA environment and dependencies are automatically kept in sync across Regions in this architecture. As specified in Step 1.b of the backup workflow, the DAGs S3 bucket needs a one-time replication of the existing objects for the solution to work.

Warm standby architecture

The warm standby strategy involves deploying identical Amazon MWAA environments in two Regions. Periodic metadata backups from the primary Region are used to rehydrate the standby environment in case of failover.

The project uses the AWS CDK and is set up like a standard Python project. Refer to the detailed deployment steps in the README file to deploy it in your own accounts.

The following diagram shows the architecture of the warm standby strategy and its key components:

Primary Amazon MWAA environment – The environment in the primary Region hosts the workflows during normal operation
Secondary Amazon MWAA environment – The environment in the secondary Region acts as a warm standby ready to take over at any time
Metadata backup bucket – The bucket in the primary Region stores periodic backups of Airflow metadata tables
Replicated backup bucket – The bucket in the secondary Region syncs metadata backups through Amazon S3 cross-Region replication
Backup workflow – This workflow periodically backs up Airflow metadata to the S3 buckets in both Regions
Recovery workflow – This workflow monitors the primary environment and initiates failover to the secondary environment when needed

Figure 2: The warm standby architecture

Similar to the backup and restore strategy, the backup workflow (Steps 1a–1d) periodically backs up critical Amazon MWAA metadata to S3 buckets in the primary Region, which are synced to the secondary Region.

The recovery workflow runs periodically in the secondary Region monitoring the primary environment. On failure detection, it initiates the failover procedure. The steps are as follows (see Figure 2):

[2.a] The EventBridge scheduler starts the Step Functions workflow on a provided schedule.
[2.b] The workflow checks CloudWatch in the primary Region for the scheduler heartbeat metrics and detects failure. If the primary environment is healthy, the workflow completes without further actions.
[2.c] The workflow invokes the DAG to restore metadata from the backup S3 bucket.
[2.d] The DAG for restoring metadata completes hydrating the passive environment and notifies the Step Functions workflow of completion using the task token integration. The passive environment starts running the active workflows on the provided schedules.
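The task token callback in the final step can be sketched as follows. This is an assumption-laden illustration: the function name and the shape of the output document are hypothetical, and `stepfunctions` is assumed to be a boto3 Step Functions client. The task token is the one Step Functions hands to the DAG when the workflow pauses to wait for the restore.

```python
import json


def notify_restore_complete(stepfunctions, task_token, restored_tables):
    """Report a successful metadata restore back to the waiting workflow.

    `stepfunctions` is expected to behave like boto3.client("stepfunctions").
    The output document shape is hypothetical; Step Functions only requires
    that the output is a JSON string.
    """
    output = json.dumps({"status": "restored", "tables": sorted(restored_tables)})
    stepfunctions.send_task_success(taskToken=task_token, output=output)
    return output
```

A failed restore would be reported in a similar way so that the workflow does not wait until its timeout expires.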

Because the secondary environment is already warmed up, the failover is faster with recovery times in minutes.

Considerations

Consider the following when using the warm standby method:

Recovery Time Objective – With a warm standby ready, the RTO can be as low as 5 minutes. This includes just the metadata restore and reenabling DAGs in the secondary Region.
Cost – This strategy has an added cost of running similar environments in two Regions at all times. With auto scaling for workers, the warm instance can maintain a minimal footprint; however, the web server and scheduler components of Amazon MWAA will remain active in the secondary environment at all times. The trade-off is significantly lower RTO.
Data loss – Similar to the backup and restore model, the RPO depends on the backup frequency. Faster backup cycles minimize potential data loss but can adversely affect performance of the metadata database and consequently the primary Airflow environment.
Ongoing management – This approach comes with some management overhead. Unlike the backup and restore strategy, any changes to the primary environment configurations need to be manually reapplied to the secondary environment to keep the two environments in sync. Automated synchronization of the secondary environment configurations is future work.

Shared considerations

Although the backup and restore and warm standby strategies differ in their implementation, they share some common considerations:

Periodically test failover to validate recovery procedures, RTO, and RPO.
Enable Amazon MWAA environment logging to help debug issues during failover.
Use the AWS CDK or AWS CloudFormation to manage the infrastructure definition. For more details, see the following GitHub repo or Quick start tutorial for Amazon Managed Workflows for Apache Airflow, respectively.
Automate deployments of environment configurations and disaster recovery workflows through CI/CD pipelines.
Monitor key CloudWatch metrics like SchedulerHeartbeat to detect primary environment failures.

Conclusion

In this series, we discussed how backup and restore and warm standby strategies offer configurable data protection based on your RTO, RPO, and cost requirements. Both use periodic metadata replication and restoration to minimize the area of effect of Regional outages.

Which strategy resonates more with your use case? Feel free to try out our solution and share any feedback or questions in the comments section!

About the Authors

Chandan Rupakheti is a Senior Solutions Architect at AWS. His main focus at AWS lies in the intersection of Analytics, Serverless, and AdTech services. He is a passionate technical leader, researcher, and mentor with a knack for building innovative solutions in the cloud. Outside of his professional life, he loves spending time with his family and friends, as well as listening to and playing music.

Parnab Basak is a Senior Solutions Architect and a Serverless Specialist at AWS. He specializes in creating new solutions that are cloud native using modern software development practices like serverless, DevOps, and analytics. Parnab works closely in the analytics and integration services space helping customers adopt AWS services for their workflow orchestration needs.

Implement a full stack serverless search application using AWS Amplify, Amazon Cognito, Amazon API Gateway, AWS Lambda, and Amazon OpenSearch Serverless

2024-05-31 Anand Komandooru

Post Syndicated from Anand Komandooru original https://aws.amazon.com/blogs/big-data/implement-a-full-stack-serverless-search-application-using-aws-amplify-amazon-cognito-amazon-api-gateway-aws-lambda-and-amazon-opensearch-serverless/

Designing a full stack search application requires addressing numerous challenges to provide a smooth and effective user experience. This encompasses tasks such as integrating diverse data from various sources with distinct formats and structures, optimizing the user experience for performance and security, providing multilingual support, and optimizing for cost, operations, and reliability.

Amazon OpenSearch Serverless is a powerful and scalable search and analytics engine that can significantly contribute to the development of search applications. It allows you to store, search, and analyze large volumes of data in real time, offering scalability, real-time capabilities, security, and integration with other AWS services. With OpenSearch Serverless, you can search and analyze a large volume of data without having to worry about the underlying infrastructure and data management. An OpenSearch Serverless collection is a group of OpenSearch indexes that work together to support a specific workload or use case. Collections have the same kind of high-capacity, distributed, and highly available storage volume that’s used by provisioned Amazon OpenSearch Service domains, but they remove complexity because they don’t require manual configuration and tuning. Each collection that you create is protected with encryption of data at rest, a security feature that helps prevent unauthorized access to your data. OpenSearch Serverless also supports OpenSearch Dashboards, which provides an intuitive interface for analyzing data.

OpenSearch Serverless supports three primary use cases:

Time series – The log analytics workloads that focus on analyzing large volumes of semi-structured, machine-generated data in real time for operational, security, user behavior, and business insights
Search – Full-text search that powers applications in your internal networks (content management systems, legal documents) and internet-facing applications, such as ecommerce website search and content search
Vector search – Semantic search on vector embeddings that simplifies vector data management and powers machine learning (ML) augmented search experiences and generative artificial intelligence (AI) applications, such as chatbots, personal assistants, and fraud detection

In this post, we walk you through a reference implementation of a full-stack cloud-centered serverless text search application designed to run using OpenSearch Serverless.

Solution overview

The following services are used in the solution:

AWS Amplify is a set of purpose-built tools and features that enables frontend web and mobile developers to quickly and effortlessly build full-stack applications on AWS. These tools have the flexibility to use the breadth of AWS services as your use cases evolve. This solution uses the Amplify CLI to build the serverless movie search web application. The Amplify backend is used to create resources such as the Amazon Cognito user pool, API Gateway, Lambda function, and Amazon S3 storage.
Amazon API Gateway is a fully managed service that makes it straightforward for developers to create, publish, maintain, monitor, and secure APIs at any scale. We use API Gateway as a “front door” for the movie search application for searching movies.
Amazon CloudFront accelerates the delivery of web content such as static and dynamic web pages, video streams, and APIs to users across the globe by caching content at edge locations closer to the end-users. This solution uses CloudFront with Amazon S3 to deliver the search application user interface to the end users.
Amazon Cognito makes it straightforward to add authentication, user management, and data synchronization without having to write backend code or manage any infrastructure. We use Amazon Cognito to create a user pool so the end-user can log in to the movie search application through Amazon Cognito.
AWS Lambda is a serverless, event-driven compute service that lets you run code for virtually any type of application or backend service without provisioning or managing servers. Our solution uses a Lambda function to query OpenSearch Serverless. API Gateway forwards all requests to the Lambda function to serve up the requests.
Amazon OpenSearch Serverless is a serverless option for OpenSearch Service. In this post, you use common methods for searching documents in OpenSearch Service that improve the search experience, such as request body searches using domain-specific language (DSL) for queries. The query DSL lets you specify the full range of OpenSearch search options, including pagination and sorting the search results. Pagination and sorting are implemented on the server side using DSL as part of this implementation.
Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance. The solution uses Amazon S3 as storage for storing movie trailers.
AWS WAF helps protect web applications from attacks by allowing you to configure rules that allow, block, or monitor (count) web requests based on conditions that you define. We use AWS WAF to allow access to the movie search app from only IP addresses on an allow list.

The following diagram illustrates the solution architecture.

The workflow includes the following steps:

The end-user accesses the CloudFront and Amazon S3 hosted movie search web application from their browser or mobile device.
The user signs in with their credentials.
A request is made to an Amazon Cognito user pool for a login authentication token, and a token is received for a successful sign-in request.
The search application calls the search API method with the token in the authorization header to API Gateway. API Gateway is protected by AWS WAF to enforce rate limiting and implement allow and deny lists.
API Gateway passes the token for validation to the Amazon Cognito user pool. Amazon Cognito validates the token and sends a response to API Gateway.
API Gateway invokes the Lambda function to process the request.
The Lambda function queries OpenSearch Serverless and returns the metadata for the search.
Based on metadata, content is returned from Amazon S3 to the user.

In the following sections, we walk you through the steps to deploy the solution, ingest data, and test the solution.

Prerequisites

Before you get started, make sure you complete the following prerequisites:

Install the latest LTS version of Node.js.
Install and configure the AWS Command Line Interface (AWS CLI).
Install awscurl for data ingestion.
Install and configure the Amplify CLI. At the end of configuration, you should successfully set up the new user using the amplify-dev user’s AccessKeyId and SecretAccessKey in your local machine’s AWS profile.
Amplify users need additional permissions in order to deploy AWS resources. Complete the following steps to create a new inline AWS Identity and Access Management (IAM) policy and attach it to the user:

- On the IAM console, choose Users in the navigation pane.
- Choose the user amplify-dev.
- On the Permissions tab, choose the Add permissions dropdown menu, then choose Inline policy.
- In the policy editor, choose JSON.

You should see the default IAM statement in JSON format.

- Copy the file contents in AddionalPermissions-Amplify, replacing the tags with your target AWS Region, account, and environment.

This environment name needs to be used when performing amplify init when bringing up the backend. The actions in the IAM statement are largely open (*) but restricted or limited by the target resources; this is done to satisfy the maximum inline policy length (2,048 characters).

- Enter the updated JSON into the policy editor, then choose Next.
- For Policy name, enter a name (for this post, AddionalPermissions-Amplify).
- Choose Create policy.

You should now see the new inline policy attached to the user.

Deploy the solution

Complete the following steps to deploy the solution:

Clone the repository to a new folder on your desktop using the following command:

git clone https://github.com/aws-samples/amazon-opensearchserverless-searchapp.git

Deploy the movie search backend.
Deploy the movie search frontend.

Ingest data

To ingest the sample movie data into the newly created OpenSearch Serverless collection, complete the following steps:

On the OpenSearch Service console, choose Ingestion: Pipelines in the navigation pane.
Choose the pipeline movie-ingestion and locate the ingestion URL.

Replace the ingestion endpoint and Region in the following snippet and run the awscurl command to save data into the collection:

awscurl --service osis --region <region> \
-X POST \
-H "Content-Type: application/json" \
-d "@project_assets/movies-data.json" \
https://<ingest_url>/movie-ingestion/data

You should see a 200 OK response.

On the Amazon S3 console, open the trailer S3 bucket (created as part of the backend deployment).
Upload some movie trailers.

Storage

Make sure the file name matches the ID field in the sample movie data (for example, tt1981115.mp4, tt0800369.mp4, and tt0172495.mp4). A trailer uploaded as tt0172495.mp4 is used as the default trailer for all movies, so you don’t have to upload one for each movie.

Test the solution

Access the application using the CloudFront distribution domain name. You can find this by opening the CloudFront console, choosing the distribution, and copying the distribution domain name into your browser.

Sign up for application access by entering your user name, password, and email address. The password should be at least eight characters in length, and should include at least one uppercase character and one symbol.

After you’re logged in, you’re redirected to the Movie Finder home page.

Home Page

You can search using a movie name, actor, or director, as shown in the following example. The application returns results using OpenSearch DSL.

Search Results

If there’s a large number of search results, you can navigate through them using the pagination option at the bottom of the page. For more information about how the application uses pagination, see Paginating search results.

Pagination

You can choose movie tiles to get more details and watch the trailer if you took the optional step of uploading a movie trailer.

Movie Details

You can sort the search results using the Sort by feature. The application uses the sort functionality within OpenSearch.

Sort

There are many more DSL search patterns that allow for intricate searches. See Query DSL for complete details.
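As a minimal sketch of how a request body with server-side pagination and sorting could be assembled in Query DSL, consider the following. The function name and the field names `title`, `actors`, and `directors` are hypothetical and are not taken from the sample application, so adjust them to match your index mapping.

```python
def build_search_body(text, page=1, page_size=10, sort_field=None, descending=False):
    """Build a Query DSL request body with pagination and optional sorting."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    body = {
        # "from" is the zero-based offset of the first hit on the requested page.
        "from": (page - 1) * page_size,
        "size": page_size,
        "query": {
            "multi_match": {
                "query": text,
                # Hypothetical field names; use the fields in your own mapping.
                "fields": ["title", "actors", "directors"],
            }
        },
    }
    if sort_field:
        body["sort"] = [{sort_field: {"order": "desc" if descending else "asc"}}]
    return body
```

When no sort field is given, the `sort` key is omitted and results come back in the default relevance order.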

Monitoring OpenSearch Serverless

Monitoring is an important part of maintaining the reliability, availability, and performance of OpenSearch Serverless and your other AWS services. AWS provides Amazon CloudWatch and AWS CloudTrail to monitor OpenSearch Serverless, report when something is wrong, and take automatic actions when appropriate. For more information, see Monitoring Amazon OpenSearch Serverless.

Clean up

To avoid unnecessary charges, clean up the solution implementation by running the following command at the project root folder you created using the git clone command during deployment:

amplify delete

You can also clean up the solution by deleting the AWS CloudFormation stack you deployed as part of the setup. For instructions, see Deleting a stack on the AWS CloudFormation console.

Conclusion

In this post, we implemented a full-stack serverless search application using OpenSearch Serverless. This solution seamlessly integrates with various AWS services, such as Lambda for serverless computing, API Gateway for constructing RESTful APIs, IAM for robust security, Amazon Cognito for streamlined user management, and AWS WAF for safeguarding the web application against threats. By adopting a serverless architecture, this search application offers numerous advantages, including simplified deployment processes and effortless scalability, with the benefits of a managed infrastructure.

With OpenSearch Serverless, you get the same interactive millisecond response times as OpenSearch Service with the simplicity of a serverless environment. You pay only for what you use, because OpenSearch Serverless automatically scales resources as needed to provide the right amount of capacity for your application without impacting performance. You can use OpenSearch Serverless and this reference implementation to build your own full-stack text search application.

About the Authors

Anand Komandooru is a Principal Cloud Architect at AWS. He joined AWS Professional Services organization in 2021 and helps customers build cloud-native applications on AWS cloud. He has over 20 years of experience building software and his favorite Amazon leadership principle is “Leaders are right a lot”.

Rama Krishna Ramaseshu is a Senior Application Architect at AWS. He joined AWS Professional Services in 2022 and with close to two decades of experience in application development and software architecture, he empowers customers to build well architected solutions within the AWS cloud. His favorite Amazon leadership principle is “Learn and Be Curious”.

Sachin Vighe is a Senior DevOps Architect at AWS. He joined AWS Professional Services in 2020, and specializes in designing and architecting solutions within the AWS cloud to guide customers through their DevOps and Cloud transformation journey. His favorite leadership principle is “Customer Obsession”.

Molly Wu is an Associate Cloud Developer at AWS. She joined AWS Professional Services in 2023 and specializes in assisting customers in building frontend technologies in AWS cloud. Her favorite leadership principle is “Bias for Action”.

Andrew Yankowsky is a Security Consultant at AWS. He joined AWS Professional Services in 2023, and helps customers build cloud security capabilities and follow security best practices on AWS. His favorite leadership principle is “Earn Trust”.

Using the circuit-breaker pattern with AWS Lambda extensions and Amazon DynamoDB

2024-05-16 James Beswick

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/using-the-circuit-breaker-pattern-with-aws-lambda-extensions-and-amazon-dynamodb/

This post is written by Alan Oberto Jimenez, Senior Cloud Application Architect, and Tobias Drees, Cloud Application Architect.

Modern software systems frequently rely on remote calls to other systems across networks. When failures occur, they can cascade across multiple services causing service disruptions. One technique for mitigating this risk is the circuit breaker pattern, which can detect and isolate failures in a distributed system. The circuit breaker pattern can help prevent cascading failures and improve overall system stability.

The pattern isolates the failing service and thus prevents cascading failures. It improves overall responsiveness by avoiding long waits for timeouts. It also increases the fault tolerance of the system, because it lets the system resume interacting with the affected service once that service is available again.

This blog post presents an example application, showing how AWS Lambda extensions integrate with Amazon DynamoDB to implement the circuit breaker pattern.

Using Lambda extensions to implement the circuit breaker pattern

AWS Lambda extensions provide a way to integrate monitoring, observability, security, and governance tools into the Lambda execution environment without complex installation or configuration management. You can run extensions either as part of the runtime process (an internal extension) or as a separate process in the execution environment (an external extension).

Lambda extensions enable the circuit breaker pattern without modifying the core function code. An external extension, running as a separate process, checks whether a given service is reachable. This approach decouples the business logic in the Lambda function from failure detection and allows the same Lambda extension to be reused across different Lambda functions. Both decoupling code with different purposes and reusing code are in line with the best practices for building Lambda functions.

Pinging a microservice at each Lambda invocation increases network traffic and latency. Circuit breaker implementations therefore benefit from a caching layer that stores the state of the microservices. The Lambda extension fetches the status of a microservice from a database and keeps the result in memory for a specified time, avoiding a disk write. The Lambda function checks the extension cache before pinging the microservice, which reduces network traffic. Lambda extensions are well suited to building a caching layer for Lambda functions: an in-memory cache inside the execution environment is more secure, easier to manage, and more performant and available than calling a network resource on every invocation.
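The caching behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not the code from the sample repository: the class name `CircuitStateCache`, the `fetch_state` loader (standing in for the DynamoDB read), and the 60-second default TTL are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, Tuple


class CircuitStateCache:
    """Keeps circuit states in memory and reloads them once a TTL has expired."""

    def __init__(
        self,
        fetch_state: Callable[[str], str],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_state = fetch_state  # hypothetical loader, e.g. a DynamoDB read
        self._ttl_seconds = ttl_seconds  # assumed refresh interval
        self._clock = clock              # injectable so the cache can be tested
        self._entries: Dict[str, Tuple[str, float]] = {}

    def get(self, service_id: str) -> str:
        now = self._clock()
        cached = self._entries.get(service_id)
        if cached is not None and now - cached[1] < self._ttl_seconds:
            return cached[0]  # cache hit: no database or network call
        state = self._fetch_state(service_id)  # cache miss or expired entry
        self._entries[service_id] = (state, now)
        return state
```

Because the entries live only in process memory, they disappear when the execution environment is terminated, which matches the no-disk-write behavior described above.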

Overview

1. The main function process handles the event after every AWS Lambda invocation. Before performing any external call against the external components, it listens for HTTP POST events from the Lambda extension process to fetch the last status of the circuits.
2. The extension process provides the circuit state to the main process via HTTP POST.
   1. The extension checks its internal cache and returns a valid value if available, otherwise reads the state of the circuits from the DynamoDB table and updates the cache.
   2. Finally, the extension process returns the state of the circuits to the main function via an API call response.
   3. Because of the Lambda extensions lifecycle, this process occurs periodically to keep the local cache updated until the execution environment is terminated.
3. If the circuit is in the OPEN state, the main function process executes calls against the external microservices, otherwise the process returns a local response.
4. An Amazon EventBridge event periodically invokes a Lambda responsible for updating the circuit states.
5. This Lambda function performs the validations needed to determine the status of the different remote microservices (circuits) with an Amazon API Gateway entrypoint.
6. The Lambda function writes the result of the verification process to the DynamoDB table.
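The decision the main function makes once it has the circuit state can be sketched as follows. This is an illustrative sketch only: the function and parameter names are hypothetical, the 503 status code for the local response is an assumption, and the code follows this post's convention that “OPEN” means the microservice is reachable.

```python
from typing import Callable, Dict


def handle_request(
    get_circuit_state: Callable[[str], str],
    call_microservice: Callable[[], Dict],
    service_id: str,
) -> Dict:
    """Calls the remote service only when its circuit reports it as reachable."""
    state = get_circuit_state(service_id)  # e.g. served from the extension's cache
    if state == "OPEN":  # this post's convention: OPEN means the service is healthy
        return call_microservice()
    # Any other state: skip the remote call and answer locally (assumed response shape).
    return {
        "statusCode": 503,
        "body": f"{service_id} is unavailable; returning local response",
    }
```

Passing the state lookup and the remote call in as callables keeps the business logic independent of how the extension is reached, which mirrors the decoupling the pattern aims for.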

Walkthrough

The following prerequisites are required to complete the walkthrough:

An active AWS account
AWS CLI 2.15.17 or later
AWS SAM CLI 1.116.0 or later
Git 2.39.3 or later
Python 3.12

Initial setup

Clone the code from GitHub onto a local machine:

```
git clone https://github.com/aws-samples/implementing-the-circuit-breaker-pattern-with-lambda-extensions-and-dynamodb.git
```

To install the packages, utilize a virtual environment:

```
python -m venv circuit_breaker_venv && source circuit_breaker_venv/bin/activate
```

To prepare the services for deployment, execute the following AWS Serverless Application Model (SAM) command:
```
sam build
```
To deploy the services, use this command specifying the AWS CLI profile (in the config file in the .aws folder) for the AWS account to deploy the services in:
```
sam deploy --guided --profile <AWSProfile>
```
Answer the question prompts as appropriate.
You can deploy subsequent local changes in the code with:
```
sam build 
sam deploy
```

Testing and adjusting the solution

The Lambda function that updates the state in DynamoDB runs every minute, as specified by the template. About one minute after deployment, once the function has run for the first time, the DynamoDB entry containing the status (“OPEN” or “CLOSED”) is ready. Because the mock API is part of the stack and reachable, the status is “OPEN”.

You can invoke the My Microservice Lambda function manually to see:

The Lambda function updating the state in DynamoDB is invoked with an EventBridge rule that specifies the URL and the ID of the service to be monitored. By creating a new EventBridge rule with the correct URL and a new ID, you can use the AWS SAM template for monitoring multiple services.

To add a new EventBridge rule, add this to the template:

```
  NewEventRule:
    Type: AWS::Events::Rule
    Properties:
      Description: Event rule to trigger the Lambda function with a JSON payload
      ScheduleExpression: rate(1 minute)
      State: ENABLED
      Targets:
        - Arn: !GetAtt UpdatingStateLambda.Arn
          Id: TargetFunction
          Input: '{ "URL": "https://aws.amazon.com/", "ID": "NewMicroservice"}'  # Add the JSON payload here

  MyPermissionForNewEventRule:
    Type: AWS::Lambda::Permission
    Properties:
      FunctionName: !Ref UpdatingStateLambda
      Action: lambda:InvokeFunction
      Principal: events.amazonaws.com
      SourceArn: !GetAtt NewEventRule.Arn
```

In the Lambda function that contains the business logic, add the following environment variables. For more complex cases with multiple microservices to monitor, however, it’s recommended to use AWS AppConfig. With AWS AppConfig, you can store configurations for Lambda functions, which enables more granular control than environment variables.

```
Environment:
  Variables:
    service_name: "NewMicroservice"
```

You can adjust the logic of this Lambda function by changing the code in my-microservice/lambda-handler.py or directly in the Lambda section of the AWS Management Console.

To use the circuit breaker Lambda extension with your own Lambda function, include the extension as a layer:

```
BusinessLogicMicroservice:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: business-logic-microservice/
      Handler: lambda_function.lambda_handler
      MemorySize: 128
      Policies:
      - DynamoDBCrudPolicy:
          TableName: !Ref CircuitBreakerStateTable
      Timeout: 100
      Runtime: python3.12
      Layers:
      - !Ref CircuitBreakerExtensionLayer
```

Circuit breaker in closed state

So far, the sample application only features an open circuit breaker state signaling a functioning microservice. This section simulates an unresponsive microservice to test the behavior of the system with a closed-circuit breaker state.

In the template.yaml file, edit the API_URL environment variable of the MyMicroservice Lambda function (line 47) and the URL in the input of the event rule that triggers the state-updating Lambda function (line 107). Set both to a domain that times out, such as “https://aws.amazon.com:81/”.
```
API_URL: "https://aws.amazon.com:81/"
Input: '{ "URL": "https://aws.amazon.com:81/", "ID": "MyMicroservice"}'
```
Deploy these changes:
```
sam build
sam deploy
```

The event rule invokes the Lambda function, updating the state every minute. To see the output of this Lambda function, invoke it manually:

This Lambda function changes the DynamoDB entry for this URL to:

The MyMicroservice Lambda function receives the DynamoDB entries for the status over HTTP from the Circuit Breaker Lambda extension and proceeds with the logic following a closed state. The output of invoking the Lambda manually is:

This shows the circuit breaker pattern working as intended. In the Lambda function that updates the state, the request timeout is set to 4 seconds; you can adjust it to suit your use case.

requests.get(API_URL, headers=headers, timeout=4)
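The way such a timeout translates into a circuit state can be sketched as below. This is a hedged illustration rather than the repository's code: `check_service` and its `http_get` parameter are hypothetical names, and treating HTTP error statuses (400 and above) as “CLOSED” is an assumption. The HTTP client is injected so the sketch runs without network access; in practice you might pass `requests.get`.

```python
from typing import Callable


def check_service(
    url: str,
    http_get: Callable[..., object],
    timeout_seconds: float = 4.0,
) -> str:
    """Returns "OPEN" when the service answers in time, otherwise "CLOSED"."""
    try:
        response = http_get(url, timeout=timeout_seconds)
    except Exception:
        # Timeouts and connection errors both mean the service is not reachable.
        # With requests you would typically catch requests.exceptions.RequestException.
        return "CLOSED"
    status_code = getattr(response, "status_code", 500)
    # Assumption: an HTTP error response also counts as an unhealthy service.
    return "OPEN" if status_code < 400 else "CLOSED"
```

A shorter timeout detects failures sooner but risks marking a slow yet healthy service as unavailable, so the value is worth tuning per service.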

Clean-up

To delete all resources from this stack, run:

sam delete --stack-name new-circuit-breaker-sam-stack

Security

The provided AWS SAM template does not provide an Amazon Virtual Private Cloud (VPC) in which to host the resources. Integrate the resources into an appropriate networking configuration if you are using it in production applications.

The solution has auditability characteristics, as calls to the circuit breaker and to the microservices are logged to the Amazon CloudWatch log group. The audit log is encrypted using AWS Key Management Service.

To monitor the security of your account with the solution, use Amazon GuardDuty, AWS CloudTrail, AWS Config, and AWS WAF for API Gateway.

Conclusion

The circuit breaker pattern is a powerful tool for helping to ensure the resiliency and stability of serverless applications. Lambda extensions are a good fit for its implementation, as demonstrated in this example. With the provided Lambda extension and code, you can incorporate the circuit breaker pattern into your applications and customize it to suit your specific requirements, helping to ensure a robust and reliable system.

For more serverless learning resources, visit Serverless Land.

Analyze Elastic IP usage history using Amazon Athena and AWS CloudTrail

2024-05-15 Aidin Khosrowshahi

Post Syndicated from Aidin Khosrowshahi original https://aws.amazon.com/blogs/big-data/analyze-elastic-ip-usage-history-using-amazon-athena-and-aws-cloudtrail/

An AWS Elastic IP (EIP) address is a static, public, and unique IPv4 address. Allocated exclusively to your AWS account, the EIP remains under your control until you decide to release it. It can be allocated to your Amazon Elastic Compute Cloud (Amazon EC2) instance or other AWS resources such as load balancers.

EIP addresses are designed for dynamic cloud computing because they can be re-mapped to another instance to mask any disruptions. These EIPs are also used for applications that must make external requests to services that require a consistent address for allow listed inbound connections. As your application usage varies, these EIPs might see sporadic use over weeks or even months, leading to potential accumulation of unused EIPs that may inadvertently inflate your AWS expenditure.

In this post, we show you how to analyze EIP usage history using AWS CloudTrail and Amazon Athena to gain better insight into the EIP usage patterns in your AWS account. You can use this solution regularly as part of your cost-optimization efforts to safely remove unused EIPs and reduce your costs.

Solution overview

This solution uses activity logs from CloudTrail and the power of Athena to conduct a comprehensive analysis of historical EIP attachment activity within your AWS account. CloudTrail, a critical AWS service, meticulously logs API activity within an AWS account.

Athena is an interactive query service that simplifies data analysis in Amazon Simple Storage Service (Amazon S3) using standard SQL. It is a serverless service, so there is no infrastructure to manage and you pay only for the queries you run.

By extracting detailed information from CloudTrail and querying it using Athena, this solution streamlines the process of data collection, analysis, and reporting of EIP usage within an AWS account.

To gather EIP usage reporting, this solution compares snapshots of the current EIPs, focusing on their most recent attachment within a customizable 3-month period. It then determines the frequency of EIP attachments to resources. An attachment count greater than zero suggests that the EIPs are actively in use. In contrast, an attachment count of zero indicates that these EIPs are idle and can be released, aiding in identifying potential areas for cost reduction.

In the following sections, we show you how to deploy the solution using AWS CloudFormation and then run an analysis.

Prerequisites

Complete the following prerequisite steps:

If your account doesn’t have CloudTrail enabled, create a trail, then capture the S3 bucket name to use later in the implementation steps.
Download the CloudFormation template from the repository. You need this template.yaml file for the implementation steps.

Deploy the solution

In this section, you use AWS CloudFormation to create the required resources. AWS CloudFormation is a service that helps you model and set up your AWS resources so that you can spend less time managing those resources and more time focusing on your applications that run in AWS.

The CloudFormation template creates Athena views and a table to search past AssociateAddress events in CloudTrail, an AWS Lambda function to collect snapshots of existing EIPs, and an S3 bucket to store the analysis results.

Complete the following steps:

On the AWS CloudFormation console, choose Create stack and choose With new resources (standard).
In the Specify Template section, choose an existing template and upload the template.yaml file downloaded from the prerequisites.
In the Specify stack details section, enter your preferred stack name and the existing CloudTrail S3 location, and maintain the default settings for the other parameters.
At the bottom of the Review and create page, select the acknowledgement check box, then choose Submit.

Wait for the stack to be created. It should take a few minutes to complete. You can open the AWS CloudFormation console to view the stack creation process.

Run an analysis

You have configured the solution to run your EIP attachments analysis. Complete the following steps to analyze your EIP attachment history. If you’re using Athena for the first time in your account, you need to set up a query result location in Amazon S3.

On the Athena console, navigate to the query editor.
For Database, choose default.
Enter the following query and choose Run query:

```
select
  eip.publicip,
  eip.allocationid,
  eip.region,
  eip.accountid,
  eip.associationid,
  eip.PublicIpv4Pool,
  max(associate_ip_event.eventtime) as latest_attachment,
  count(associate_ip_event.associationid) as attachmentCount
from eip
left join associate_ip_event on associate_ip_event.allocationid = eip.allocationid
group by 1,2,3,4,5,6
```

All the required tables are created under the default database.

You can now run a query on the CloudTrail logs to look back in time for the EIP attachment. This query provides you with better insight to safely release idle EIPs in order to reduce costs by displaying how frequently each specific EIP was previously attached to any resources.

This report will provide the following information:

Public IP
Allocation ID (the ID that AWS assigns to represent the allocation of the EIP address for use with instances in a VPC)
Region
Account ID
latest_attachment date (the last time EIP was attached to a resource)
attachmentCount (number of attachments)
The association ID for the address (if this field is empty, the EIP is idle and not attached to any resources)
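As a rough illustration of how the report can be interpreted, the following Python sketch picks out the EIPs whose attachment count is zero. It assumes the query results have already been loaded as a list of dictionaries with lowercase keys (`publicip`, `attachmentcount`); the function name and the row shape are hypothetical and not part of the solution.

```python
from typing import Dict, Iterable, List


def find_idle_eips(report_rows: Iterable[Dict[str, object]]) -> List[str]:
    """Returns the public IPs that were never attached in the analyzed period."""
    idle = []
    for row in report_rows:
        # Query results often arrive as strings, so convert before comparing.
        attachment_count = int(row.get("attachmentcount", 0))
        if attachment_count == 0:
            idle.append(str(row["publicip"]))
    return idle


rows = [
    {"publicip": "203.0.113.10", "attachmentcount": "0"},
    {"publicip": "203.0.113.11", "attachmentcount": "3"},
]
idle_eips = find_idle_eips(rows)  # candidates to review before releasing
```

Before releasing an address from this list, it is still worth checking the association ID column, since an EIP that is currently attached is in use even if no attachment events fall within the analyzed period.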

The following screenshot shows the query results.

Clean up

To optimize cost, clean up the resources you deployed for this post by completing the following steps:

Delete the contents in your S3 buckets (eip-analyzer-eipsnapshot-* and eip-analyzer-athenaresulteipanalyzer-*).
Delete the S3 buckets.
On the AWS CloudFormation console, delete the stack you created.

Conclusion

This post demonstrated how you can analyze Elastic IP usage history with Athena and CloudTrail to gain better insight into EIP attachment patterns. Check out the GitHub repo to run this analysis regularly as part of your cost-optimization strategy, so that you can identify and release inactive EIPs and reduce costs.

You can also use Athena to analyze logs from other AWS services; for more information, see Querying AWS service logs.

Additionally, you can analyze activity logs with AWS CloudTrail Lake and Amazon Athena. AWS CloudTrail Lake is a managed data lake that enables organizations to aggregate, immutably store, and query events recorded by CloudTrail for auditing, security investigation, and operational troubleshooting. AWS CloudTrail Lake supports the collection of events from multiple AWS regions and AWS accounts. For CloudTrail Lake, you pay for data ingestion, retention, and analysis. Refer to AWS CloudTrail Lake pricing page for pricing details.

About the Author

Aidin Khosrowshahi is a Senior Technical Account Manager with Amazon Web Services based out of San Francisco. He focuses on reliability, optimization, and improving operational mechanisms with his customers.

Governing and securing AWS PrivateLink service access at scale in multi-account environments

2024-05-14 Anandprasanna Gaitonde

Post Syndicated from Anandprasanna Gaitonde original https://aws.amazon.com/blogs/security/governing-and-securing-aws-privatelink-service-access-at-scale-in-multi-account-environments/

Amazon Web Services (AWS) customers have been adopting AWS PrivateLink for secure communication with AWS services, their own internal services, and third-party services in the AWS Cloud. As these environments scale, the number of PrivateLink connections outbound to external services and inbound to internal services increases, and the connections are spread across multiple accounts in virtual private clouds (VPCs). While AWS Identity and Access Management (IAM) policies allow you to control access to individual PrivateLink services, customers want centralized governance for the use of PrivateLink in adherence with organizational standards and security needs.

This post provides an approach for centralized governance for PrivateLink based services across your multi-account environment. It provides a way to create preventative controls through the use of service control policies (SCPs) and detective controls through event-driven automation. This allows your application teams to consume internal and external services while adhering to organization policies and provides a mechanism for centralized control as your AWS environment grows.

Scenarios faced by customers

Figure 1 shows an example customer environment comprising a multi-account structure created through AWS Organizations or using AWS Control Tower. There are separate organizational units (OUs) pertaining to different business units (BUs) with respective accounts. The business services’ account hosts several backend services that are utilized by consuming applications for their functionality. Since these services provide functionality to more than one internal application and will require access across VPC and account boundaries, these are exposed through AWS PrivateLink. One such service is shown in the business services account.

The customer has partners that provide services for integration with the customer’s application stack. The approved partner account provides a service that is approved for use by the cloud administration team. The NotApproved partner account provides services that are not approved within the customer’s organization. The customer has another OU dedicated to application teams. The application 1 account has an application that consumes the business service of the approved partner account. It is also planning to use the service from the NotApproved partner, which should be blocked. The application in the application 2 account is planning on using AWS services through interface endpoints as well as the approved partner account through PrivateLink integration.

Note: Throughout this post, “organization” is used to refer to an organization that you create and manage through AWS Organizations.

Figure 1: A multi-account customer environment

Current challenges

Access to individual PrivateLink connections can be controlled through IAM policies. At scale, however, different teams adopt PrivateLink for incoming and outgoing connections, and the number of VPC endpoint policies to create and manage increases. As the environment grows, customers want centralized guardrails to manage PrivateLink resources. For our example, the customer would like to put the following controls in place:

Preventative controls:

Use case 1:

Allow creation of VPC endpoints and allow access only to PrivateLink enabled AWS services.
Allow creation of VPC endpoints and initiating connection only to approved PrivateLink enabled third-party services.
Allow creation of VPC endpoints and initiating connection only to internal business services owned by accounts in the same organization.

Use case 2:

Allow only a cloud admin role to add permissions to connect to an endpoint service to prevent connections from external clients to internal VPC endpoint services.

Detective controls:

Use case 3:

Detect if connections are made by external AWS accounts (not belonging to the customer’s organization) to PrivateLink services exposed for internal use by the customer’s AWS accounts.

Use case 4:

Detect if connections are made to PrivateLink services exposed by AWS accounts not belonging to the customer’s organization.

This post presents a solution that uses SCPs, AWS CloudTrail, and AWS Config to achieve governance. When the solution is deployed in your account, the following components are created as part of the architecture, as shown in Figure 2.

Figure 2: Resources deployed in the customer environment by the solution

The following architecture is now in place:

SCPs to provide preventative controls for the PrivateLink connections.
Amazon EventBridge rules that are configured to trigger based on events from API calls captured by CloudTrail in specified accounts within specified OUs.
EventBridge rules in member accounts to send events to the event bus in the Audit account, and a central EventBridge rule in that account to trigger an AWS Lambda function based on PrivateLink related API calls.
A Lambda function that receives the events and validates if the VPC endpoint API call is allowed for the PrivateLink service and notifies a cloud administrator if a policy is violated.
An AWS Config rule that checks if PrivateLink enabled VPC endpoint services created within your AWS accounts have enabled auto accept of client connections and disabled notifications.

Use cases and solution approach

This section walks through each use case and how the solution components are used to address each use case.

Preventative control

Use case 1: Allowing the creation of a VPC endpoint connection to only AWS services and approved internal and third-party PrivateLink services

This solution allows creating a VPC endpoint only for approved partner PrivateLink services, PrivateLink services internal to the organization, and AWS services. This is implemented using an SCP and can be enforced at the individual account or OU level. The approved partner services, as well as the internal accounts that can host allowed PrivateLink services, can be specified during the solution deployment. Application teams operating in AWS accounts within the customer environment can then create VPC endpoints to PrivateLink services of approved partners or AWS services. However, they cannot create a VPC endpoint to an unapproved PrivateLink service. This is shown in Figure 3.

Figure 3: Allowed and disallowed paths in PrivateLink connections by SCP

The SCP that implements this preventative control is shown in the following code snippet. In this example SCP, AllowedPrivateLinkPartnerService-ServiceName refers to the service name of the allowed partner PrivateLink service. The SCP also allows the creation of VPC endpoints to internal PrivateLink services that are hosted in AllowedPrivateLinkAccount. Make sure that this SCP does not interfere with the other policies you created within your organization. The solution currently uses the ec2:VpceServiceName and ec2:VpceServiceOwner condition keys to identify the PrivateLink service of AWS services or a third-party partner. These conditions can be used in an SCP to control the creation of VPC endpoints:

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Condition": {
        "StringNotEquals": {
          "ec2:VpceServiceName": [
            "AllowedPrivateLinkPartnerService-ServiceName"
          ],
          "ec2:VpceServiceOwner": [
            "AllowedPrivateLinkAccount",
            "amazon"
          ]
        }
      },
      "Action": [
        "ec2:CreateVpcEndpoint"
      ],
      "Resource": "arn:aws:ec2:*:*:vpc-endpoint/*",
      "Effect": "Deny",
      "Sid": "SCPDenyPrivateLink"
    }
  ]
}
```
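To make the effect of the two conditions easier to reason about, the following Python sketch models how the Deny statement is evaluated. It relies on standard IAM condition semantics: multiple keys under one operator must all match, and a negated operator such as StringNotEquals matches only when the request value equals none of the listed values. The function name and the sample values are hypothetical, and the sketch is a simplified model rather than a policy simulator.

```python
from typing import Iterable


def scp_denies_endpoint(
    service_name: str,
    service_owner: str,
    allowed_service_names: Iterable[str],
    allowed_owners: Iterable[str],
) -> bool:
    """Models the Deny statement: True means ec2:CreateVpcEndpoint is blocked."""
    # StringNotEquals on ec2:VpceServiceName: true if the name matches no listed value.
    name_not_allowed = service_name not in set(allowed_service_names)
    # StringNotEquals on ec2:VpceServiceOwner: true if the owner matches no listed value.
    owner_not_allowed = service_owner not in set(allowed_owners)
    # Both condition keys must match for the Deny to take effect.
    return name_not_allowed and owner_not_allowed
```

In other words, a request passes if either the service name is on the approved partner list or the service owner is an approved internal account or `amazon`; only requests that fail both checks are denied.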

Use case 2: Allow only a cloud admin role to add permissions to connect to an endpoint service

This solution makes sure that PrivateLink services that are owned and created in AWS accounts of the customer cannot be connected to consumers unless it is allowed by the cloud administrator role. The cloud administrator can then make sure that only legitimate internal AWS accounts are allowed access to that service and restrict access from other accounts outside of the customer’s organization. This is achieved through the use of a service control policy that will restrict modifications of permissions of the PrivateLink endpoint service. This makes sure that individual teams are not able to use the Allow principals configuration to open access to other entities directly, and only a cloud administrator role with the right permissions can make that change.

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Statement1",
      "Effect": "Deny",
      "Action": [
        "ec2:ModifyVpcEndpointServicePermissions"
      ],
      "Resource": [
        "*"
      ],
      "Condition": {
        "ArnNotLike": {
          "aws:PrincipalArn": [
            "arn:aws:iam::*:role/CloudNetworkAdmin"
          ]
        }
      }
    }
  ]
}
```

This policy helps achieve the access control shown in Figure 4. The cloud administrator uses the Allow principals configuration of the business services PrivateLink service to provide access only to the application 1 account. The SCP allows only the cloud administrator to make the modification and prevents other team members from bypassing that process and granting a nonapproved client application account access to the internal PrivateLink service.

Figure 4: Centralized control on access to the internal PrivateLink service to the customer’s own accounts

Detective controls

For detective controls, we discuss two use cases that are deployed as part of the solution and can be enabled and disabled based on the test that you want to perform.

Use case 3: Detecting if connections are made by external AWS accounts (not belonging to the customer’s organization) to PrivateLink services exposed by the customer’s AWS accounts

In this use case, the customer would like to detect if connections are made to their business services from accounts outside of its organization. The solution uses individual member account trails to capture API calls across the multi-account structure, together with cross-account EventBridge integration. When a PrivateLink service connection is accepted, CloudTrail in the member account records the AcceptVpcEndpointConnections API call, and the event is sent to the event bus in the audit account. This triggers a Lambda function that captures the information of the entity requesting the connection and the details of the PrivateLink service, and then sends a notification to the cloud administrator. This is shown in Figure 5.

Figure 5: Detecting the creation of a VPC endpoint or accepting a PrivateLink service connection using CloudTrail events in EventBridge

Custom AWS Config rule for detective control

This detective control mechanism works in cases where PrivateLink services are configured to manually accept client connections. If the endpoint is configured to automatically accept connections, CloudTrail will not generate an event when a connection is accepted. AWS PrivateLink allows customers to configure connection notifications to send connection notification events to an Amazon Simple Notification Service (Amazon SNS) topic. Cloud administrators can get the notifications if they are subscribed to the SNS topic. However, if the notification configuration is removed by the member account, there is no way for the cloud administrator to have visibility for new connections and effectively apply governance requirements.

This solution employs an AWS Config rule to detect if PrivateLink services are created with the Auto Accept Connections setting enabled or without a connection notification configuration and flag it as noncompliant.

This is depicted in Figure 6.

Figure 6: Custom AWS Config rule and SNS notification deployed as part of the solution

When a PrivateLink service is created by one of the business services teams, an AWS Config organization rule in the audit account will detect the event, and the custom Lambda function will check if the connection notification configuration is present. If not, then the AWS Config rule will flag the resource as noncompliant. Cloud administrators can view these in the AWS Config dashboard or receive notifications configured through AWS Config.
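The compliance decision made by the custom rule can be sketched as a small Python function. This is an assumption-laden illustration and not the solution's Lambda code: the function name and its two inputs (whether the service requires manual acceptance, and how many connection notifications are configured) are hypothetical, while the return values use the compliance types that AWS Config defines.

```python
def evaluate_endpoint_service(acceptance_required: bool, notification_count: int) -> str:
    """Returns the AWS Config compliance type for one VPC endpoint service."""
    if not acceptance_required:
        # Auto-accept is enabled, so new connections are accepted without review.
        return "NON_COMPLIANT"
    if notification_count == 0:
        # No connection notification is configured, so administrators lose visibility.
        return "NON_COMPLIANT"
    return "COMPLIANT"
```

A service is therefore compliant only when it both requires manual acceptance and has at least one connection notification configured.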

Use case 4: Detecting if connections are made to PrivateLink services exposed by AWS accounts not belonging to the customer’s organization.

Using the same approach as presented in use case 3, connections made to PrivateLink services exposed by AWS accounts outside of the customer’s organization can be detected through the CreateVpcEndpoint API call event from CloudTrail. This event is sent to the centralized event bus, where the Lambda function checks it against the criteria and notifies the cloud administrator.
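A check of this kind could look like the following Python sketch. It is not the solution's Lambda function: the function name is hypothetical, the exact layout of `requestParameters` in the CloudTrail record is an assumption, and so is the heuristic that names containing `vpce-svc-` identify customer or partner endpoint services rather than AWS services.

```python
from typing import Dict, Optional, Set


def find_violation(event: Dict, approved_service_names: Set[str]) -> Optional[str]:
    """Returns a notification message if the event targets an unapproved service."""
    detail = event.get("detail", {})
    if detail.get("eventName") != "CreateVpcEndpoint":
        return None  # not the API call this check is about
    # Assumed location of the service name inside the CloudTrail record.
    request = detail.get("requestParameters") or {}
    service_name = request.get("CreateVpcEndpointRequest", {}).get("ServiceName", "")
    if "vpce-svc-" not in service_name:
        return None  # assumed to be an AWS service endpoint, which is allowed
    if service_name in approved_service_names:
        return None  # approved partner or internal service
    account_id = detail.get("userIdentity", {}).get("accountId", "unknown")
    return f"Account {account_id} created a VPC endpoint to unapproved service {service_name}"
```

Returning a message instead of publishing it directly keeps the check easy to test; the caller can decide how to deliver the notification to the cloud administrator.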

Deploy and test the solution

This section walks through how to deploy and test our recommended solution.

Prerequisites

To deploy the solution, first follow these steps.

In your AWS Organizations multi-account environment, go to the management account and enable trusted access for AWS CloudFormation, enable trusted access for AWS Config, and enable trusted access for CloudTrail.
Identify an account in your organization to serve as the audit account and set it up as a delegated administrator for CloudFormation, AWS Config, and CloudTrail. To do so:
1. Register a delegated administrator for CloudFormation.
2. Perform the steps mentioned in step 1 of this post to register a delegated administrator for AWS Config.
3. Register a delegated admin for CloudTrail.
The solution uses the deployment of CloudFormation StackSets with self-managed permissions to set up the resources in the audit account. In order to enable this, create AWSCloudFormationStackSetAdministrationRole in the management account and AWSCloudFormationStackSetExecutionRole in the audit account by using the steps in the topic Grant self-managed permissions.
In a separate AWS account that is not part of your multi-account environment, create two PrivateLink VPC endpoint services as explained in the documentation. You can use this template to create a test PrivateLink VPC endpoint service. These will serve as two partner services: one that is approved, and one that is untrusted and not allowed. Make note of their service names.

Figure 7: Simulated partner services (approved and not approved) in a separate test account

Deploying the solution

Go to the management account of your AWS Organizations multi-account environment and use this CloudFormation template to deploy the solution, or choose the following Launch Stack button:

CloudFormation stacks can be deployed using the AWS CloudFormation console or using the AWS CLI.
This initially displays the Create stack page. Leave the details entered by default, and then choose Next.

On the Specify stack details page, enter the details for the input parameters for this solution. The following table shows the details that you will provide when setting up the CloudFormation template on the Specify stack details page on the CloudFormation console.

AWSOrganizationsId	Identifier for your organization. This can be obtained from your management account as described in the AWS Organizations User Guide.
AdminRoleArn	Role of the persona who is allowed to modify PrivateLink endpoint permissions.
AllowedPrivateLinkAccounts	AWS account IDs of accounts in your OU that host PrivateLink services.
AllowedPrivateLinkPartnerServices	Specify the service name of the approved PrivateLink services from partners. If you want to test with a simulated partner PrivateLink, take the service name of PrivateLink services created in Step 4 of the prerequisites as the partner services to which connections should be allowed. The unique service name of the partner’s PrivateLink service is provided by the partner to the customer so that they can connect to it.
AuditAccountId	AWS account ID of the audit account in your multi-account environment.
PLOrganizationUnit	OU identifier for the organizational unit where the solution will perform preventative and detective control.

Figure 8: CloudFormation template input parameters for the solution as it appears on the console

Choose Next and keep the defaults for the rest of the fields. Then, on the Review and create page, choose Submit to finish deploying the solution.

Testing the solution

Once the solution is deployed successfully, follow these steps to test the solution:

For an account specified in the AllowedPrivateLinkAccounts parameter, create a VPC endpoint service as explained in the topic Create a service powered by AWS PrivateLink. Instead of creating this manually, use this CloudFormation template to create a test VPC endpoint service.
Sign in to a member account within the OU that you specified in the CloudFormation template.
From the member account, create a VPC endpoint connection to the internal PrivateLink service created in the account from Step 1. This connection will set up successfully because it is internal to the organization and therefore allowed by the SCP policy, and is not flagged to the cloud administrator as violating organization policy.
From the member account, create a VPC endpoint connection to an AWS service that supports PrivateLink, such as AWS Key Management Service (AWS KMS). This connection will set up successfully because connections to AWS services are allowed by the SCP policy, and it is not flagged to the cloud administrator as violating organization policy.
From the member account, create a VPC endpoint connection to the PrivateLink service created in Step 4 of the prerequisites that is an allowed partner service. This connection will set up successfully because the service is listed in the AllowedPrivateLinkPartnerServices parameter and is therefore allowed by the SCP policy, and it is not flagged to the cloud administrator as violating organization policy.
From the member account, create a VPC endpoint connection to the PrivateLink service created in Step 4 of the prerequisites and that is not an allowed partner service. This connection will fail because it is not allowed by the SCP policy.
From an account outside of your organization, create a VPC endpoint connection to the internal PrivateLink service created in Step 1. The connection setup is successful, but the cloud administrator will see the internal PrivateLink service as NOT COMPLIANT because the connection from external clients is considered to be not compliant with organization requirements in this solution. This information allows the cloud admin to quickly find the noncompliant resource and work with the PrivateLink service owner team to remediate the issue.
From the member account, create another VPC endpoint service without configuring the notification configuration, and leave the Acceptance required field unchecked. Navigate to the AWS Config console in the audit account and go to Aggregator->Rules. Check the evaluation of the rule starting with “OrgConfigRule-pl-governance-rule….” Once the evaluation is complete, it will indicate that this VPC endpoint service is NOT COMPLIANT, whereas the service created in Step 1 will show as COMPLIANT.
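
The evaluation in the last step can be pictured with a small sketch. It assumes that the rule treats an endpoint service as compliant only when acceptance is required and at least one connection notification is configured; the function name and its inputs are hypothetical, and the rule deployed by the solution may apply different or additional checks.

```python
def evaluate_endpoint_service(acceptance_required, notification_count):
    """Return an AWS Config style verdict for one endpoint service (assumed logic)."""
    # Assumption: both controls must be in place for the service to pass.
    if acceptance_required and notification_count > 0:
        return "COMPLIANT"
    return "NON_COMPLIANT"
```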

Considerations

The solution described here takes the approach of allowing all VPC endpoint connections from within a customer’s organization to the PrivateLink services in specified accounts and detecting and notifying all external ones. This can be modified based on your specific use cases and requirements.
The solution uses AWS Config rules that are applied to specific accounts of your organization, even though the solution is applied at an OU level. The AWS Config rules created in this solution are scoped to evaluate VPC endpoint services and should incur charges accordingly. Refer to the AWS Config pricing page to understand usage-based pricing for the service.
Other services, such as AWS Lambda and Amazon EventBridge, also incur usage-based charges. After testing, verify that these resources are deleted to prevent unnecessary charges.
SCPs only affect member accounts. They do not apply to the management account, so actions denied through an SCP will still be allowed in the management account.

Cleanup

You can delete the solution by following these steps to avoid unnecessary charges:

Delete the CloudFormation stack created as part of Step 4 of the prerequisites.
Delete the CloudFormation stack of the main solution deployed in the management account as part of the Deploying the solution section.
Delete the CloudFormation stack created as part of Step 1 of Testing the solution.

Summary

As customers adopt AWS PrivateLink throughout their environment, the mechanisms discussed in this post provide a way for administrators to govern and secure their PrivateLink services at scale. This approach can help you create a scalable solution where interconnections are aligned to the organization’s guidelines and security requirements. While this solution presents an approach to governance, customers can tailor this solution to their unique organizational requirements.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

How to use WhatsApp to send Amazon Cognito notification messages

2024-05-13 Nideesh K T

Post Syndicated from Nideesh K T original https://aws.amazon.com/blogs/security/how-to-use-whatsapp-to-send-amazon-cognito-notification-messages/

While traditional channels like email and SMS remain important, businesses are increasingly exploring alternative messaging services to reach their customers more effectively. In recent years, WhatsApp has emerged as a simple and effective way to engage with users. According to Statista, as of 2024, WhatsApp is the most popular mobile messenger app worldwide and reached over two billion monthly active users in January 2024.

Amazon Cognito lets you add user sign-up and authentication to your mobile and web applications. Among many other features, Cognito provides a custom SMS sender AWS Lambda trigger for using third-party providers to send notifications. In this post, we’ll be using WhatsApp as the third-party provider to send verification codes or multi-factor authentication (MFA) codes instead of SMS during Cognito user pool sign up.

Note: WhatsApp is a third-party service subject to additional terms and charges. Amazon Web Services (AWS) isn’t responsible for third-party services that you use to send messages with a custom SMS sender in Amazon Cognito.

Overview

By default, Amazon Cognito uses Amazon Simple Notification Service (Amazon SNS) for delivery of SMS text messages. Cognito also supports custom triggers that will allow you to invoke an AWS Lambda function to support additional providers such as WhatsApp.

The architecture shown in Figure 1 depicts how to use a custom SMS sender trigger and WhatsApp to send notifications. The steps are as follows:

A user signs up to an Amazon Cognito user pool.
Cognito invokes the custom SMS sender Lambda function and passes it the user’s attributes, including the phone number, along with a one-time code. This one-time code is encrypted using a custom symmetric encryption AWS Key Management Service (AWS KMS) key that you create.
The Lambda function decrypts the one-time code using a Decrypt API call to your AWS KMS key.
The Lambda function then obtains the WhatsApp access token from AWS Secrets Manager. The WhatsApp access token needs to be generated through Meta Business Settings (which are covered in the next section) and added to Secrets Manager. Lambda also parses the phone number, user attributes, and encrypted secrets.
Lambda sends a POST API call to the WhatsApp API and WhatsApp delivers the verification code to the user as a message. The user can then use the verification code to verify their contact information and confirm the sign-up.

Figure 1: Custom SMS sender trigger flow
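
The last step of the flow can be sketched as follows. This is an illustration only: it assumes the one-time code has already been decrypted, that the message uses an authentication template with a copy-code button, and that the request goes to the Graph API `messages` endpoint. The function name `build_otp_request`, the API version constant, and the language code are assumptions; the Lambda function in the sample repository may build its request differently.

```python
import json
import urllib.request

GRAPH_API_VERSION = "v19.0"  # assumption: use the version your Meta app targets


def build_otp_request(phone_number_id, authorization, to_number, code,
                      template_name="otp_message", language_code="en"):
    """Build (but do not send) the POST request that delivers the code."""
    url = f"https://graph.facebook.com/{GRAPH_API_VERSION}/{phone_number_id}/messages"
    payload = {
        "messaging_product": "whatsapp",
        "to": to_number,
        "type": "template",
        "template": {
            "name": template_name,
            "language": {"code": language_code},
            "components": [
                # The code fills the message body and the copy-code button.
                {"type": "body", "parameters": [{"type": "text", "text": code}]},
                {"type": "button", "sub_type": "url", "index": "0",
                 "parameters": [{"type": "text", "text": code}]},
            ],
        },
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        # `authorization` is the full "Bearer <token>" value from Secrets Manager.
        headers={"Authorization": authorization, "Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send the message; keeping construction separate from sending makes the payload easy to inspect in a test.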

Prerequisites

Create an AWS account if you don’t already have one and sign in. The AWS Identity and Access Management (IAM) role that you use must have sufficient permissions to make the necessary AWS service calls and manage AWS resources such as creating and updating Lambda functions, Amazon Cognito user pools, Secrets Manager, AWS KMS keys, and IAM roles.
A Meta (Facebook) developer account. For more details go to the Meta for Developers console.
Git installed.
AWS Cloud Development Kit (AWS CDK) Toolkit installed and configured.
Node.js with NPM installed.
Docker installed and running.

Implementation

In the next steps, we look at how to create a Meta app, create a new system user, get the WhatsApp access token, and create the template used to send the verification code.

Create and configure an app for WhatsApp communication

To get started, create a Meta app with WhatsApp added to it, along with the customer phone number that will be used to test.

To create and configure an app

Open the Meta for Developers console, choose My Apps and then choose Create App (or choose an existing Business type app and skip to step 4).
Select Other and choose Next. Then select Business as the app type and choose Next.
Enter an App name, App contact email, choose whether or not to attach a Business portfolio and choose Create app.
Open the app Dashboard and in the Add product to your app section, under WhatsApp, choose Set up.
Create or select an existing Meta business portfolio and choose Continue.
In the left navigation pane, under WhatsApp, choose API Setup.
Under Send and receive messages, take a note of the Phone number ID, which will be needed in the AWS CDK template later.
Under To, add the customer phone number you want to use for testing. Follow the instructions to add and verify the phone number.

Note: You must have WhatsApp registered with the number and the WhatsApp client installed on your mobile device.

Create a user for accessing WhatsApp

Create a system user in Meta’s Business Manager and assign it to the app created in the previous step. The access tokens generated for this user will be used to make the WhatsApp API calls.

To create a user

Open Meta’s Business Manager and select the business you created or associated your application with earlier from the dropdown menu under Business settings.
Under Users, select System users and then choose Add to create a new system user.
Enter a name for the System Username and set their role as Admin and choose Create system user.
Choose Assign assets.
From the Select asset type list, select Apps. Under Select assets, select your WhatsApp application’s name. Under Partial access, turn on the Test app option for the user. Choose Save Changes and then choose Done.
Choose Generate New Token, select the WhatsApp application created earlier, and leave the default 60 days as the token expiration. Under Permissions, select whatsapp_business_messaging and whatsapp_business_management, and then choose Generate Token at the bottom.
Copy and save your access token. You will need this for the AWS CDK template later. Choose OK. For more details on creating the access token, see WhatsApp’s Business Management API Get Started guide.

Create a template in WhatsApp

Create a template for the verification messages that will be sent by WhatsApp.

To create a template

Open Meta’s WhatsApp Manager.
On the left icon pane, under Account tools, choose Message template and then choose Create Template.
Select Authentication as the category.
For the Name, enter otp_message.
For Languages, enter English.
Choose Continue.
In the next screen, select Copy code and choose Submit.

Note: It’s possible that Meta might change the process or the UI. See the Meta documentation for specific details.

For more information on WhatsApp templates, see Create and Manage Templates.

Create a Secrets Manager secret

Use the Secrets Manager console to create a Secrets Manager secret and set the secret to the WhatsApp access token.

To create a secret

Open the AWS Management Console and go to Secrets Manager.

Figure 2: Open the Secrets Manager console
Choose Store a new secret.

Figure 3: Store a new secret
Under Choose a secret type, choose Other type of secret and under Key/value pairs, select the Plaintext tab and enter Bearer followed by the WhatsApp access token (Bearer <WhatsApp access token>).

Figure 4: Add the secret
For the encryption key, you can use either the AWS KMS key that Secrets Manager creates or a customer managed AWS KMS key that you create and then choose Next.
Enter WhatsAppAccessToken as the secret name, choose Next, and then choose Store to create the secret.
Note the secret Amazon Resource Name (ARN) to use in later steps.
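
For reference, reading this secret back at runtime could look like the following sketch. It relies on the Secrets Manager `get_secret_value` call returning the plaintext under `SecretString`; the helper name `load_authorization_header` and the format check are assumptions added for illustration.

```python
def load_authorization_header(secrets_client, secret_id):
    """Fetch the stored 'Bearer <token>' value from Secrets Manager."""
    response = secrets_client.get_secret_value(SecretId=secret_id)
    value = response["SecretString"].strip()
    # The secret was stored as plaintext in the form "Bearer <WhatsApp access token>".
    if not value.startswith("Bearer "):
        raise ValueError("secret must be stored as 'Bearer <WhatsApp access token>'")
    return value

# Example call (requires AWS credentials and the boto3 package):
#   import boto3
#   header = load_authorization_header(boto3.client("secretsmanager"), "WhatsAppAccessToken")
```

Taking the client as a parameter keeps the helper easy to exercise with a stub instead of a live AWS call.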

Deploy the solution

In this section, you clone the GitHub repository and deploy the stack to create the resources in your account.

To clone the repository

Create a new directory, navigate to that directory in a terminal and use the following command to clone the GitHub repository that has the Lambda and AWS CDK code:
```
git clone https://github.com/aws-samples/amazon-cognito-whatsapp-otp
```
Change directory to the pattern directory:
```
cd amazon-cognito-whatsapp-otp
```

To deploy the stack

Configure the phone number ID obtained from WhatsApp, the secret name, secret ARN, and the Amazon Cognito user pool self-service sign-up option in the constants.ts file.
Open the lib/constants.ts file and edit the fields. The SELF_SIGNUP value must be set to true for the purpose of this proof of concept. The SELF_SIGNUP value represents the Boolean value for the Amazon Cognito user pool sign-up option, which when set to true allows public users to sign up.
```
export const PHONE_NUMBER_ID = '<phone number ID>'; 
export const SECRET_NAME = '<WhatsAppAccessToken>'; 
export const SECRET_ARN = 'arn:aws:secretsmanager:<AWSRegion>:<AWS account ID>:secret:<WhatsAppAccessToken>';
export const SELF_SIGNUP = <true>;
```
Warning: If you activate user sign-up (enable self-registration) in your user pool, anyone on the internet can sign up for an account and sign in to your applications.
Install the AWS CDK required dependencies by running the following command:
```
npm install
```
This project uses TypeScript as the client language for AWS CDK. Run the following command to compile TypeScript to JavaScript:
```
npm run build
```
From the command line, configure AWS CDK (if you have not already done so):
```
cdk bootstrap <account number>/<AWS Region>
```
Install and run Docker. We’re using the aws-lambda-python-alpha package in the AWS CDK code to build the Lambda deployment package. The deployment package installs the required modules in a Lambda compatible Docker container.
Deploy the stack:
```
cdk synth
cdk deploy --all
```

Test the solution

Now that you’ve completed implementation, it’s time to test the solution by signing up a user on Amazon Cognito and confirming that the Lambda function is invoked and sends the verification code.

To test the solution

Open the AWS CloudFormation console.
Select the WhatsappOtpStack that was deployed through AWS CDK.
On the Outputs tab, copy the value of cognitocustomotpsenderclientappid.

Run the following AWS Command Line Interface (AWS CLI) command to sign up a new Amazon Cognito user. Replace the client ID with the value of cognitocustomotpsenderclientappid, and supply your own username, password, email address, name, phone number, and AWS Region.

aws cognito-idp sign-up --client-id <cognitocustomotpsenderclientappid> --username <TestUserPhoneNumber> --password <Password> --user-attributes Name="email",Value="<TestUserEmail>" Name="name",Value="<TestUserName>" Name="phone_number",Value="<TestPhoneNumber>" --region <AWS Region>

Example:

aws cognito-idp sign-up --client-id xxxxxxxxxxxxxx --username +12065550100 --password Test@654321 --user-attributes Name="email",Value="[email protected]" Name="name",Value="Jane" Name="phone_number",Value="+12065550100" --region us-east-1

Note: Password requirements are a minimum length of eight characters with at least one number, one lowercase letter, and one special character.
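
If you script the sign-up, a quick local check of these requirements can catch a rejected password before the CLI call. The following is a minimal sketch that mirrors only the requirements listed in the note. The function name is hypothetical, and treating any character that is neither alphanumeric nor whitespace as special is an assumption; Amazon Cognito defines its own set of special characters.

```python
def meets_password_policy(password, min_length=8):
    """Check a password against the requirements listed in the note above."""
    return (
        len(password) >= min_length
        and any(c.isdigit() for c in password)
        and any(c.islower() for c in password)
        # Assumption: "special" means neither alphanumeric nor whitespace.
        and any(not c.isalnum() and not c.isspace() for c in password)
    )
```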

The new user should receive a message on WhatsApp with a verification code that they can use to complete their sign-up.

Cleanup

Run the following command to delete the resources that were created. It might take a few minutes for the CloudFormation stack to be deleted.
```
cdk destroy --all
```
Delete the secret WhatsAppAccessToken that was created from the Secrets Manager console.

Conclusion

In this post, we showed you how to use an alternative messaging platform such as WhatsApp to send notification messages from Amazon Cognito. This functionality is enabled through the Amazon Cognito custom SMS sender trigger, which invokes a Lambda function that has the custom code to send messages through the WhatsApp API. You can use the same method to use other third-party providers to send messages.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the Amazon Cognito re:Post or contact AWS Support.

Want more AWS Security news? Follow us on X.

Running code after returning a response from an AWS Lambda function

2024-05-07 James Beswick

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/running-code-after-returning-a-response-from-an-aws-lambda-function/

This post is written by Uri Segev, Principal Serverless Specialist SA.

When you invoke an AWS Lambda function synchronously, you expect the function to return a response. For example, this is the case when a client invokes a Lambda function through Amazon API Gateway or from AWS Step Functions. As the client is waiting for the response, you should return the response as soon as possible.

However, there may be instances where you must perform additional work that does not affect the response and you can do it asynchronously, after you send the response. For example, you may store data in a database or send information to a logging system.

Once you send the response from the function, the Lambda service freezes the runtime environment, and the function cannot run additional code. Even if you create a thread for running a task in the background, the Lambda service freezes the runtime environment once the handler returns, causing the thread to freeze until the next invocation. While you can delay returning the response to the client until all work is complete, this approach can negatively impact the user experience.

This blog explores ways to run a task that may start before the function returns but continues running after the function returns the response to the client.

Invoking an asynchronous Lambda function

The first option is to break the code into two functions. The first function runs the synchronous code; the second function runs the asynchronous code. Before the synchronous function returns, it invokes the second function asynchronously, either directly, using the Invoke API, or indirectly, for example, by sending a message to Amazon SQS to trigger the second function.

This Python code demonstrates how to implement this:

import json
import time
import os
import boto3
from aws_lambda_powertools import Logger

logger = Logger()
client = boto3.client('lambda')

def calc_response(event):
    logger.info(f"[Function] Calculating response")
    time.sleep(1) # Simulate sync work
    return {
        "message": "hello from async"
    }

def submit_async_task(response):
    # Invoke async function to continue
    logger.info(f"[Function] Invoking async task in async function")
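    # InvocationType='Event' queues the request and returns without waiting
    # for the second function to finish (invoke_async is deprecated).
    client.invoke(
        FunctionName=os.getenv('ASYNC_FUNCTION'),
        InvocationType='Event',
        Payload=json.dumps(response).encode('utf-8')
    )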

def handler(event, context):
    logger.info(f"[Function] Received event: {json.dumps(event)}")

    response = calc_response(event)
    
    # Done calculating response, submit async task
    submit_async_task(response)

    # Return response to client
    logger.info(f"[Function] Returning response to client")
    return {
        "statusCode": 200,
        "body": json.dumps(response)
    }

The following is the Lambda function that performs the asynchronous work:

import json
import time
from aws_lambda_powertools import Logger

logger = Logger()

def handler(event, context):
    logger.info(f"[Async task] Starting async task: {json.dumps(event)}")
    time.sleep(3)  # Simulate async work
    logger.info(f"[Async task] Done")
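
Because the payload handed to the asynchronous function is size-limited (the summary table at the end of this post quotes 256 KB), it can help to check the serialized size before invoking. The following is a minimal sketch; the helper name `encode_async_payload` and the constant are assumptions, and the limit is a parameter so it can be adjusted to whatever the Lambda quotas page currently states.

```python
import json

MAX_ASYNC_PAYLOAD_BYTES = 256 * 1024  # limit quoted in this post's summary table


def encode_async_payload(payload, limit=MAX_ASYNC_PAYLOAD_BYTES):
    """Serialize the payload and fail early if it is too large to send."""
    data = json.dumps(payload).encode("utf-8")
    if len(data) > limit:
        raise ValueError(f"payload is {len(data)} bytes; limit is {limit}")
    return data
```

When the payload is too large, a common alternative is to store it elsewhere (for example, in Amazon S3) and pass only a reference to the second function.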

Use Lambda response streaming

Response streaming enables developers to start streaming the response as soon as they have the first byte of the response, without waiting for the entire response. You usually use response streaming when you must minimize the Time to First Byte (TTFB) or when you must send a response that is larger than 6 MB (the Lambda response payload size limit).

Using this method, the function can send the response using the response streaming mechanism and can continue running code even after sending the last byte of the response. This way, the client receives the response, and the Lambda function can continue running.

This Node.js code demonstrates how to implement this:

import { Logger } from '@aws-lambda-powertools/logger';

const logger = new Logger();

export const handler = awslambda.streamifyResponse(async (event, responseStream, _context) => {
    logger.info("[Function] Received event: ", event);
  
    // Do some stuff with event
    let response = await calc_response(event);
    
    // Return response to client
    logger.info("[Function] Returning response to client");
    responseStream.setContentType('application/json');
    // write() expects a string or Buffer, so serialize the object first
    responseStream.write(JSON.stringify(response));
    responseStream.end();

    await async_task(response);   
});

const calc_response = async (event) => {
    logger.info("[Function] Calculating response");
    await sleep(1);  // Simulate sync work

    return {
        message: "hello from streaming"
    };
};

const async_task = async (response) => {
    logger.info("[Async task] Starting async task");
    await sleep(3);  // Simulate async work
    logger.info("[Async task] Done");
};

const sleep = async (sec) => {
    return new Promise((resolve) => {
        setTimeout(resolve, sec * 1000);
    });
};

Use Lambda extensions

Lambda extensions can augment Lambda functions to integrate with your preferred monitoring, observability, security, and governance tools. You can also use an extension to run your own code in the background so that it continues running after your function returns the response to the client.

There are two types of Lambda extensions: external extensions and internal extensions. External extensions run as separate processes in the same execution environment. The Lambda function can communicate with the extension using files in the /tmp folder or using a local network, for example, via HTTP requests. You must package external extensions as a Lambda layer.

Internal extensions run as separate threads within the same process that runs the handler. The handler can communicate with the extension using any in-process mechanism, such as internal queues. This example shows an internal extension, which is a dedicated thread within the handler process.

When the Lambda service invokes a function, it also notifies all the extensions of the invocation. The Lambda service only freezes the execution environment when the Lambda function returns a response and all the extensions signal to the runtime that they are finished. With this approach, the function has the extension run the task independently from the function itself and the extension notifies the Lambda runtime when it is done processing the task. This way, the execution environment stays active until the task is done.

The following Python code example isolates the extension code into its own file and the handler imports and uses it to run the background task:

import json
import time
import async_processor as ap
from aws_lambda_powertools import Logger

logger = Logger()

def calc_response(event):
    logger.info(f"[Function] Calculating response")
    time.sleep(1) # Simulate sync work
    return {
        "message": "hello from extension"
    }

# This function runs after the handler code calls start_async_task
# and it can continue running after the function returns
def async_task(response):
    logger.info(f"[Async task] Starting async task: {json.dumps(response)}")
    time.sleep(3)  # Simulate async work
    logger.info(f"[Async task] Done")

def handler(event, context):
    logger.info(f"[Function] Received event: {json.dumps(event)}")

    # Calculate response
    response = calc_response(event)

    # Done calculating response
    # call async processor to continue
    logger.info(f"[Function] Invoking async task in extension")
    ap.start_async_task(async_task, response)

    # Return response to client
    logger.info(f"[Function] Returning response to client")
    return {
        "statusCode": 200,
        "body": json.dumps(response)
    }

The following Python code demonstrates how to implement the extension that runs the background task:

import os
import requests
import threading
import queue
from aws_lambda_powertools import Logger

logger = Logger()
LAMBDA_EXTENSION_NAME = "AsyncProcessor"

# An internal queue used by the handler to notify the extension that it can
# start processing the async task.
async_tasks_queue = queue.Queue()

def start_async_processor():
    # Register internal extension
    logger.debug(f"[{LAMBDA_EXTENSION_NAME}] Registering with Lambda service...")
    response = requests.post(
        url=f"http://{os.environ['AWS_LAMBDA_RUNTIME_API']}/2020-01-01/extension/register",
        json={'events': ['INVOKE']},
        headers={'Lambda-Extension-Name': LAMBDA_EXTENSION_NAME}
    )
    ext_id = response.headers['Lambda-Extension-Identifier']
    logger.debug(f"[{LAMBDA_EXTENSION_NAME}] Registered with ID: {ext_id}")

    def process_tasks():
        while True:
            # Call /next to get notified when there is a new invocation and let
            # Lambda know that we are done processing the previous task.

            logger.debug(f"[{LAMBDA_EXTENSION_NAME}] Waiting for invocation...")
            response = requests.get(
                url=f"http://{os.environ['AWS_LAMBDA_RUNTIME_API']}/2020-01-01/extension/event/next",
                headers={'Lambda-Extension-Identifier': ext_id},
                timeout=None
            )

            # Get next task from internal queue
            logger.debug(f"[{LAMBDA_EXTENSION_NAME}] Woke up, waiting for async task from handler")
            async_task, args = async_tasks_queue.get()
            
            if async_task is None:
                # No task to run this invocation
                logger.debug(f"[{LAMBDA_EXTENSION_NAME}] Received null task. Ignoring.")
            else:
                # Invoke task
                logger.debug(f"[{LAMBDA_EXTENSION_NAME}] Received async task from handler. Starting task.")
                async_task(args)
            
            logger.debug(f"[{LAMBDA_EXTENSION_NAME}] Finished processing task")

    # Start processing extension events in a separate thread
    threading.Thread(target=process_tasks, daemon=True, name='AsyncProcessor').start()

# Used by the function to indicate that there is work that needs to be 
# performed by the async task processor
def start_async_task(async_task=None, args=None):
    async_tasks_queue.put((async_task, args))

# Starts the async task processor
start_async_processor()
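
One consequence of the extension code above: after every INVOKE event the extension thread blocks on async_tasks_queue.get(), so the handler needs to put something on the queue on every invocation, including invocations that fail or have no background work. The following is a minimal sketch of one way to guarantee that. The decorator name `signals_extension` and the convention that the handler returns a (response, task, args) tuple are assumptions made for illustration and are not part of the sample above.

```python
import functools


def signals_extension(submit):
    """Wrap a handler so that `submit` is called exactly once per invocation."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(event, context):
            task, args = None, None
            try:
                # Hypothetical convention: the handler returns the response
                # plus the background task and its argument.
                response, task, args = handler(event, context)
                return response
            finally:
                # Runs on success and on error; (None, None) means "no task".
                submit(task, args)
        return wrapper
    return decorator
```

With the extension module shown above, the decorator would be applied as `@signals_extension(ap.start_async_task)`.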

Use a custom runtime

Lambda supports several runtimes out of the box: Python, Node.js, Java, .NET, and Ruby. Lambda also supports custom runtimes, which let you develop Lambda functions in any other programming language.

When you invoke a Lambda function that uses a custom runtime, the Lambda service invokes a process called ‘bootstrap’ that contains your custom code. The custom code needs to interact with the Lambda Runtime API. It calls the /next endpoint to obtain information about the next invocation. This API call is blocking and it waits until a request arrives. When the function is done processing the request, it must call the /response endpoint to send the response back to the client and then it must call the /next endpoint again to wait for the next invocation. Lambda freezes the execution environment after you call /next, until a request arrives.

Using this approach, you can run the asynchronous task after calling /response, which sends the response back to the client, and before calling /next, which indicates that processing is done.

The following Python code example isolates the custom runtime code into its own file and the function imports and uses it to interact with the runtime API:

import time
import json
import runtime_interface as rt
from aws_lambda_powertools import Logger

logger = Logger()

def calc_response(event):
    logger.info(f"[Function] Calculating response")
    time.sleep(1) # Simulate sync work
    return {
        "message": "hello from custom"
    }

def async_task(response):
    logger.info(f"[Async task] Starting async task: {json.dumps(response)}")
    time.sleep(3)  # Simulate async work
    logger.info(f"[Async task] Done")

def main():
    # You can add initialization code here

    # The following loop runs forever waiting for the next invocation
    # and sending the response back to the client
    while True:
        # Call /next to wait for next request (and indicate 
        # that we are done processing the previous request)

        requestId, event = rt.get_next()

        # The code from here to send_response() is the code
        # that usually goes inside the Lambda handler()

        logger.info(f"[Function] Received event: {json.dumps(event)}")

        # Calculate response
        response = calc_response(event)

        # Done calculating response, send response to client
        logger.info(f"[Function] Returning response to client")
        rt.send_response(requestId, {
            "statusCode": 200,
            "body": json.dumps(response)
        })

        logger.info(f"[Function] Invoking async task")
        async_task(response)

main()

This Python code demonstrates how to interact with the runtime API:

import requests
import os
from aws_lambda_powertools import Logger

logger = Logger()
run_time_endpoint = os.environ['AWS_LAMBDA_RUNTIME_API']

def get_next():
    logger.debug("[Custom runtime] Waiting for invocation...")
    request = requests.get(
        url=f"http://{run_time_endpoint}/2018-06-01/runtime/invocation/next",
        timeout=None
    )
    event = request.json()
    requestId = request.headers["Lambda-Runtime-Aws-Request-Id"]
    return requestId, event

def send_response(requestId, response):
    logger.debug("[Custom runtime] Sending response")
    requests.post(
        url=f"http://{run_time_endpoint}/2018-06-01/runtime/invocation/{requestId}/response",
        json = response,
        timeout=None
    )
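
The interface above covers only the success path. The Lambda runtime API also exposes an invocation error endpoint that a custom runtime can call when the handler code raises. The following is a minimal sketch rather than part of the original sample: `build_error_request` and `send_error` are hypothetical helper names, and the sketch uses only the standard library instead of `requests`.

```python
import json
import os
import urllib.request

API_VERSION = "2018-06-01"


def build_error_request(endpoint, request_id, exc):
    """Build (url, headers, body) for the invocation error endpoint."""
    url = f"http://{endpoint}/{API_VERSION}/runtime/invocation/{request_id}/error"
    headers = {"Content-Type": "application/json"}
    body = json.dumps(
        {"errorMessage": str(exc), "errorType": type(exc).__name__}
    ).encode("utf-8")
    return url, headers, body


def send_error(request_id, exc):
    """POST the error to the runtime API (only works inside Lambda)."""
    endpoint = os.environ["AWS_LAMBDA_RUNTIME_API"]
    url, headers, body = build_error_request(endpoint, request_id, exc)
    req = urllib.request.Request(url, data=body, headers=headers, method="POST")
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

In the main loop, one option is to wrap the call to `calc_response()` in a `try`/`except` block and call `send_error(requestId, exc)` instead of `send_response()` when an exception is caught.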

Conclusion

This blog shows four ways of combining synchronous and asynchronous tasks in a Lambda function, allowing you to run tasks that continue running after the function returns a response to the client. The following table summarizes the pros and cons of each solution:

	Asynchronous invocation	Response streaming	Lambda extensions	Custom runtime
Complexity	Easier to implement	Easiest to implement	The most complex solution to implement as it requires interacting with the extensions API and a dedicated thread	Medium as it interacts with the runtime API
Deployment	Need two artifacts: the synchronous function and the asynchronous function	A single deployment artifact that contains all code	A single deployment artifact that contains all code	A single deployment artifact, requires packaging all needed runtime files
Cost	Most expensive as it incurs additional invocation cost as well as the overall duration of both functions is higher than having it in one	Least expensive	Least expensive	Least expensive
Starting the async task	Before returning from handler	Anytime during the handler invocation	Anytime during the handler invocation	After returning the response to the client, unless you use a dedicated thread
Limitations	Payload sent to the asynchronous function cannot exceed 256 KB	Only supported with Node.js and custom runtimes. Requires Lambda Function URLs, cannot be used with API Gateway, always public	–	–
Additional benefits	Better decoupling between synchronous and asynchronous code	Ability to send response in stages. Supports payloads larger than 6 MB (at additional cost)	The asynchronous task runs in its own thread, which can reduce overall duration and cost	–
Retries in case of failure in async code	Managed by the Lambda service	Responsibility of the developer	Responsibility of the developer	Responsibility of the developer

Choosing the right approach depends on your use case. If you write your function in Node.js and you invoke it using Lambda Function URLs, use response streaming. This is the easiest way to implement, and it is the most cost effective.

If there is a chance of failure in the asynchronous task (for example, a database is not accessible), and you must ensure that the task completes, use the asynchronous Lambda invocation method. The Lambda service retries a failed asynchronous invocation (twice by default). If all retries fail, it can invoke a Lambda destination so you can take action.

If you need a custom runtime because you need to use a programming language that Lambda does not natively support, use the custom runtime option. Otherwise, use the Lambda extensions option. It is more complex to implement, but it is cost effective. This allows you to package the code in a single artifact and start processing the asynchronous task before you send the response to the client.

For more serverless learning resources, visit Serverless Land.

Using Amazon Verified Permissions to manage authorization for AWS IoT smart home applications

2024-04-23 Rajat Mathur

Post Syndicated from Rajat Mathur original https://aws.amazon.com/blogs/security/using-amazon-verified-permissions-to-manage-authorization-for-aws-iot-smart-thermostat-applications/

This blog post introduces how manufacturers and smart appliance consumers can use Amazon Verified Permissions to centrally manage permissions and fine-grained authorizations. Developers can offer more intuitive, user-friendly experiences by designing interfaces that align with user personas and multi-tenancy authorization strategies, which can lead to higher user satisfaction and adoption. Traditionally, implementing authorization logic using role based access control (RBAC) or attribute based access control (ABAC) within IoT applications can become complex as the number of connected devices and associated user roles grows. This often leads to an unmanageable increase in access rules that must be hard-coded into each application, requiring excessive compute power for evaluation. By using Verified Permissions, you can externalize the authorization logic using Cedar policy language, enabling you to define fine-grained permissions that combine RBAC and ABAC models. This decouples permissions from your application’s business logic, providing a centralized and scalable way to manage authorization while reducing development effort.

In this post, we walk you through a reference architecture that outlines an end-to-end smart thermostat application solution using AWS IoT Core, Verified Permissions, and other AWS services. We show you how to use Verified Permissions to build an authorization solution using Cedar policy language to define dynamic policy-based access controls for different user personas. The post includes a link to a GitHub repository that houses the code for the web dashboard and the Verified Permissions logic to control access to the solution APIs.

Solution overview

This solution consists of a smart thermostat IoT device and an AWS hosted web application using Verified Permissions for fine-grained access to various application APIs. For this use case, the AWS IoT Core device is being simulated by an AWS Cloud9 environment and communicates with the IoT service using AWS IoT Device SDK for Python. After being configured, the device connects to AWS IoT Core to receive commands and send messages to various MQTT topics.

As a general practice, when a user-facing IoT solution is implemented, the manufacturer performs administrative tasks such as:

Embedding AWS Private Certificate Authority certificates into each IoT device (in this case a smart thermostat). Usually this is done on the assembly line and the certificates used to verify the IoT endpoints are burned into device memory along with the firmware.
Creating an Amazon Cognito user pool that provides sign-up and sign-in options for web and mobile application users and hosts the authentication process.
Creating policy stores and policy templates in Verified Permissions. Based on who signs up, the manufacturer creates policies with Verified Permissions to link each signed-up user to certain allowed resources or IoT devices.
The mapping of user to device is stored in a datastore. For this solution, you’ll use an Amazon DynamoDB table to record the relationship.

The user who purchases the device (the primary device owner) performs the following tasks:

Signs up on the manufacturer’s web application or mobile app and registers the IoT device by entering a unique serial number. The mapping between user details and the device serial number is stored in the datastore through an automated process that is initiated after sign-up and device claim.
Connects the new device to an existing wireless network, which initiates a registration process to securely connect to AWS IoT Core services within the manufacturer’s account.
Invites other users (such as guests, family members, or the power company) through a referral, invitation link, or a designated OAuth process.
Assigns roles, and therefore permissions, to the other users.

Figure 1: Sample smart home application architecture built using AWS services

Figure 1 depicts the solution as three logical components:

The first component depicts device operations through AWS IoT Core. The smart thermostat is on site and it communicates with AWS IoT Core and its state is managed through the AWS IoT Device Shadow Service.
The second component depicts the web application, which is the application interface that customers use. It’s a ReactJS-backed single page application deployed using AWS Amplify.
The third component shows the backend application, which is built using Amazon API Gateway, AWS Lambda, and DynamoDB. A Cognito user pool is used to manage application users and their authentication. Authorization is handled by Verified Permissions where you create and manage policies that are evaluated when the web application calls backend APIs. Each authorization request is evaluated against these policies to provide an access decision to deny or allow an action.

The solution flow itself can be broken down into three steps after the device is onboarded and users have signed up:

The smart thermostat device connects and communicates with AWS IoT Core using the MQTT protocol. A classic Device Shadow is created for the AWS IoT thing Thermostat1 when the UpdateThingShadow call is made the first time through the AWS SDK for a new device. AWS IoT Device Shadow service lets the web application query and update the device’s state in case of connectivity issues.
Users sign up or sign in to the Amplify hosted smart home application and authenticate themselves against a Cognito user pool. They’re mapped to a device, which is stored in a DynamoDB table.
After the users sign in, they’re allowed to perform certain tasks and view certain sections of the dashboard based on the different roles and policies managed by Verified Permissions. The underlying Lambda function that’s responsible for handling the API calls queries the DynamoDB table to provide user context to Verified Permissions.

Prerequisites

To deploy this solution, you need access to the AWS Management Console and AWS Command Line Interface (AWS CLI) on your local machine with sufficient permissions to access required services, including Amplify, Verified Permissions, and AWS IoT Core. For this solution, you’ll give the services full access to interact with different underlying services. But in production, we recommend following security best practices with AWS Identity and Access Management (IAM), which involves scoping down policies.
Set up Amplify CLI by following these instructions. We recommend the latest NodeJS stable long-term support (LTS) version. At the time of publishing this post, the LTS version was v20.11.1. Users can manage multiple NodeJS versions on their machines by using a tool such as Node Version Manager (nvm).

Walkthrough

The following table describes the actions, resources, and authorization decisions that will be enforced through Verified Permissions policies to achieve fine-grained access control. In this example, John is the primary device owner and has purchased and provisioned a new smart thermostat device called Thermostat1. He has invited Jane to access his device and has given her restricted permissions. John has full control over the device whereas Jane is only allowed to read the temperature and set the temperature between 72°F and 78°F.

John has also decided to give his local energy provider (Power Company) access to the device so that they can set the optimum temperature during the day to manage grid load and offer him maximum savings on his energy bill. However, they can only do so between 2:00 PM and 5:00 PM.

For security purposes, the Verified Permissions default decision is DENY for unauthorized principals.

Name	Principal	Action	Resource	Authorization decision
Any	Default	Default	Default	Deny
John	john_doe	Any	Thermostat1	Allow
Jane	jane_doe	GetTemperature	Thermostat1	Allow
Jane	jane_doe	SetTemperature	Thermostat1	Allow only if desired temperature is between 72°F and 78°F.
Power Company	powercompany	GetTemperature	Thermostat1	Allow only if accessed between the hours of 2:00 PM and 5:00 PM
Power Company	powercompany	SetTemperature	Thermostat1	Allow only if the temperature is set between the hours of 2:00 PM and 5:00 PM

Create a Verified Permissions policy store

Verified Permissions is a scalable permissions management and fine-grained authorization service for the applications that you build. The policies are created using Cedar, a dedicated language for defining access permissions in applications. Cedar seamlessly integrates with popular authorization models such as RBAC and ABAC.

A policy is a statement that either permits or forbids a principal to take one or more actions on a resource. A policy store is a logical container that stores your Cedar policies, schema, and principal sources. A schema helps you to validate your policy and identify errors based on the definitions you specify. See Cedar schema to learn about the structure and formal grammar of a Cedar schema.

To create the policy store

Sign in to the Amazon Verified Permissions console and choose Create policy store.
In the Configuration Method section, select Empty Policy Store and choose Create policy store.

Figure 2: Create an empty policy store

Note: Make a note of the policy store ID to use when you deploy the solution.

To create a schema for the application

On the Verified Permissions page, select Schema.
In the Schema section, choose Create schema.

Figure 3: Create a schema

In the Edit schema section, choose JSON mode, paste the following sample schema for your application, and choose Save changes.

{
    "AwsIotAvpWebApp": {
        "entityTypes": {
            "Device": {
                "shape": {
                    "attributes": {
                        "primaryOwner": {
                            "name": "User",
                            "required": true,
                            "type": "Entity"
                        }
                    },
                    "type": "Record"
                },
                "memberOfTypes": []
            },
            "User": {}
        },
        "actions": {
            "GetTemperature": {
                "appliesTo": {
                    "context": {
                        "attributes": {
                            "desiredTemperature": {
                                "type": "Long"
                            },
                            "time": {
                                "type": "Long"
                            }
                        },
                        "type": "Record"
                    },
                    "resourceTypes": [
                        "Device"
                    ],
                    "principalTypes": [
                        "User"
                    ]
                }
            },
            "SetTemperature": {
                "appliesTo": {
                    "resourceTypes": [
                        "Device"
                    ],
                    "principalTypes": [
                        "User"
                    ],
                    "context": {
                        "attributes": {
                            "desiredTemperature": {
                                "type": "Long"
                            },
                            "time": {
                                "type": "Long"
                            }
                        },
                        "type": "Record"
                    }
                }
            }
        }
    }
}
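
The schema can also be uploaded programmatically instead of through the console. The sketch below assumes that boto3 is installed and that credentials are configured. `build_schema_definition` and `upload_schema` are hypothetical helper names, and the trimmed-down schema is for illustration only.

```python
import json


def build_schema_definition(schema: dict) -> dict:
    """Wrap a Cedar JSON schema in the structure the PutSchema API expects."""
    return {"cedarJson": json.dumps(schema)}


def upload_schema(policy_store_id: str, schema: dict):
    """Upload the schema to a policy store (requires boto3 and credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    client = boto3.client("verifiedpermissions")
    return client.put_schema(
        policyStoreId=policy_store_id,
        definition=build_schema_definition(schema),
    )


# Trimmed-down version of the schema above, for illustration only
schema = {
    "AwsIotAvpWebApp": {
        "entityTypes": {"User": {}, "Device": {"memberOfTypes": []}},
        "actions": {
            "GetTemperature": {
                "appliesTo": {
                    "resourceTypes": ["Device"],
                    "principalTypes": ["User"],
                }
            }
        },
    }
}
definition = build_schema_definition(schema)
```

Calling `upload_schema("<policy store ID>", schema)` with the full schema would replace the schema of that policy store.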

When creating policies in Cedar, you can define authorization rules using a static policy or a template-linked policy.

Static policies

In scenarios where a policy explicitly defines both the principal and the resource, the policy is categorized as a static policy. These policies are immediately applicable for authorization decisions, as they are fully defined and ready for implementation.

Template-linked policies

On the other hand, there are situations where a single set of authorization rules needs to be applied across a variety of principals and resources. Consider an IoT application where actions such as SetTemperature and GetTemperature must be permitted for specific devices. Using static policies for each unique combination of principal and resource can lead to an excessive number of almost identical policies, differing only in their principal and resource components. This redundancy can be efficiently addressed with policy templates. Policy templates allow for the creation of policies using placeholders for the principal, the resource, or both. After a policy template is established, individual policies can be generated by referencing this template and specifying the desired principal and resource. These template-linked policies function the same as static policies, offering a streamlined and scalable solution for policy management.

To create a policy that allows access to the primary owner of the device using a static policy

In the Verified Permissions console, on the left pane, select Policies, then choose Create policy and select Create static policy from the drop-down menu.

Figure 4: Create static policy
Define the policy scope:
1. Select Permit for the Policy effect.
  
  Figure 5: Define policy effect
2. Select All Principals for Principals scope.
3. Select All Resources for Resource scope.
4. Select All Actions for Actions scope and choose Next.
  
  Figure 6: Define policy scope
On the Details page, under Policy, paste the following full-access policy, which grants the primary owner permission to perform both SetTemperature and GetTemperature actions on the smart thermostat unconditionally. Choose Create policy.
```
	permit (principal, action, resource)
	when { resource.primaryOwner == principal };
```
Figure 7: Write and review policy statement

To create a static policy to allow a guest user to read the temperature

In this example, the guest user is Jane (username: jane_doe).

Create another static policy and specify the policy scope.
1. Select Permit for the Policy effect.
  
  Figure 8: Define the policy effect
2. Select Specific principal for the Principals scope.
3. Select AwsIotAvpWebApp::User and enter jane_doe.
  
  Figure 9: Define the policy scope
4. Select Specific resource for the Resources scope.
5. Select AwsIotAvpWebApp::Device and enter Thermostat1.
6. Select Specific set of actions for the Actions scope.
7. Select GetTemperature and choose Next.
  
  Figure 10: Define resource and action scopes
8. Enter the Policy description: Allow jane_doe to read thermostat1.
9. Choose Create policy.

Next, you will create reusable policy templates to manage policies efficiently. The first template is for a guest user and limits the temperature they can set to between 72°F and 78°F. In this case, the guest user is Jane (username: jane_doe).

To create a reusable policy template

Select Policy template and enter Guest user template as the description.

Paste the following sample policy in the Policy body and choose Create policy template.

permit (
    principal == ?principal,
    action in [AwsIotAvpWebApp::Action::"SetTemperature"],
    resource == ?resource
)
when { context.desiredTemperature >= 72 && context.desiredTemperature <= 78 };

Figure 11: Create guest user policy template

As you can see, you don’t specify the principal and resource yet. You enter those when you create an actual policy from the policy template. The context object will be populated with the desiredTemperature property in the application and used to evaluate the decision.
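
To illustrate how an application might populate the context, the following sketch assembles an IsAuthorized request with boto3. It is not taken from the solution repository: `build_authz_request` and `is_allowed` are hypothetical helper names, and passing the device's `primaryOwner` attribute through `entities` is one possible way to supply entity data.

```python
NAMESPACE = "AwsIotAvpWebApp"


def build_authz_request(policy_store_id, user, action, device, owner,
                        desired_temperature, minutes_since_midnight):
    """Assemble keyword arguments for an IsAuthorized call."""
    device_id = {"entityType": f"{NAMESPACE}::Device", "entityId": device}
    owner_id = {"entityType": f"{NAMESPACE}::User", "entityId": owner}
    return {
        "policyStoreId": policy_store_id,
        "principal": {"entityType": f"{NAMESPACE}::User", "entityId": user},
        "action": {"actionType": f"{NAMESPACE}::Action", "actionId": action},
        "resource": device_id,
        # Context attributes match the names declared in the schema
        "context": {"contextMap": {
            "desiredTemperature": {"long": desired_temperature},
            "time": {"long": minutes_since_midnight},
        }},
        # Entity data lets policies read resource.primaryOwner
        "entities": {"entityList": [{
            "identifier": device_id,
            "attributes": {"primaryOwner": {"entityIdentifier": owner_id}},
            "parents": [],
        }]},
    }


def is_allowed(request):
    """Call Verified Permissions (requires boto3 and credentials)."""
    import boto3

    client = boto3.client("verifiedpermissions")
    return client.is_authorized(**request)["decision"] == "ALLOW"


request = build_authz_request(
    "<policy store ID>", "jane_doe", "SetTemperature",
    "Thermostat1", "john_doe", 75, 930,
)
```

With the guest policy in place, a request like this one (75°F) would be expected to return ALLOW, while the same request with 80 would be expected to return DENY.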

You also need to create a policy template for the Power Company user with restricted time settings. Cedar policies don’t support date/time format, so you must represent 2:00 PM and 5:00 PM as elapsed minutes from midnight.
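
The conversion to elapsed minutes is simple arithmetic: 2:00 PM is 14 × 60 = 840 and 5:00 PM is 17 × 60 = 1020. The following sketch shows one way the application could compute the value; the helper names are hypothetical, and treating the end of the window as exclusive is an assumption.

```python
from datetime import time


def minutes_since_midnight(t: time) -> int:
    """Convert a time of day to elapsed minutes from midnight."""
    return t.hour * 60 + t.minute


window_start = minutes_since_midnight(time(14, 0))  # 2:00 PM
window_end = minutes_since_midnight(time(17, 0))    # 5:00 PM


def in_power_company_window(t: time) -> bool:
    """True when t falls in the 2:00 PM to 5:00 PM window (end exclusive)."""
    return window_start <= minutes_since_midnight(t) < window_end
```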

To create a policy template for the power company

Select Policy template and enter Power company user template as the description.

Paste the following sample policy in the Policy body and choose Create policy template.

permit (
    principal == ?principal,
    action in [AwsIotAvpWebApp::Action::"SetTemperature", AwsIotAvpWebApp::Action::"GetTemperature"],
    resource == ?resource
)
when { context.time >= 840 && context.time < 1020 };

The policy templates accept the user and resource. The next step is to create a template-linked policy for Jane to set and get thermostat readings based on the Guest user template that you created earlier. For simplicity, you will manually create this policy using the Verified Permissions console. In production, application policies can be dynamically created using the Verified Permissions API.
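
As a rough illustration of the API route, the following sketch builds a CreatePolicy request for a template-linked policy with boto3. The helper names are hypothetical, and the policy store ID and template ID are placeholders that you would take from your own policy store.

```python
def build_template_linked_policy(policy_store_id, template_id, user, device,
                                 namespace="AwsIotAvpWebApp"):
    """Assemble keyword arguments for a template-linked CreatePolicy call."""
    return {
        "policyStoreId": policy_store_id,
        "definition": {"templateLinked": {
            "policyTemplateId": template_id,
            "principal": {"entityType": f"{namespace}::User", "entityId": user},
            "resource": {"entityType": f"{namespace}::Device", "entityId": device},
        }},
    }


def create_template_linked_policy(request):
    """Create the policy (requires boto3 and credentials); returns its ID."""
    import boto3

    client = boto3.client("verifiedpermissions")
    return client.create_policy(**request)["policyId"]


request = build_template_linked_policy(
    "<policy store ID>", "<guest template ID>", "jane_doe", "Thermostat1"
)
```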

To create a template-linked policy for a guest user

In the Verified Permissions console, on the left pane, select Policies, then choose Create policy and select Create template-linked policy from the drop-down menu.

Figure 12: Create new template-linked policy
Select the Guest user template and choose Next.

Figure 13: Select Guest user template
Under parameter selection:
1. For Principal enter AwsIotAvpWebApp::User::"jane_doe".
2. For Resource enter AwsIotAvpWebApp::Device::"Thermostat1".
3. Choose Create template-linked policy.
  
  Figure 14: Create guest user template-linked policy

Note that with this policy in place, jane_doe can only set the temperature of the device Thermostat1 to between 72°F and 78°F.

To create a template-linked policy for the power company user

Based on the template that was set up for power company, you now need an actual policy for it.

In the Verified Permissions console, go to the left pane and select Policies, then choose Create policy and select Create template-linked policy from the drop-down menu.
Select the Power company user template and choose Next.
Under Parameter selection, for Principal enter AwsIotAvpWebApp::User::"powercompany", for Resource enter AwsIotAvpWebApp::Device::"Thermostat1", and choose Create template-linked policy.

Now that you have a set of policies in a policy store, you need to update the backend codebase to include this information and then deploy the web application using Amplify.

The policy statements in this post intentionally use human-readable values such as jane_doe and powercompany for the principal entity. This is useful when discussing general concepts but in production systems, customers should use unique and immutable values for entities. See Get the best out of Amazon Verified Permissions by using fine-grained authorization methods for more information.

Deploy the solution code from GitHub

Go to the GitHub repository to set up the Amplify web application. The repository Readme file provides detailed instructions on how to set up the web application. You will need your Verified Permissions policy store ID to deploy the application. For convenience, we’ve provided an onboarding script—deploy.sh—which you can use to deploy the application.

To deploy the application

Clone the repository.

git clone https://github.com/aws-samples/amazon-verified-permissions-iot-amplify-smart-home-application.git

Deploy the application.

./deploy.sh <region> <Verified Permissions Policy Store ID>

After the web dashboard has been deployed, you’ll create an IoT device using AWS IoT Core.

Create an IoT device and connect it to AWS IoT Core

With the users, policies, and templates, and the Amplify smart home application in place, you can now create a device and connect it to AWS IoT Core to complete the solution.

To create the Thermostat1 device and connect it to AWS IoT Core

From the left pane in the AWS IoT console, select Connect one device.

Figure 15: Connect device using AWS IoT console
Review how IoT Thing works and then choose Next.

Figure 16: Review how IoT Thing works before proceeding
Choose Create a new thing, enter Thermostat1 as the Thing name, and choose Next.

Figure 17: Create the new IoT thing
Select Linux/macOS as the Device platform operating system and Python as the AWS IoT Core Device SDK and choose Next.

Figure 18: Choose the platform and SDK for the device
Choose Download connection kit and choose Next.

Figure 19: Download the connection kit to use for creating the Thermostat1 device
Review the three steps to display messages from your IoT device. You will use them to verify the Thermostat1 IoT device connectivity to the AWS IoT Core platform. They are:
1. Step 1: Add execution permissions
2. Step 2: Run the start script
3. Step 3: Return to the AWS IoT Console to view the device’s message
  
  Figure 20: How to display messages from an IoT device

Solution validation

With all of the pieces in place, you can now test the solution.

Primary owner signs in to the web application to set Thermostat1 temperature to 82°F

Figure 21: Thermostat1 temperature update by John

Sign in to the Amplify web application as John. You should be able to view the Thermostat1 controller on the dashboard.
Set the temperature to 82°F.
The Lambda function processes the request and performs an API call to Verified Permissions to determine whether to ALLOW or DENY the action based on the policies. Verified Permissions sends back an ALLOW, as the policy that was previously set up allows unrestricted access for primary owners.
Upon receiving the response from Verified Permissions, the Lambda function sends ALLOW permission back to the web application and an API call to the AWS IoT Device Shadow service to update the device (Thermostat1) temperature to 82°F.
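
The shadow update in the last step can be sketched with the boto3 `iot-data` client. This is an illustration rather than the repository's Lambda code: the helper names are hypothetical, and the `temperature` field name inside the shadow document is an assumption.

```python
import json


def build_shadow_update(desired_temperature: int) -> bytes:
    """Build a shadow document that requests a new desired temperature."""
    # "temperature" is an assumed field name, not confirmed by the post
    document = {"state": {"desired": {"temperature": desired_temperature}}}
    return json.dumps(document).encode("utf-8")


def update_thermostat(thing_name: str, desired_temperature: int):
    """Update the classic shadow (requires boto3 and credentials)."""
    import boto3

    client = boto3.client("iot-data")
    return client.update_thing_shadow(
        thingName=thing_name,
        payload=build_shadow_update(desired_temperature),
    )


payload = build_shadow_update(82)
```

In a handler, the call to `update_thermostat("Thermostat1", 82)` would run only after Verified Permissions returns ALLOW.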

Figure 22: Policy evaluation decision is ALLOW when a primary owner calls SetTemperature

Guest user signs in to the web application to set Thermostat1 temperature to 80°F

Figure 23: Thermostat1 temperature update by Jane

If you sign in as Jane to the Amplify web application, you can view the Thermostat1 controller on the dashboard.
Set the temperature to 80°F.
The Lambda function validates the actions by sending an API call to Verified Permissions to determine whether to ALLOW or DENY the action based on the established policies. Verified Permissions sends back a DENY, as the policy only permits temperature adjustments between 72°F and 78°F.
Upon receiving the response from Verified Permissions, the Lambda function sends DENY permissions back to the web application and an unauthorized response is returned.

Figure 24: Guest user jane_doe receives a DENY when calling SetTemperature for a desired temperature of 80°F
If you repeat the process (still as Jane) but set Thermostat1 to 75°F, the policy will cause the request to be allowed.

Figure 25: Guest user jane_doe receives an ALLOW when calling SetTemperature for a desired temperature of 75°F
Similarly, jane_doe is allowed to run GetTemperature on the device Thermostat1. When the temperature is set to 74°F, the device shadow is updated. The IoT device being simulated by your AWS Cloud9 instance reads the desired temperature field and sets the reported value to 74.
Now, when jane_doe runs GetTemperature, the value of the device is reported as 74 as shown in Figure 26. We encourage you to try different restrictions in the World Settings (outside temperature and time) by adding restrictions to the static policy that allows GetTemperature for guest user.

Figure 26: Guest user jane_doe receives an ALLOW when calling GetTemperature for the reported temperature

Power company signs in to the web application to set Thermostat1 to 78°F at 3:30 PM

Figure 27: Thermostat1 temperature set to 78°F by powercompany user at a specified time

Sign in as the powercompany user to the Amplify web application using an API. You can view the Thermostat1 controller on the dashboard.
To test this scenario, set the current time to 3:30 PM, and try to set the temperature to 78°F.
The Lambda function validates the actions by sending an API call to Verified Permissions to determine whether to ALLOW or DENY the action based on pre-established policies. Verified Permissions returns ALLOW permission, because the policy for powercompany permits device temperature changes between 2:00 PM and 5:00 PM.
Upon receiving the response from Verified Permissions, the Lambda function sends ALLOW permission back to the web application and an API call to the AWS IoT Device Shadow service to update the Thermostat1 temperature to 78°F.

Figure 28: powercompany receives an ALLOW when SetTemperature is called with the desired temperature of 78°F

Note: As an optional exercise, we also made jane_doe a device owner for device Thermostat2. This can be observed in the users.json file in the GitHub repository. We encourage you to create your own policies and restrict functions for Thermostat2 after going through this post. You will need to create separate Verified Permissions policies and update the Lambda functions to interact with these policies.

We encourage you to create policies for guests and the power company and restrict permissions based on the following criteria:

Verify Jane Doe can perform GetTemperature and SetTemperature actions on Thermostat2.
John Doe should not be able to set the temperature on device Thermostat2 outside the time range of 4:00 PM to 6:00 PM or outside the temperature range of 68°F to 72°F.
Power Company can only perform the GetTemperature operation, but there are no restrictions on time and outside temperature.

To help you verify the solution, we’ve provided the correct policies under the challenge directory in the GitHub repository.

Clean up

Deploying the Thermostat application in your AWS account will incur costs. To avoid ongoing charges, when you’re done examining the solution, delete the resources that were created. This includes the Amplify hosted web application, API Gateway resource, AWS Cloud9 environment, the Lambda function, DynamoDB table, Cognito user pool, AWS IoT Core resources, and Verified Permissions policy store.

Amplify resources can be deleted by going to the AWS CloudFormation console and deleting the stacks that were used to provision various services.

Conclusion

In this post, you learned about creating and managing fine-grained permissions using Verified Permissions for different user personas for your smart thermostat IoT device. With Verified Permissions, you can strengthen your security posture and build smart applications aligned with Zero Trust principles for real-time authorization decisions. To learn more, we recommend:

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Serverless IoT email capture, attachment processing, and distribution

2024-04-18 Stacy Conant

Post Syndicated from Stacy Conant original https://aws.amazon.com/blogs/messaging-and-targeting/serverless-iot-email-capture-attachment-processing-and-distribution/

Many customers need to automate email notifications to a broad and diverse set of email recipients, sometimes from a sensor network with a variety of monitoring capabilities. Many sensor monitoring software products include an SMTP client to achieve this goal. However, managing email server infrastructure requires specialty expertise and operating an email server comes with additional cost and inherent risk of breach, spam, and storage management. Organizations also need to manage distribution of attachments, which could be large and potentially contain exploits or viruses. For IoT use cases, diagnostic data relevance quickly expires, necessitating retention policies to regularly delete content.

Solution Overview

This solution uses the Amazon Simple Email Service (SES) SMTP interface to receive SMTP client messages, and processes each message to replace an attachment with a pre-signed URL in the resulting email to its intended recipients. Attachments are stored separately in an Amazon Simple Storage Service (S3) bucket with a lifecycle policy implemented. This reduces the storage requirements of the recipient email servers that receive notification emails. Additionally, this solution leverages built-in anti-spam and security scanning capabilities to deal with spam and potentially malicious attachments, while also providing a mechanism by which pre-signed attachment links can be revoked should the emails be distributed to unintended recipients.

The solution uses:

Amazon SES SMTP interface to receive incoming emails.
Amazon SES receipt rule on a (sub)domain controlled by administrators, to store raw incoming emails in an Amazon S3 bucket.
AWS Lambda function, triggered on S3 ObjectCreated event, to process raw emails, extract attachments, replace each with pre-signed URL with configurable expiry, and send the processed emails to intended recipients.

Solution Flow Details:

The SMTP client transmits email content to an email address in a (sub)domain whose MX record is set to the Amazon SES regional endpoint.
Amazon SES SMTP interface receives an email and forwards it to SES Receipt Rule(s) for processing.
A matching Amazon SES Receipt Rule saves incoming email into an Amazon S3 Bucket.
Amazon S3 Bucket emits an S3 ObjectCreated Event, and places the event onto the Amazon Simple Queue Service (SQS) queue.
The AWS Lambda service polls the inbound messages’ SQS queue and feeds events to the Lambda function.
The Lambda function retrieves email files from the S3 bucket, parses the email sender/subject/body, saves attachments to a separate attachment S3 bucket (7), and replaces attachments with pre-signed URLs in the email body. The Lambda function then extracts the intended recipient addresses from the email body. If the body contains a properly formatted recipients list, the email is sent using the SES API (9); otherwise, a notice is posted to a fallback Amazon Simple Notification Service (SNS) Topic (8).
The Lambda function saves extracted attachments, if any, into an attachments bucket.
Malformed email notifications are posted to a fallback Amazon SNS Topic.
The Lambda function invokes Amazon SES API to send the processed email to all intended recipient addresses.
If the Lambda function is unable to process the email successfully, the inbound message is placed on the SQS dead-letter queue (DLQ) for later intervention by the operator.
SES delivers an email to each recipient’s mail server.
Intended recipients download emails from their corporate mail servers and retrieve attachments from the S3 pre-signed URL(s) embedded in the email body.
An alarm is triggered and a notification is published to Amazon SNS Alarms Topic whenever:
- More than 50 failed messages are in the DLQ.
- Oldest message on incoming SQS queue is older than 3 minutes – unable to keep up with inbound messages (flooding).
- The incoming SQS queue contains over 180 messages (configurable) over 5 minutes old.
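
Step 6 of the flow above can be sketched with Python's standard `email` package. This is a minimal illustration under stated assumptions, not the solution's actual Lambda code: the function names are hypothetical, the `upload` callback stands in for saving the attachment to S3 and returning a pre-signed URL, and the recipient marker format is borrowed from the test message used later in this post.

```python
import re
from email import message_from_bytes, policy
from email.message import EmailMessage

# Marker format borrowed from the test message in this post (an assumption):
# Email_Rxers_Code:[a@x,b@y]:Email_Rxers_Code:
RECIPIENTS_RE = re.compile(r"Email_Rxers_Code:\[([^\]]*)\]:Email_Rxers_Code:")


def extract_recipients(body_text):
    """Return the recipient addresses listed in the body, or [] if absent."""
    match = RECIPIENTS_RE.search(body_text)
    if not match:
        return []
    return [addr.strip() for addr in match.group(1).split(",") if addr.strip()]


def process_raw_email(raw_bytes, upload):
    """Parse a raw email and replace its attachments with links.

    `upload(filename, payload_bytes)` is a hypothetical callback that stores
    the attachment and returns a URL string (for example, a pre-signed URL).
    Returns (recipients, message); message is None if no recipients list
    was found, in which case the caller would notify the fallback topic.
    """
    original = message_from_bytes(raw_bytes, policy=policy.default)
    body_part = original.get_body(preferencelist=("plain",))
    body_text = body_part.get_content() if body_part else ""

    recipients = extract_recipients(body_text)
    if not recipients:
        return [], None

    links = []
    for part in original.iter_attachments():
        filename = part.get_filename() or "attachment"
        payload = part.get_payload(decode=True)  # undoes base64 encoding
        links.append(f"{filename}: {upload(filename, payload)}")

    outgoing = EmailMessage()
    outgoing["Subject"] = original["Subject"] or ""
    text = body_text
    if links:
        text += "\nAttachments:\n" + "\n".join(links) + "\n"
    outgoing.set_content(text)
    return recipients, outgoing
```

Passing the upload step in as a callback keeps the parsing logic independent of any AWS client, so it can be exercised locally with a stub before it is wired to S3 and SES.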

Setting up Amazon SES

For this solution you will need an email account where you can receive emails. You’ll also need a (sub)domain for which you control the mail exchanger (MX) record. You can obtain your (sub)domain either from Amazon Route53 or another domain hosting provider.

Verify the sender email address

You’ll need to follow the instructions to Verify an email address for all identities that you use as “From”, “Source”, “Sender”, or “Return-Path” addresses. You’ll also need to follow these instructions for any identities you wish to send emails to during initial testing while your SES account is in the “Sandbox” (see the “Moving out of the SES Sandbox” section that follows).

Moving out of the SES Sandbox

Amazon SES accounts are “in the Sandbox” by default, limiting email sending only to verified identities. AWS does this to prevent fraud and abuse and to protect your reputation as an email sender. When your account leaves the Sandbox, SES can send email to any recipient, regardless of whether the recipient’s address or domain is verified by SES. However, you still have to verify all identities that you use as “From”, “Source”, “Sender”, or “Return-Path” addresses.
Follow the Moving out of the SES Sandbox instructions in the SES Developer Guide. Approval is usually within 24 hours.

Set up the SES SMTP interface

Follow the workshop lab instructions to set up email sending from your SMTP client using the SES SMTP interface. Once you’ve completed this step, your SMTP client can open authenticated sessions with the SES SMTP interface and send emails. The workshop will guide you through the following steps:

Create SMTP credentials for your SES account.
- IMPORTANT: Never share SMTP credentials with unauthorized individuals. Anyone with these credentials can send as many SMTP requests as they like, in whatever format and with whatever content they choose. This may result in end-users receiving emails with malicious content, administrative/operations overload, and unbounded AWS charges.
Test your connection to ensure you can send emails.
Authenticate using the SMTP credentials generated in step 1 and then send a test email from an SMTP client.

Verify your email domain and bounce notifications with Amazon SES

In order to replace email attachments with a pre-signed URL and other application logic, you’ll need to set up SES to receive emails on a domain or subdomain you control.

Verify the domain that you want to use for receiving emails.
Publish a mail exchanger record (MX record) in the domain DNS configuration that includes the Amazon SES inbound receiving endpoint for your AWS Region (for example, inbound-smtp.us-east-1.amazonaws.com for US East (N. Virginia)).
Amazon SES automatically manages bounce notifications whenever a recipient email is not deliverable. Follow the Set up notifications for bounces and complaints guide to set up bounce notifications.

Deploying the solution

The solution is implemented using AWS CDK with Python. First clone the solution repository to your local machine or Cloud9 development environment. Then deploy the solution by entering the following commands into your terminal:

python -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt

cdk deploy \
 --context SenderEmail=<verified sender email> \
 --context RecipientEmail=<recipient email address> \
 --context ConfigurationSetName=<configuration set name>

Note:

The RecipientEmail CDK context parameter in the cdk deploy command above can be any email address in the domain you verified as part of the Verify the domain step. In other words, if the verified domain is acme-corp.com, then the recipient can be any address at acme-corp.com.

The ConfigurationSetName CDK context parameter can be obtained by navigating to Identities in the Amazon SES console, selecting the verified domain (same as above), switching to the “Configuration set” tab, and selecting the name of the “Default configuration set”.

After deploying the solution, navigate to Amazon SES Email receiving in the AWS console, edit the rule set, and set it to Active.

Testing the solution end-to-end

Create a small file and generate a base64 encoding so that you can attach it to an SMTP message:

echo content >> demo.txt
cat demo.txt | base64 > demo64.txt
cat demo64.txt

Install openssl (which includes an SMTP client capability) using the following command:

sudo yum install openssl

Now run the SMTP client (openssl is used for the proof of concept; be sure to complete the steps in the workshop lab instructions first):

openssl s_client -crlf -quiet -starttls smtp -connect email-smtp.<aws-region>.amazonaws.com:587

and feed in the commands (replacing the brackets [] and everything between them) to send the SMTP message with the attachment you created.

EHLO amazonses.com
AUTH LOGIN
[base64 encoded SMTP user name]
[base64 encoded SMTP password]
MAIL FROM:[VERIFIED EMAIL IN SES]
RCPT TO:[VERIFIED EMAIL WITH SES RECEIPT RULE]
DATA
Subject: Demo from openssl
MIME-Version: 1.0
Content-Type: multipart/mixed;
 boundary="XXXXboundary text"

This is a multipart message in MIME format.

--XXXXboundary text
Content-Type: text/plain

Line1:This is a Test email sent to coded list of email addresses using the Amazon SES SMTP interface from openssl SMTP client.
Line2:Email_Rxers_Code:[ANYUSER1@DOMAIN_A,ANYUSER2@DOMAIN_B,ANYUSERX@DOMAIN_Y]:Email_Rxers_Code:
Line3:Last line.

--XXXXboundary text
Content-Type: text/plain
Content-Transfer-Encoding: Base64
Content-Disposition: attachment; filename="demo64.txt"

Y29udGVudAo=
--XXXXboundary text--
.
QUIT
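
The openssl session above can also be scripted. The following is a minimal sketch using Python's standard `smtplib` and `email` modules; the function names and the example subject are hypothetical, and the send function is defined but not called here because it needs real SES SMTP credentials and verified identities.

```python
import smtplib
from email.message import EmailMessage


def build_demo_message(sender, receipt_address, final_recipients, attachment):
    """Build a message similar to the hand-typed openssl example."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = receipt_address
    msg["Subject"] = "Demo from smtplib"
    rx_list = ",".join(final_recipients)
    msg.set_content(
        "Line1:This is a test email sent to a coded list of email addresses.\n"
        f"Line2:Email_Rxers_Code:[{rx_list}]:Email_Rxers_Code:\n"
        "Line3:Last line.\n"
    )
    # add_attachment base64-encodes the bytes and sets Content-Disposition,
    # turning the message into multipart/mixed with proper boundaries.
    msg.add_attachment(attachment, maintype="application",
                       subtype="octet-stream", filename="demo.txt")
    return msg


def send_via_ses_smtp(msg, region, smtp_user, smtp_password):
    """Open a STARTTLS session on port 587 and send the message."""
    host = f"email-smtp.{region}.amazonaws.com"
    with smtplib.SMTP(host, 587) as server:
        server.starttls()
        # smtplib encodes the credentials itself; pass the raw values,
        # not the base64 strings used in the manual AUTH LOGIN exchange.
        server.login(smtp_user, smtp_password)
        server.send_message(msg)
```

Letting the library assemble the MIME structure avoids hand-typing boundaries and blank lines, which are easy to get wrong in an interactive session.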

Note: For the base64-encoded SMTP username and password above, use the values obtained in Set up the SES SMTP interface, step 1. For example, if the username is AKZB3LJAF5TQQRRPQZO1, you can obtain the base64-encoded value using the following command:

echo -n AKZB3LJAF5TQQRRPQZO1 |base64
QUtaQjNMSkFGNVRRUVJSUFFaTzE=

This makes the base64-encoded value QUtaQjNMSkFGNVRRUVJSUFFaTzE=. Repeat the same process for the SMTP password.
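
If a shell is not available, the same encoding can be done with Python's standard `base64` module. This is a small sketch; the helper name `b64` is hypothetical, and the username is the example value from the note above.

```python
import base64


def b64(text):
    """Base64-encode a string exactly as given (no trailing newline added)."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


# Matches `echo -n AKZB3LJAF5TQQRRPQZO1 | base64`.
print(b64("AKZB3LJAF5TQQRRPQZO1"))  # QUtaQjNMSkFGNVRRUVJSUFFaTzE=

# Without -n, echo appends a newline, which changes the encoding. That is
# why the demo.txt attachment created earlier encodes to a value ending in "Ao=".
print(b64("content\n"))  # Y29udGVudAo=
```

A stray trailing newline in an encoded credential causes authentication to fail, so it is worth checking that the value was encoded without one.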

The openssl command should result in successful SMTP authentication and send. You should then receive the processed email, with the attachment replaced by a pre-signed URL in the email body.

Optimizing Security of the Solution

Do not share DNS credentials. Unauthorized access can lead to domain control, potential denial of service, and AWS charges. Restrict access to authorized personnel only.
Do not set the SENDER_EMAIL environment variable to the email address associated with the receipt rule. This address is a closely guarded secret, known only to administrators, and should be changed frequently.
Review access to your code repository regularly to ensure there are no unauthorized changes to your code base.
Utilize Permissions Boundaries to restrict the actions permitted by an IAM user or role.

Cleanup

To clean up, start by navigating to Amazon SES Email receiving in the AWS console, and setting the rule set to Inactive.

Once completed, delete the stack:

cdk destroy

Cleanup AWS SES Access Credentials

In Amazon SES Console, select Manage existing SMTP credentials, select the username for which credentials were created in Set up the SES SMTP interface above, navigate to the Security credentials tab and in the Access keys section, select Action -> Delete to delete AWS SES access credentials.

Troubleshooting

If you are not receiving the email, or the email is not being sent correctly, check the following common causes:

Error 554 Message rejected: Email address is not verified. The following identities failed the check in region :
- This means that you attempted to send an email from an address that has not been verified.
- Ensure that the “MAIL FROM:[VERIFIED EMAIL IN SES]” email address sent via openssl matches the SenderEmail=<verified sender email> email address used in cdk deploy.
- Also make sure that this email address was verified in the Verify the sender email address step.
Email is not being delivered/forwarded
- The incoming S3 bucket, under the incoming prefix, contains a file called AMAZON_SES_SETUP_NOTIFICATION. This means that the MX record for the domain is missing. Validate that the MX record (step 2 of the Verify your email domain and bounce notifications with Amazon SES section) is fully configured.
- Ensure that, after deploying the solution, the created rule set was made active by navigating to Amazon SES Email receiving in the AWS console and setting it to Active.
- The destination email address may have bounced. Navigate to the Amazon SES Suppression list in the AWS console and ensure that the recipient’s email is not in the suppression list. If it is listed, you can see the reason in the “Suppression reason” column. You can either manually remove the address from the suppression list or, if the recipient email is not valid, use a different recipient email address.

AWS Legal Disclaimer: Sample code, software libraries, command line tools, proofs of concept, templates, or other related technology are provided as AWS Content or Third-Party Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content or Third-Party Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content or Third-Party Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content or Third-Party Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.

About the Authors

Tarek Soliman

Tarek is a Senior Solutions Architect at AWS. His background is in Software Engineering with a focus on distributed systems. He is passionate about diving into customer problems and solving them. He also enjoys building things using software, woodworking, and hobby electronics.

Dave Spencer

Dave is a Senior Solutions Architect at AWS. His background is in cloud solutions architecture, Infrastructure as Code (IaC), systems engineering, and embedded systems programming. Dave’s passion is developing partnerships with Department of Defense customers to maximize technology investments and realize their strategic vision.

Ayman Ishimwe

Ayman is a Solutions Architect at AWS based in Seattle, Washington. He holds a Master’s degree in Software Engineering and IT from Oakland University. With prior experience in software development, specifically in building microservices for distributed web applications, he is passionate about helping customers build robust and scalable solutions on AWS cloud services following best practices.

Dmytro Protsiv

Dmytro is a Cloud Applications Architect with Amazon Web Services. He is passionate about helping customers to solve their business challenges around application modernization.

Stacy Conant

Stacy is a Solutions Architect working with DoD and US Navy customers. She enjoys helping customers understand how to harness big data and working on data analytics solutions. On the weekends, you can find Stacy crocheting, reading Harry Potter (again), playing with her dogs and cooking with her husband.

AWS Weekly Roundup: New features on Knowledge Bases for Amazon Bedrock, OAC for Lambda function URL origins on Amazon CloudFront, and more (April 15, 2024)

2024-04-15 Veliswa Boya

Post Syndicated from Veliswa Boya original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-new-features-on-knowledge-bases-for-amazon-bedrock-oac-for-lambda-function-url-origins-on-amazon-cloudfront-and-more-april-15-2024/

AWS Community Days conferences are in full swing with AWS communities around the globe. The AWS Community Day Poland was hosted last week with more than 600 cloud enthusiasts in attendance. Community speakers Agnieszka Biernacka, Krzysztof Kąkol, and more, presented talks which captivated the audience and resulted in vibrant discussions throughout the day. My teammate, Wojtek Gawroński, was at the event and he’s already looking forward to attending again next year!

Last week’s launches
Here are some launches that got my attention during the previous week.

Amazon CloudFront now supports Origin Access Control (OAC) for Lambda function URL origins – Now you can protect your AWS Lambda URL origins by using Amazon CloudFront Origin Access Control (OAC) to only allow access from designated CloudFront distributions. The CloudFront Developer Guide has more details on how to get started using CloudFront OAC to authenticate access to Lambda function URLs from your designated CloudFront distributions.

AWS Client VPN and AWS Verified Access migration and interoperability patterns – If you’re using AWS Client VPN or a similar third-party VPN-based solution to provide secure access to your applications today, you’ll be pleased to know that you can now combine the use of AWS Client VPN and AWS Verified Access for your new or existing applications.

These two announcements related to Knowledge Bases for Amazon Bedrock caught my eye:

Metadata filtering to improve retrieval accuracy – With metadata filtering, you can retrieve not only semantically relevant chunks but a well-defined subset of those relevant chunks based on applied metadata filters and associated values.

Custom prompts for the RetrieveAndGenerate API and configuration of the maximum number of retrieved results – These are two new features which you can now choose as query options alongside the search type to give you control over the search results. These are retrieved from the vector store and passed to the Foundation Models for generating the answer.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS news
AWS open source news and updates – My colleague Ricardo writes this weekly open source newsletter in which he highlights new open source projects, tools, and demos from the AWS Community.

Upcoming AWS events
AWS Summits – These are free online and in-person events that bring the cloud computing community together to connect, collaborate, and learn about AWS. Whether you’re in the Americas, Asia Pacific & Japan, or EMEA region, learn here about future AWS Summit events happening in your area.

AWS Community Days – Join an AWS Community Day event just like the one I mentioned at the beginning of this post to participate in technical discussions, workshops, and hands-on labs led by expert AWS users and industry leaders from your area. If you’re in Kenya, or Nepal, there’s an event happening in your area this coming weekend.

You can browse all upcoming in-person and virtual events here.

That’s all for this week. Check back next Monday for another Weekly Roundup!

– Veliswa

This post is part of our Weekly Roundup series. Check back each week for a quick roundup of interesting news and announcements from AWS.

Accelerate security automation using Amazon CodeWhisperer

2024-04-15 Brendan Jenkins

Post Syndicated from Brendan Jenkins original https://aws.amazon.com/blogs/security/accelerate-security-automation-using-amazon-codewhisperer/

In an ever-changing security landscape, teams must be able to quickly remediate security risks. Many organizations look for ways to automate the remediation of security findings that are currently handled manually. Amazon CodeWhisperer is an artificial intelligence (AI) coding companion that generates real-time, single-line or full-function code suggestions in your integrated development environment (IDE) to help you quickly build software. By using CodeWhisperer, security teams can expedite the process of writing security automation scripts for various types of findings that are aggregated in AWS Security Hub, a cloud security posture management (CSPM) service.

In this post, we present some of the current challenges with security automation and walk you through how to use CodeWhisperer, together with Amazon EventBridge and AWS Lambda, to automate the remediation of Security Hub findings. Before reading further, please read the AWS Responsible AI Policy.

Current challenges with security automation

Many approaches to security automation, including Lambda and AWS Systems Manager Automation, require software development skills. Furthermore, manually writing remediation code can be time-consuming for security professionals. To help overcome these challenges, CodeWhisperer serves as a force multiplier for qualified security professionals with development experience to quickly and effectively generate code to help remediate security findings.

Security professionals should still cultivate software development skills to implement robust solutions. Engineers should thoroughly review and validate any generated code, as manual oversight remains critical for security.

Solution overview

Figure 1 shows how the findings that Security Hub produces are ingested by EventBridge, which then invokes Lambda functions for processing. The Lambda code is generated with the help of CodeWhisperer.

Figure 1: Diagram of the solution

Security Hub integrates with EventBridge so you can automatically process findings with other services such as Lambda. To begin remediating the findings automatically, you can configure rules to determine where to send findings. This solution will do the following:

Ingest an AWS Security Hub finding into EventBridge.
Use an EventBridge rule to invoke a Lambda function for processing.
Use CodeWhisperer to generate the Lambda function code.

It is important to note that there are two types of automation for Security Hub finding remediation:

Partial automation, which is initiated when a human worker selects the Security Hub findings manually and applies the automated remediation workflow to the selected findings.
End-to-end automation, which means that when a finding is generated within Security Hub, this initiates an automated workflow to immediately remediate without human intervention.

Important: When you use end-to-end automation, we highly recommend that you thoroughly test the efficiency and impact of the workflow in a non-production environment first before moving forward with implementation in a production environment.

Prerequisites

To follow along with this walkthrough, make sure that you have the following prerequisites in place:

An AWS account
Visual Studio Code (VS Code) or supported JetBrains IDEs
Python
CodeWhisperer enabled locally in your IDE
AWS Config enabled
Security Hub enabled, with the NIST Special Publication 800-53 Revision 5 standard selected

Implement security automation

In this scenario, you have been tasked with making sure that versioning is enabled across all Amazon Simple Storage Service (Amazon S3) buckets in your AWS account. Additionally, you want to do this in a way that is programmatic and automated so that it can be reused in different AWS accounts in the future.

To do this, you will perform the following steps:

Generate the remediation script with CodeWhisperer
Create the Lambda function
Integrate the Lambda function with Security Hub by using EventBridge
Create a custom action in Security Hub
Create an EventBridge rule to target the Lambda function
Run the remediation

Generate a remediation script with CodeWhisperer

The first step is to use VS Code to create a script so that CodeWhisperer generates the code for your Lambda function in Python. You will use this Lambda function to remediate the Security Hub findings generated by the control [S3.14] S3 buckets should use versioning.

Note: The underlying model of CodeWhisperer is powered by generative AI, and the output of CodeWhisperer is nondeterministic. As such, the code recommended by the service can vary by user. By modifying the initial code comment to prompt CodeWhisperer for a response, customers can change the corresponding output to help meet their needs. Customers should subject all code generated by CodeWhisperer to typical testing and review protocols to verify that it is free of errors and is in line with applicable organizational security policies. To learn about best practices on prompt engineering with CodeWhisperer, see this AWS blog post.

To generate the remediation script

Open a new VS Code window, and then open or create a new folder for your file to reside in.
Create a Python file called cw-blog-remediation.py as shown in Figure 2.

Figure 2: New VS Code file created called cw-blog-remediation.py
Add the following imports to the Python file.
```
import json
import boto3
```
Because you have the context added to your file, you can now prompt CodeWhisperer by using a natural language comment. In your file, below the import statements, enter the following comment and then press Enter.
```
# Create lambda function that turns on versioning for an S3 bucket after the function is triggered from Amazon EventBridge
```
Accept the first recommendation that CodeWhisperer provides by pressing Tab to use the Lambda function handler, as shown in Figure 3.

Figure 3: Generation of Lambda handler

To get the recommendation for the function from CodeWhisperer, press Enter. Make sure that the recommendation you receive looks similar to the following. CodeWhisperer is nondeterministic, so its recommendations can vary.

import json
import boto3

# Create lambda function that turns on versioning for an S3 bucket after function is triggered from Amazon EventBridge
def lambda_handler(event, context):
    s3 = boto3.client('s3')
    bucket = event['detail']['requestParameters']['bucketName']
    response = s3.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={
            'Status': 'Enabled'
        }
    )
    print(response)
    return {
        'statusCode': 200,
        'body': json.dumps('Versioning enabled for bucket ' + bucket)
    }

Take a moment to review the user actions and keyboard shortcut keys. Press Tab to accept the recommendation.
You can change the function body to fit your use case. To get the Amazon Resource Name (ARN) of the S3 bucket from the EventBridge event, replace the bucket variable with the following line:
```
bucket = event['detail']['findings'][0]['Resources'][0]['Id']
```

To prompt CodeWhisperer to extract the bucket name from the bucket ARN, use the following comment:

# Take the S3 bucket name from the ARN of the S3 bucket

Your function code should look similar to the following:

import json
import boto3

# Create lambda function that turns on versioning for an S3 bucket after function is triggered from Amazon EventBridge
def lambda_handler(event, context):
    s3 = boto3.client('s3')
    bucket = event['detail']['findings'][0]['Resources'][0]['Id']
    # Take the S3 bucket name from the ARN of the S3 bucket
    bucket = bucket.split(':')[5]

    response = s3.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={
            'Status': 'Enabled'
        }
    )
    print(response)
    return {
        'statusCode': 200,
        'body': json.dumps('Versioning enabled for bucket ' + bucket)
    }

Create a .zip file for cw-blog-remediation.py. Find the file in your local file manager, right-click the file, and select compress/zip. You will use this .zip file in the next section of the post.
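
If you prefer to script the packaging step instead of using the file manager, a minimal sketch with Python's standard `zipfile` module follows. The helper name `zip_handler` and the archive name in the usage comment are hypothetical.

```python
import zipfile
from pathlib import Path


def zip_handler(source, archive):
    """Create `archive` containing `source` at the archive root."""
    source = Path(source)
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        # arcname drops any directory prefix so the module sits at the
        # top level of the archive, where Lambda looks for the handler file.
        zf.write(source, arcname=source.name)
    return Path(archive)


# Example usage (run from the folder that contains the script):
# zip_handler("cw-blog-remediation.py", "cw-blog-remediation.zip")
```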

Create the Lambda function

The next step is to use the automation script that you generated to create the Lambda function that will enable versioning on applicable S3 buckets.

To create the Lambda function

Open the AWS Lambda console.
In the left navigation pane, choose Functions, and then choose Create function.
Select Author from Scratch and provide the following configurations for the function:
1. For Function name, enter sec_remediation_function.
2. For Runtime, select Python 3.12.
3. For Architecture, select x86_64.
4. For Permissions, select Create a new role with basic Lambda permissions.
Choose Create function.
To upload your local code to Lambda, select Upload from and then .zip file, and then upload the file that you zipped.
Verify that you created the Lambda function successfully. In the Code source section of Lambda, you should see the code from the automation script displayed in a new tab, as shown in Figure 4.

Figure 4: Source code that was successfully uploaded
Choose the Code tab.
Scroll down to the Runtime settings pane and choose Edit.
For Handler, enter cw-blog-remediation.lambda_handler for your function handler, and then choose Save, as shown in Figure 5.

Figure 5: Updated Lambda handler
For security purposes, and to follow the principle of least privilege, you should also add an inline policy to the Lambda function’s role to perform the tasks necessary to enable versioning on S3 buckets.
1. In the Lambda console, navigate to the Configuration tab and then, in the left navigation pane, choose Permissions. Choose the Role name, as shown in Figure 6.
  
  Figure 6: Lambda role in the AWS console
2. In the Add permissions dropdown, select Create inline policy.
  
  Figure 7: Create inline policy
3. Choose JSON, add the following policy to the policy editor, and then choose Next.
```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "s3:PutBucketVersioning",
            "Resource": "*"
        }
    ]
}
```
4. Name the policy PutBucketVersioning and choose Create policy.
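
The policy above allows s3:PutBucketVersioning on every bucket ("Resource": "*"). If the remediation only needs to cover a known set of buckets, the resource list can be narrowed. The following is a sketch under stated assumptions: the helper name and bucket names are hypothetical, and the ARNs assume the standard `aws` partition.

```python
import json


def versioning_policy(bucket_names):
    """Build an inline policy limited to the given buckets."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowPutBucketVersioning",
                "Effect": "Allow",
                "Action": "s3:PutBucketVersioning",
                # Bucket-level action, so the resource is the bucket ARN
                # itself (no trailing /* object path).
                "Resource": [f"arn:aws:s3:::{name}" for name in bucket_names],
            }
        ],
    }


# Hypothetical bucket names, for illustration only.
print(json.dumps(versioning_policy(["example-bucket-a", "example-bucket-b"]), indent=4))
```

The trade-off is maintenance: a wildcard resource covers buckets created later, while an explicit list must be updated whenever the set of buckets in scope changes.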

Create a custom action in Security Hub

In this step, you will create a custom action in Security Hub.

To create the custom action

Open the Security Hub console.
In the left navigation pane, choose Settings, and then choose Custom actions.
Choose Create custom action.
Provide the following information, as shown in Figure 8:
- For Name, enter TurnOnS3Versioning.
- For Description, enter Action that will turn on versioning for a specific S3 bucket.
- For Custom action ID, enter TurnOnS3Versioning.
  
  Figure 8: Create a custom action in Security Hub
Choose Create custom action.
Make a note of the Custom action ARN. You will need this ARN when you create a rule to associate with the custom action in EventBridge.

Create an EventBridge rule to target the Lambda function

The next step is to create an EventBridge rule to capture the custom action. You will define an EventBridge rule that matches events (in this case, findings) from Security Hub that were forwarded by the custom action that you defined previously.

To create the EventBridge rule

Navigate to the EventBridge console.
On the right side, choose Create rule.
On the Define rule detail page, give your rule a name and description that represents the rule’s purpose—for example, you could use the same name and description that you used for the custom action. Then choose Next.
Scroll down to Event pattern, and then do the following:
1. For Event source, make sure that AWS services is selected.
2. For AWS service, select Security Hub.
3. For Event type, select Security Hub Findings – Custom Action.
4. Select Specific custom action ARN(s) and enter the ARN for the custom action that you created earlier.
Figure 9: Specify the EventBridge event pattern for the Security Hub custom action workflow

As you provide this information, the Event pattern updates.
Choose Next.
On the Select target(s) step, in the Select a target dropdown, select Lambda function. Then from the Function dropdown, select sec_remediation_function.
Choose Next.
On the Configure tags step, choose Next.
On the Review and create step, choose Create rule.
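
For reference, the event pattern that the console builds in these steps can be sketched as a Python dictionary. The custom action ARN below is a hypothetical placeholder (use the ARN you noted earlier), and the `matches` helper is a deliberately tiny stand-in for EventBridge matching that only handles this flat pattern.

```python
import json


def custom_action_pattern(custom_action_arn):
    """Event pattern for findings forwarded by one Security Hub custom action."""
    return {
        "source": ["aws.securityhub"],
        # The detail-type string uses a plain hyphen.
        "detail-type": ["Security Hub Findings - Custom Action"],
        "resources": [custom_action_arn],
    }


def matches(pattern, event):
    """Check a flat pattern: every key must have a value in the allowed list."""
    for key, allowed in pattern.items():
        value = event.get(key)
        values = value if isinstance(value, list) else [value]
        if not any(v in allowed for v in values):
            return False
    return True


# Hypothetical ARN, for illustration only.
ARN = "arn:aws:securityhub:us-east-1:111122223333:action/custom/TurnOnS3Versioning"
print(json.dumps(custom_action_pattern(ARN), indent=2))
```

Because the pattern filters on the custom action ARN in `resources`, findings sent through other custom actions in the same account do not invoke this Lambda function.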

Run the automation

Your automation is set up and you can now test the automation. This test covers a partial automation workflow, since you will manually select the finding and apply the remediation workflow to one or more selected findings.

Important: As we mentioned earlier, if you decide to make the automation end-to-end, you should assess the impact of the workflow in a non-production environment. Additionally, you may want to consider creating preventative controls if you want to minimize the risk of event occurrence across an entire environment.
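
Because the workflow can be applied to more than one selected finding, the handler may receive several findings in a single event, whereas the generated function only reads the first one. The following variation is a sketch, not the code CodeWhisperer produced: the helper names and the optional `s3` parameter (used to inject a stub client for local testing) are assumptions, and it filters on the `AwsS3Bucket` resource type.

```python
import json


def bucket_name_from_arn(arn):
    """'arn:aws:s3:::my-bucket' -> 'my-bucket'; raise ValueError otherwise."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "s3" or not parts[5]:
        raise ValueError(f"not an S3 bucket ARN: {arn!r}")
    return parts[5]


def buckets_in_event(event):
    """Collect bucket names from every finding and resource in the event."""
    names = []
    for finding in event["detail"]["findings"]:
        for resource in finding.get("Resources", []):
            if resource.get("Type") == "AwsS3Bucket":
                names.append(bucket_name_from_arn(resource["Id"]))
    return names


def lambda_handler(event, context, s3=None):
    if s3 is None:
        import boto3  # imported lazily so the helpers can be tested offline
        s3 = boto3.client("s3")
    buckets = buckets_in_event(event)
    for bucket in buckets:
        s3.put_bucket_versioning(
            Bucket=bucket,
            VersioningConfiguration={"Status": "Enabled"},
        )
    return {
        "statusCode": 200,
        "body": json.dumps({"versioning_enabled": buckets}),
    }
```

Validating the ARN before calling Amazon S3 makes a malformed or unexpected resource fail loudly instead of attempting to change an unintended bucket.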

To run the automation

In the Security Hub console, on the Findings tab, add a filter by entering Title in the search box and selecting that filter. Select IS and enter S3 general purpose buckets should have versioning enabled (case sensitive). Choose Apply.
In the filtered list, choose the Title of an active finding.
Before you start the automation, check the current configuration of the S3 bucket to confirm that your automation works. Expand the Resources section of the finding.
Under Resource ID, choose the link for the S3 bucket. This opens a new tab on the S3 console that shows only this S3 bucket.
In your browser, go back to the Security Hub tab (don’t close the S3 tab—you will need to return to it), and on the left side, select this same finding, as shown in Figure 10.

Figure 10: Filter out Security Hub findings to list only S3 bucket-related findings
In the Actions dropdown list, choose the name of your custom action.

Figure 11: Choose the custom action that you created to start the remediation workflow
When you see a banner that displays Successfully started action…, go back to the S3 browser tab and refresh it. Verify that the S3 versioning configuration on the bucket has been enabled, as shown in Figure 12.

Figure 12: Versioning successfully enabled

Conclusion

In this post, you learned how to use CodeWhisperer to produce AI-generated code for custom remediations for a security use case. We encourage you to experiment with CodeWhisperer to create Lambda functions that remediate other Security Hub findings that might exist in your account, such as the enforcement of lifecycle policies on S3 buckets with versioning enabled, or using automation to remove multiple unused Amazon EC2 elastic IP addresses. The ability to automatically set public S3 buckets to private is just one of many use cases where CodeWhisperer can generate code to help you remediate Security Hub findings.

To sum up, CodeWhisperer acts as a tool that can help boost the productivity of security experts who have coding abilities, assisting them to swiftly write code to address security issues. However, security specialists should continue building their software development capabilities to implement robust solutions. Engineers should carefully review and test any generated code, since human oversight is still vital for security.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Detecting and remediating inactive user accounts with Amazon Cognito

2024-04-09 Harun Abdi

Post Syndicated from Harun Abdi original https://aws.amazon.com/blogs/security/detecting-and-remediating-inactive-user-accounts-with-amazon-cognito/

For businesses, particularly those in highly regulated industries, managing user accounts isn’t just a matter of security but also a compliance necessity. In sectors such as finance, healthcare, and government, where regulations often mandate strict control over user access, disabling stale user accounts is a key compliance activity. In this post, we show you a solution that uses serverless technologies to track and disable inactive user accounts. While this process is particularly relevant for those in regulated industries, it can also be beneficial for other organizations looking to maintain a clean and secure user base.

The solution focuses on identifying inactive user accounts in Amazon Cognito and automatically disabling them. Disabling a user account in Cognito effectively restricts the user’s access to applications and services linked with the Amazon Cognito user pool. After their account is disabled, the user cannot sign in, access tokens are revoked for their account, and they are unable to perform API operations that require user authentication. However, the user’s data and profile within the Cognito user pool remain intact. If necessary, the account can be re-enabled, allowing the user to regain access and functionality.

While the solution focuses on the example of a single Amazon Cognito user pool in a single account, you also learn considerations for multi-user pool and multi-account strategies.

Solution overview

In this section, you learn how to configure an AWS Lambda function that captures the latest sign-in records of users authenticated by Amazon Cognito and writes this data to an Amazon DynamoDB table. A time-to-live (TTL) indicator is set on each of these records based on the user inactivity threshold parameter defined when deploying the solution. This TTL represents the maximum period a user can go without signing in before their account is disabled. As these items reach their TTL expiry in DynamoDB, a second Lambda function is invoked to process the expired items and disable the corresponding user accounts in Cognito. For example, if the user inactivity threshold is configured to be 7 days, the accounts of users who don’t sign in within 7 days of their last sign-in will be disabled. Figure 1 shows an overview of the process.

Note: This solution functions as a background process and doesn’t disable user accounts in real time. This is because DynamoDB Time to Live (TTL) deletes expired items in the background on a best effort basis, and because the solution is designed to remain within the constraints of the Amazon Cognito quotas. Set your users’ and administrators’ expectations accordingly, acknowledging that there might be a delay before an inactive account is disabled.

Figure 1: Architecture diagram for tracking user activity and disabling inactive Amazon Cognito users

As shown in Figure 1, this process involves the following steps:

An application user signs in by authenticating to Amazon Cognito.
Upon successful user authentication, Cognito initiates a post authentication Lambda trigger invoking the PostAuthProcessorLambda function.
The PostAuthProcessorLambda function puts an item in the LatestPostAuthRecordsDDB DynamoDB table with the following attributes:
1. sub: A unique identifier for the authenticated user within the Amazon Cognito user pool.
2. timestamp: The time of the user’s latest sign-in, formatted in UTC ISO standard.
3. username: The authenticated user’s Cognito username.
4. userpool_id: The identifier of the user pool to which the user authenticated.
5. ttl: The TTL value, a Unix epoch timestamp in seconds, after which a user’s inactivity will initiate account deactivation.
Items in the LatestPostAuthRecordsDDB DynamoDB table are automatically purged upon reaching their TTL expiry, launching events in DynamoDB Streams.
DynamoDB Streams events are filtered to allow invocation of the DDBStreamProcessorLambda function only for TTL deleted items.
The DDBStreamProcessorLambda function runs to disable the corresponding user accounts in Cognito.
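
The item written in step 3 can be sketched as follows. This is a minimal illustration, not the solution's actual code: the function name build_sign_in_item and its parameters are hypothetical, and it assumes the standard shape of a Cognito post authentication event (userName, userPoolId, and request.userAttributes.sub).

```python
from datetime import datetime, timedelta, timezone


def build_sign_in_item(event, threshold_days, now=None):
    """Build the sign-in record for one Cognito post authentication event.

    Hypothetical helper: the real PostAuthProcessorLambda may be structured differently.
    """
    now = now or datetime.now(timezone.utc)
    expires_at = now + timedelta(days=threshold_days)
    return {
        "sub": event["request"]["userAttributes"]["sub"],
        "timestamp": now.isoformat(),  # latest sign-in, UTC ISO 8601
        "username": event["userName"],
        "userpool_id": event["userPoolId"],
        # DynamoDB TTL expects a Unix epoch time in seconds.
        "ttl": int(expires_at.timestamp()),
    }
```

A handler would pass the returned dictionary to a DynamoDB put operation and then return the event unchanged, because Cognito expects a post authentication trigger to hand the event back.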

Implementation details

In this section, you’re guided through deploying the solution, demonstrating how to integrate it with your existing Amazon Cognito user pool and exploring the solution in more detail.

Note: This solution begins tracking user activity from the moment of its deployment. It can’t retroactively track or manage user activities that occurred prior to its implementation. To make sure the solution disables currently inactive users in the first TTL period after deploying the solution, you should do a one-time preload of those users into the DynamoDB table. If this isn’t done, the currently inactive users won’t be detected because users are detected as they sign in. For the same reason, users who create accounts but never sign in won’t be detected either. To detect user accounts that sign up but never sign in, implement a post confirmation Lambda trigger to invoke a Lambda function that processes user sign-up records and writes them to the DynamoDB table.

Prerequisites

Before deploying this solution, you must have the following prerequisites in place:

An existing Amazon Cognito user pool. This user pool is the foundation upon which the solution operates. If you don’t have a Cognito user pool set up, you must create one before proceeding. See Creating a user pool.
The ability to launch an AWS CloudFormation template in your AWS environment. The template provisions the necessary AWS services, including Lambda functions, a DynamoDB table, and AWS Identity and Access Management (IAM) roles that are integral to the solution. The template simplifies the deployment process, allowing you to set up the entire solution with minimal manual configuration. You must have the necessary permissions in your AWS account to launch CloudFormation stacks and provision these services.

To deploy the solution

Choose the following Launch Stack button to deploy the solution’s CloudFormation template:

The solution deploys in the AWS US East (N. Virginia) Region (us-east-1) by default. To deploy the solution in a different Region, use the Region selector in the console navigation bar and make sure that the services required for this walkthrough are supported in your newly selected Region. For service availability by Region, see AWS Services by Region.
On the Quick Create Stack screen, do the following:
1. Specify the stack details.
  1. Stack name: The stack name is an identifier that helps you find a particular stack from a list of stacks. A stack name can contain only alphanumeric characters (case sensitive) and hyphens. It must start with an alphabetic character and can’t be longer than 128 characters.
  2. CognitoUserPoolARNs: A comma-separated list of Amazon Cognito user pool Amazon Resource Names (ARNs) to monitor for inactive users.
  3. UserInactiveThresholdDays: Time (in days) that the user account is allowed to be inactive before it’s disabled.
2. Scroll to the bottom, and in the Capabilities section, select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
3. Choose Create Stack.

Integrate with your existing user pool

With the CloudFormation template deployed, you can set up Lambda triggers in your existing user pool. This is a key step for tracking user activity.

Note: This walkthrough uses the new AWS Management Console experience. Alternatively, you can complete these steps using CloudFormation.

To integrate with your existing user pool

Navigate to the Amazon Cognito console and select your user pool.
Navigate to User pool properties.
Under Lambda triggers, choose Add Lambda trigger. Select the Authentication radio button, then add a Post authentication trigger and assign the PostAuthProcessorLambda function.

Note: Amazon Cognito allows you to set up one Lambda trigger per event. If you already have a configured post authentication Lambda trigger, you can refactor the existing Lambda function, adding new features directly to minimize the cold starts associated with invoking additional functions (for more information, see Anti-patterns in Lambda-based applications). Keep in mind that when Cognito calls your Lambda function, the function must respond within 5 seconds. If it doesn’t and if the call can be retried, Cognito retries the call. After three unsuccessful attempts, the function times out. You can’t change this 5-second timeout value.

Figure 2: Add a post-authentication Lambda trigger and assign a Lambda function

When you add a Lambda trigger in the Amazon Cognito console, Cognito adds a resource-based policy to your function that permits your user pool to invoke the function. When you create a Lambda trigger outside of the Cognito console, including a cross-account function, you must add permissions to the resource-based policy of the Lambda function. Your added permissions must allow Cognito to invoke the function on behalf of your user pool. You can add permissions from the Lambda console or use the Lambda AddPermission API operation. To configure this in CloudFormation, you can use the AWS::Lambda::Permission resource.
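
For triggers created outside the Cognito console, the permission described above might be assembled as in the following sketch. The helper name and the default statement ID are hypothetical. The keys are the parameters of the Lambda AddPermission operation, with the Cognito service principal and the user pool ARN as the source.

```python
def cognito_invoke_permission(function_name, user_pool_arn,
                              statement_id="AllowCognitoInvoke"):
    """Return AddPermission parameters that let one user pool invoke a function.

    Hypothetical helper; statement_id is an arbitrary label chosen for this example.
    """
    return {
        "FunctionName": function_name,
        "StatementId": statement_id,
        "Action": "lambda:InvokeFunction",
        "Principal": "cognito-idp.amazonaws.com",
        # Restrict the grant to a single user pool rather than all of Cognito.
        "SourceArn": user_pool_arn,
    }
```

With boto3 available, the result could be applied with boto3.client("lambda").add_permission(**params). Scoping the grant with SourceArn keeps other user pools from invoking the function.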

Explore the solution

The solution should now be operational. It’s configured to begin monitoring user sign-in activities and automatically disable inactive user accounts according to the user inactivity threshold. Use the following procedures to test the solution:

Note: When testing the solution, you can set the UserInactiveThresholdDays CloudFormation parameter to 0. This minimizes the time it takes for user accounts to be disabled.

Step 1: User authentication

Create a user account (if one doesn’t exist) in the Amazon Cognito user pool integrated with the solution.
Authenticate to the Cognito user pool integrated with the solution.

Figure 3: Example user signing in to the Amazon Cognito hosted UI

Step 2: Verify the sign-in record in DynamoDB

Confirm the sign-in record was successfully put in the LatestPostAuthRecordsDDB DynamoDB table.

Navigate to the DynamoDB console.
Select the LatestPostAuthRecordsDDB table.
Select Explore Table Items.
Locate the sign-in record associated with your user.

Figure 4: Locating the sign-in record associated with the signed-in user

Step 3: Confirm user deactivation in Amazon Cognito

After the TTL expires, validate that the user account is disabled in Amazon Cognito.

Navigate to the Amazon Cognito console.
Select the relevant Cognito user pool.
Under Users, select the specific user.
Verify the Account status in the User information section.

Figure 5: Screenshot of the user that signed in with their account status set to disabled

Note: TTL typically deletes expired items within a few days. Depending on the size and activity level of a table, the actual delete operation of an expired item can vary. TTL deletes items on a best effort basis, and deletion might take longer in some cases.

The user’s account is now disabled. A disabled user account can’t be used to sign in, but still appears in the responses to GetUser and ListUsers API requests.

Design considerations

In this section, you dive deeper into the key components of this solution.

DynamoDB schema configuration:

The DynamoDB schema has the Amazon Cognito sub attribute as the partition key. The Cognito sub is a globally unique user identifier within Cognito user pools that cannot be changed. This configuration ensures each user has a single entry in the table, even if the solution is configured to track multiple user pools. See Other considerations for more about tracking multiple user pools.

Using DynamoDB Streams and Lambda to disable TTL deleted users

This solution uses DynamoDB TTL and DynamoDB Streams alongside Lambda to process user sign-in records. The TTL feature automatically deletes items past their expiration time without write throughput consumption. The deleted items are captured by DynamoDB Streams and processed using Lambda. You also apply event filtering within the Lambda event source mapping, ensuring that the DDBStreamProcessorLambda function is invoked exclusively for TTL-deleted items (see the following code example for the JSON filter pattern). This approach reduces invocations of the Lambda functions, simplifies code, and reduces overall cost.

{
    "Filters": [
        {
            "Pattern": { "userIdentity": { "type": ["Service"], "principalId": ["dynamodb.amazonaws.com"] } }
        }
    ]
}
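
To make the effect of this pattern concrete, the following sketch applies the same check in Python to a DynamoDB Streams record. The function name is hypothetical, and in the deployed solution the filtering happens in the event source mapping before the function is invoked, not in function code.

```python
def is_ttl_delete(record):
    """Return True if a stream record carries the TTL service identity.

    Mirrors the filter pattern: userIdentity.type == "Service" and
    userIdentity.principalId == "dynamodb.amazonaws.com".
    """
    identity = record.get("userIdentity") or {}
    return (
        identity.get("type") == "Service"
        and identity.get("principalId") == "dynamodb.amazonaws.com"
    )
```

Records for deletes issued by an application or a user don't carry this identity, so they fail the check and would not invoke the function.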

Handling API quotas:

The DDBStreamProcessorLambda function is configured to comply with the AdminDisableUser API’s quota limits. It processes messages in batches of 25, with a parallelization factor of 1. This makes sure that the solution remains within the nonadjustable 25 requests per second (RPS) limit for AdminDisableUser, avoiding potential API throttling. For more details on these limits, see Quotas in Amazon Cognito.
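
One invocation handling a batch of TTL-deleted records could look like the sketch below. It is an illustration under stated assumptions, not the solution's code: the function name is hypothetical, the stream is assumed to include the old item image, and the username and userpool_id attributes are assumed to be stored as strings. The cognito_client argument stands in for a boto3 "cognito-idp" client, whose admin_disable_user operation takes UserPoolId and Username.

```python
def disable_expired_users(records, cognito_client):
    """Disable the Cognito user behind each TTL-deleted stream record.

    Hypothetical helper. With a batch size of 25, records holds at most
    25 entries, so one invocation makes at most 25 AdminDisableUser calls.
    """
    disabled = []
    for record in records:
        # Assumes the stream view type includes the deleted item's old image.
        old_image = record["dynamodb"]["OldImage"]
        username = old_image["username"]["S"]
        user_pool_id = old_image["userpool_id"]["S"]
        cognito_client.admin_disable_user(
            UserPoolId=user_pool_id, Username=username
        )
        disabled.append((user_pool_id, username))
    return disabled
```

Passing the client in as a parameter keeps the function easy to test with a stub and leaves client creation to the Lambda handler.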

Dead-letter queues:

Throughout the architecture, dead-letter queues (DLQs) are used to handle message processing failures gracefully. They make sure that unprocessed records aren’t lost but instead are queued for further inspection and retry.

Other considerations

The following considerations are important for scaling the solution in complex environments and maintaining its integrity. The ability to scale and manage the increased complexity is crucial for successful adoption of the solution.

Multi-user pool and multi-account deployment

While this post discussed a single Amazon Cognito user pool in a single AWS account, the solution can also function in environments with multiple user pools. This involves deploying the solution and integrating with each user pool as described in Integrate with your existing user pool. Because the AdminDisableUser API’s quota limits the maximum volume of requests in one AWS Region in one AWS account, consider deploying the solution separately in each Region in each AWS account to stay within the API limits.

Efficient processing with Amazon SQS:

Consider using Amazon Simple Queue Service (Amazon SQS) to add a queue between the PostAuthProcessorLambda function and the LatestPostAuthRecordsDDB DynamoDB table to optimize processing. This approach decouples user sign-in actions from DynamoDB writes, and allows for batching writes to DynamoDB, reducing the number of write requests.
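
The batching step could be sketched as follows, assuming a consumer that receives several queued sign-in records at once. The function names are hypothetical. The sketch keeps only the latest record per sub, because a single DynamoDB BatchWriteItem request can't contain two operations on the same key, and it splits the result into groups of 25, the maximum number of put or delete requests per BatchWriteItem call.

```python
def latest_per_user(items):
    """Keep only the most recent sign-in record for each sub.

    Assumes every timestamp uses the same UTC ISO 8601 format, so that
    string comparison matches chronological order.
    """
    latest = {}
    for item in items:
        current = latest.get(item["sub"])
        if current is None or item["timestamp"] > current["timestamp"]:
            latest[item["sub"]] = item
    return list(latest.values())


def chunk(items, size=25):
    """Split items into lists of at most `size` entries."""
    return [items[i:i + size] for i in range(0, len(items), size)]
```

Each chunk would then become one batch write, so a burst of sign-ins from the same users results in fewer write requests than one put per sign-in.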

Clean up

Avoid unwanted charges by cleaning up the resources you’ve created. To decommission the solution, follow these steps:

Remove the Lambda trigger from the Amazon Cognito user pool:
1. Navigate to the Amazon Cognito console.
2. Select the user pool you have been working with.
3. Go to the Triggers section within the user pool settings.
4. Manually remove the association of the Lambda function with the user pool events.
Remove the CloudFormation stack:
1. Open the CloudFormation console.
2. Locate and select the CloudFormation stack that was used to deploy the solution.
3. Delete the stack.
4. CloudFormation will automatically remove the resources created by this stack, including Lambda functions, Amazon SQS queues, and DynamoDB tables.

Conclusion

In this post, we walked you through a solution to identify and disable stale user accounts based on periods of inactivity. While the example focuses on a single Amazon Cognito user pool, the approach can be adapted for more complex environments with multiple user pools across multiple accounts. For examples of Amazon Cognito architectures, see the AWS Architecture Blog.

Proper planning is essential for seamless integration with your existing infrastructure. Carefully consider factors such as your security environment, compliance needs, and user pool configurations. You can modify this solution to suit your specific use case.

Maintaining clean and active user pools is an ongoing journey. Continue monitoring your systems, optimizing configurations, and keeping up-to-date on new features. Combined with well-architected preventive measures, automated user management systems provide strong defenses for your applications and data.

For further reading, see the AWS Well-Architected Security Pillar and more posts like this one on the AWS Security Blog.

If you have feedback about this post, submit comments in the Comments section. If you have questions about this post, start a new thread on the Amazon Cognito re:Post forum or contact AWS Support.

AWS Weekly Roundup: Amazon EC2 G6 instances, Mistral Large on Amazon Bedrock, AWS Deadline Cloud, and more (April 8, 2024)

2024-04-08 Donnie Prakoso

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-mistral-large-aws-clean-rooms-ml-aws-deadline-cloud-and-more-april-8-2024/

We’re just two days away from AWS Summit Sydney (April 10–11) and a month away from the AWS Summit season in Southeast Asia, starting with the AWS Summit Singapore (May 7) and the AWS Summit Bangkok (May 30). If you happen to be in Sydney, Singapore, or Bangkok around those dates, please join us.

Last Week’s Launches
If you haven’t read last week’s Weekly Roundup yet, Channy wrote about the AWS Chips Taste Test, a new initiative from Jeff Barr as part of April Fools’ Day.

Here are some launches that caught my attention last week:

New Amazon EC2 G6 instances — We announced the general availability of Amazon EC2 G6 instances powered by NVIDIA L4 Tensor Core GPUs. G6 instances can be used for a wide range of graphics-intensive and machine learning use cases. G6 instances deliver up to 2x higher performance for deep learning inference and graphics workloads compared to Amazon EC2 G4dn instances. To learn more, visit the Amazon EC2 G6 instance page.

Mistral Large is now available in Amazon Bedrock — Veliswa wrote about the availability of the Mistral Large foundation model, as part of the Amazon Bedrock service. You can use Mistral Large to handle complex tasks that require substantial reasoning capabilities. In addition, Amazon Bedrock is now available in the Paris AWS Region.

Amazon Aurora zero-ETL integration with Amazon Redshift now in additional Regions — Zero-ETL integration announcements were my favourite launches last year. This zero-ETL integration simplifies the process of transferring data between the two services, allowing customers to move data between Amazon Aurora and Amazon Redshift without the need for manual Extract, Transform, and Load (ETL) processes. With this announcement, zero-ETL integrations between Amazon Aurora and Amazon Redshift are now supported in 11 additional Regions.

Announcing AWS Deadline Cloud — If you’re working in films, TV shows, commercials, games, and industrial design and handling complex rendering management for teams creating 2D and 3D visual assets, then you’ll be excited about AWS Deadline Cloud. This new managed service simplifies the deployment and management of render farms for media and entertainment workloads.

AWS Clean Rooms ML is Now Generally Available — Last year, I wrote about the preview of AWS Clean Rooms ML. In that post, I elaborated on a new capability of AWS Clean Rooms that helps you and your partners apply machine learning (ML) models on your collective data without copying or sharing raw data with each other. Now, AWS Clean Rooms ML is available for you to use.

Knowledge Bases for Amazon Bedrock now supports private network policies for OpenSearch Serverless — Here’s exciting news for those of you who are building with Amazon Bedrock. Now, you can implement Retrieval-Augmented Generation (RAG) with Knowledge Bases for Amazon Bedrock using Amazon OpenSearch Serverless (OSS) collections that have a private network policy.

Amazon EKS extended support for Kubernetes versions now generally available — If you’re running Kubernetes version 1.21 and higher, with this Extended Support for Kubernetes, you can stay up-to-date with the latest Kubernetes features and security improvements on Amazon EKS.

AWS Lambda Adds Support for Ruby 3.3 — Coding in Ruby? Now, AWS Lambda supports Ruby 3.3 as its runtime. This update allows you to take advantage of the latest features and improvements in the Ruby language.

Amazon EventBridge Console Enhancements — The Amazon EventBridge console has been updated with new features and improvements, making it easier for you to manage your event-driven applications with a better user experience.

Private Access to the AWS Management Console in Commercial Regions — If you need to restrict access to personal AWS accounts from the company network, you can use AWS Management Console Private Access. With this launch, you can use AWS Management Console Private Access in all commercial AWS Regions.

From community.aws
community.aws is a home for us builders to share what we learn from building on AWS. Here are my top 3 posts from last week:

Other AWS News
Here are some additional news items, open-source projects, and Twitch shows that you might find interesting:

Build On Generative AI – Join Tiffany and Darko to learn more about generative AI, see their demos and discuss different aspects of generative AI with the guest speakers. Streaming every Monday on Twitch, 9:00 AM US PT.

AWS open source news and updates – If you’re looking for various open-source projects and tools from the AWS community, please read the AWS open-source newsletter maintained by my colleague, Ricardo.

Upcoming AWS events
Check your calendars and sign up for these AWS events:

AWS Summits – Join free online and in-person events that bring the cloud computing community together to connect, collaborate, and learn about AWS. Register in your nearest city: Amsterdam (April 9), Sydney (April 10–11), London (April 24), Singapore (May 7), Berlin (May 15–16), Seoul (May 16–17), Hong Kong (May 22), Milan (May 23), Dubai (May 29), Thailand (May 30), Stockholm (June 4), and Madrid (June 5).

AWS re:Inforce – Explore cloud security in the age of generative AI at AWS re:Inforce, June 10–12 in Pennsylvania for two-and-a-half days of immersive cloud security learning designed to help drive your business initiatives.

AWS Community Days – Join community-led conferences that feature technical discussions, workshops, and hands-on labs led by expert AWS users and industry leaders from around the world: Poland (April 11), Bay Area (April 12), Kenya (April 20), and Turkey (May 18).

You can browse all upcoming in-person and virtual events.

That’s all for this week. Check back next Monday for another Weekly Roundup!

— Donnie

This post is part of our Weekly Roundup series. Check back each week for a quick roundup of interesting news and announcements from AWS!

Serverless ICYMI Q1 2024

2024-04-01 Julian Wood

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/serverless-icymi-q1-2024/

Welcome to the 25th edition of the AWS Serverless ICYMI (in case you missed it) quarterly recap. Every quarter, we share all the most recent product launches, feature enhancements, blog posts, webinars, live streams, and other interesting things that you might have missed!

In case you missed our last ICYMI, check out what happened last quarter here.

2024 Q1 calendar

Adobe Summit

At the Adobe Summit, the AWS Serverless Developer Advocacy team showcased a solution developed for the NFL using AWS serverless technologies and Adobe Photoshop APIs. The system automates image processing tasks, including background removal and dynamic resizing, by integrating AWS Step Functions, AWS Lambda, Amazon EventBridge, and AI/ML capabilities via Amazon Rekognition. This solution reduced image processing time from weeks to minutes and saved the NFL significant costs. Combining cloud-based serverless architectures with advanced machine learning and API technologies can optimize digital workflows for cost-effective and agile digital asset management.

Adobe Summit ServerlessVideo

ServerlessVideo is a demo application to stream live videos and also perform advanced post-video processing. It uses several AWS services, including Step Functions, Lambda, EventBridge, Amazon ECS, and Amazon Bedrock in a serverless architecture that makes it fast, flexible, and cost-effective. The team used ServerlessVideo to interview attendees about the conference experience and Adobe and partners about how they use Adobe. Learn more about the project and watch videos from Adobe Summit 2024 at video.serverlessland.com.

AWS Lambda

AWS launched support for the latest long-term support release of .NET 8, which includes API enhancements, improved Native Ahead of Time (Native AOT) support, and improved performance.

AWS Lambda .NET 8

Learn how to compare design approaches for building serverless microservices. This post covers the trade-offs to consider with various application architectures. See how you can apply single responsibility, Lambda-lith, and read and write functions.

The AWS Serverless Java Container has been updated. This makes it easier to modernize a legacy Java application written with frameworks such as Spring, Spring Boot, or JAX-RS/Jersey in Lambda with minimal code changes.

AWS Serverless Java Container

Lambda has improved the responsiveness for configuring Event Source Mappings (ESMs) and Amazon EventBridge Pipes with event sources such as self-managed Apache Kafka, Amazon Managed Streaming for Apache Kafka (MSK), Amazon DocumentDB, and Amazon MQ.

Chaos engineering is a popular practice for building confidence in system resilience. However, many existing tools assume the ability to alter infrastructure configurations, and cannot be easily applied to the serverless application paradigm. You can use the AWS Fault Injection Service (FIS) to automate and manage chaos experiments across different Lambda functions to provide a reusable testing method.

Amazon ECS and AWS Fargate

Amazon Elastic Container Service (Amazon ECS) now provides managed instance draining as a built-in feature of Amazon ECS capacity providers. This allows Amazon ECS to safely and automatically drain tasks from Amazon Elastic Compute Cloud (Amazon EC2) instances that are part of an Amazon EC2 Auto Scaling Group associated with an Amazon ECS capacity provider. This simplification allows you to remove custom lifecycle hooks previously used to drain Amazon EC2 instances. You can now perform infrastructure updates such as rolling out a new version of the ECS agent by seamlessly using Auto Scaling Group instance refresh, with Amazon ECS ensuring workloads are not interrupted.

Credentials Fetcher makes it easier to run containers that depend on Windows authentication when using Amazon EC2. Credentials Fetcher now integrates with Amazon ECS, using either the Amazon EC2 launch type, or AWS Fargate serverless compute launch type.

Amazon ECS Service Connect is a networking capability to simplify service discovery, connectivity, and traffic observability for Amazon ECS. You can now more easily integrate certificate management to encrypt service-to-service communication using Transport Layer Security (TLS). You do not need to modify your application code, add additional network infrastructure, or operate service mesh solutions.

Amazon ECS Service Connect

Running distributed machine learning (ML) workloads on Amazon ECS allows ML teams to focus on creating, training and deploying models, rather than spending time managing the container orchestration engine. Amazon ECS provides a great environment to run ML projects as it supports workloads that use NVIDIA GPUs and provides optimized images with pre-installed NVIDIA Kernel drivers and Docker runtime.

See how to build preview environments for Amazon ECS applications with AWS Copilot. AWS Copilot is an open source command line interface that makes it easier to build, release, and operate production ready containerized applications.

Learn techniques for automatic scaling of your Amazon Elastic Container Service (Amazon ECS) container workloads to enhance the end user experience. This post explains how to use AWS Application Auto Scaling which helps you configure automatic scaling of your Amazon ECS service. You can also use Amazon ECS Service Connect and AWS Distro for OpenTelemetry (ADOT) in Application Auto Scaling.

AWS Step Functions

AWS workloads sometimes require access to data stored in on-premises databases and storage locations. Traditional solutions to establish connectivity to the on-premises resources require inbound rules to firewalls, a VPN tunnel, or public endpoints. Discover how to use the MQTT protocol (AWS IoT Core) with AWS Step Functions to dispatch jobs to on-premises workers to access or retrieve data stored on-premises.

You can use Step Functions to orchestrate many business processes. Many industries are required to provide audit trails for decision and transactional systems. Learn how to build a serverless pipeline to create a reliable, performant, traceable, and durable pipeline for audit processing.

Amazon EventBridge

Amazon EventBridge now supports publishing events to AWS AppSync GraphQL APIs as native targets. The new integration allows you to publish events easily to a wider variety of consumers and simplifies updating clients with near real-time data.

Amazon EventBridge publishing events to AWS AppSync

Discover how to send and receive CloudEvents with EventBridge. CloudEvents is an open-source specification for describing event data in a common way. You can publish CloudEvents directly to EventBridge, filter and route them, and use input transformers and API Destinations to send CloudEvents to downstream AWS services and third-party APIs.

AWS Application Composer

AWS Application Composer lets you create infrastructure as code templates by dragging and dropping cards on a virtual canvas. These represent CloudFormation resources, which you can wire together to create permissions and references. Application Composer has now expanded to the VS Code IDE as part of the AWS Toolkit. This now includes a generative AI partner that helps you write infrastructure as code (IaC) for all 1100+ AWS CloudFormation resources that Application Composer now supports.

AWS AppComposer generate suggestions

Amazon API Gateway

Learn how to consume private Amazon API Gateway APIs using mutual TLS (mTLS). mTLS helps prevent man-in-the-middle attacks and protects against threats such as impersonation attempts, data interception, and tampering.

Serverless at AWS re:Invent

Visit the Serverless Land YouTube channel to find a list of serverless and serverless container sessions from re:Invent 2023. Hear from experts like Chris Munns and Julian Wood in their popular session, Best practices for serverless developers, or Nathan Peck and Jessica Deen in Deploying multi-tenant SaaS applications on Amazon ECS and AWS Fargate.