Tag Archives: Amazon Simple Storage Service (S3)

Amazon FSx for OpenZFS now supports Amazon S3 access without any data movement

Post Syndicated from Elizabeth Fuentes original https://aws.amazon.com/blogs/aws/amazon-fsx-for-openzfs-now-supports-amazon-s3-access-without-any-data-movement/

Starting today, you can attach Amazon S3 Access Points to your Amazon FSx for OpenZFS file systems to access your file data as if it were in Amazon Simple Storage Service (Amazon S3). With this new capability, your data in FSx for OpenZFS is accessible for use with a broad range of Amazon Web Services (AWS) services and applications for artificial intelligence, machine learning (ML), and analytics that work with S3. Your file data continues to reside in your FSx for OpenZFS file system.

Organizations store hundreds of exabytes of file data on premises and want to move this data to AWS for greater agility, reliability, security, scalability, and reduced costs. Once their file data is in AWS, organizations often want to do even more with it. For example, they want to use their enterprise data to augment generative AI applications and build and train machine learning models with the broad spectrum of AWS generative AI and machine learning services. They also want the flexibility to use their file data with new AWS applications. However, many AWS data analytics services and applications are built to work with data stored in Amazon S3 as data lakes. After migration, they can use tools that work with Amazon S3 as their data source. Previously, this required data pipelines to copy data between Amazon FSx for OpenZFS file systems and Amazon S3 buckets.

Amazon S3 Access Points attached to FSx for OpenZFS file systems remove data movement and copying requirements by maintaining unified access through both file protocols and Amazon S3 API operations. You can read and write file data using S3 object operations including GetObject, PutObject, and ListObjectsV2. You can attach hundreds of access points to a file system, with each S3 access point configured with application-specific permissions. These access points support the same granular permissions controls as S3 access points that attach to S3 buckets, including AWS Identity and Access Management (IAM) access point policies, Block Public Access, and network origin controls such as restricting access to your Virtual Private Cloud (VPC). Because your data continues to reside in your FSx for OpenZFS file system, you continue to access your data using Network File System (NFS) and benefit from existing data management capabilities.

You can use your file data in Amazon FSx for OpenZFS file systems to power generative AI applications with Amazon Bedrock for Retrieval Augmented Generation (RAG) workflows, train ML models with Amazon SageMaker, and run analytics or business intelligence (BI) with Amazon Athena and AWS Glue as if the data were in S3, using the S3 API. You can also generate insights using open source tools such as Apache Spark and Apache Hive, without moving or refactoring your data.

To get started
You can create and attach an S3 Access Point to your Amazon FSx for OpenZFS file system using the Amazon FSx console, the AWS Command Line Interface (AWS CLI), or the AWS SDK.

To start, follow the steps in the Amazon FSx for OpenZFS documentation to create the file system. Then, in the Amazon FSx console, go to Actions and select Create S3 access point. Keep the default configuration and create the access point.

To monitor the creation progress, you can go to the Amazon FSx console.

Once available, choose the name of the new S3 access point and review the access point summary. This summary includes an automatically generated alias that works anywhere you would normally use S3 bucket names.

Using the bucket-style alias, you can access the FSx data directly through S3 API operations.

  • List objects using the ListObjectsV2 API

  • Get files using the GetObject API

  • Write data using the PutObject API

The data continues to be accessible via NFS.
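The three operations above can be sketched in Python as follows. This is a minimal illustration, not the post's own code: it assumes `s3_client` behaves like a boto3 S3 client (for example `boto3.client("s3")`), and the helper names are hypothetical. The access point alias is passed wherever a bucket name would normally go.

```python
def list_keys(s3_client, alias, prefix=""):
    """Return every object key under `prefix`, following ListObjectsV2 pagination."""
    keys = []
    kwargs = {"Bucket": alias, "Prefix": prefix}
    while True:
        page = s3_client.list_objects_v2(**kwargs)
        # "Contents" is absent when a page has no matching keys.
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
        if not page.get("IsTruncated"):
            return keys
        kwargs["ContinuationToken"] = page["NextContinuationToken"]


def read_text(s3_client, alias, key, encoding="utf-8"):
    """Fetch one file through GetObject and decode it as text."""
    body = s3_client.get_object(Bucket=alias, Key=key)["Body"]
    return body.read().decode(encoding)


def write_text(s3_client, alias, key, text, encoding="utf-8"):
    """Write one file through PutObject."""
    s3_client.put_object(Bucket=alias, Key=key, Body=text.encode(encoding))
```

Passing the client in as a parameter keeps the helpers independent of how credentials and Regions are configured, and makes them easy to exercise against a stub.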

Beyond accessing your FSx data through the S3 API, you can work with your data using the broad range of AI, ML, and analytics services that work with data in S3. For example, I built an Amazon Bedrock Knowledge Base using PDFs containing airline customer service information from my travel support application repository, WhatsApp-Powered RAG Travel Support Agent: Elevating Customer Experience with PostgreSQL Knowledge Retrieval, as the data source.

To create the Amazon Bedrock Knowledge Base, I followed the connection steps in the Connect to Amazon S3 for your knowledge base user guide. I chose Amazon S3 as the data source, entered my S3 access point alias as the S3 source, then configured and created the knowledge base.

Once the knowledge base is synchronized, I can see all documents and the Document source as S3.

Finally, I ran queries against the knowledge base and verified that it successfully used the file data from my Amazon FSx for OpenZFS file system to provide contextual answers, demonstrating seamless integration without data movement.

Things to know
Integration and access control – Amazon S3 Access Points for Amazon FSx for OpenZFS file systems support standard S3 API operations (such as GetObject, ListObjectsV2, PutObject) through the S3 endpoint, with granular access controls through AWS Identity and Access Management (IAM) permissions and file system user authentication. Your S3 Access Point includes an automatically generated access point alias that you can use wherever an S3 bucket name is expected, and public access is blocked by default for Amazon FSx resources.

Data management – Your data stays in your Amazon FSx for OpenZFS file system while becoming accessible as if it were in Amazon S3, eliminating the need for data movement or copies, with file data remaining accessible through NFS file protocols.

Performance – Amazon S3 Access Points for Amazon FSx for OpenZFS file systems deliver first-byte latency in the tens of milliseconds range, consistent with S3 bucket access. Performance scales with your Amazon FSx file system’s provisioned throughput, with maximum throughput determined by your underlying FSx file system configuration.

Pricing – You’re billed by Amazon S3 for the requests and data transfer costs through your S3 Access Point, in addition to your standard Amazon FSx charges. Learn more on the Amazon FSx for OpenZFS pricing page.

You can get started today using the Amazon FSx console, AWS CLI, or AWS SDK to attach Amazon S3 Access Points to your Amazon FSx for OpenZFS file systems. The feature is available in the following AWS Regions: US East (N. Virginia, Ohio), US West (Oregon), Europe (Frankfurt, Ireland, Stockholm), and Asia Pacific (Hong Kong, Singapore, Sydney, Tokyo).

— Eli

AWS Weekly Roundup: re:Inforce re:Cap, Valkey GLIDE 2.0, Avro and Protobuf or MCP Servers on Lambda, and more (June 23, 2025)

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-reinforce-recap-valkey-glide-2-0-avro-and-protobuf-or-mcp-servers-on-lambda-and-more-june-23-2025/

Last week’s hallmark event was the security-focused AWS re:Inforce conference.


AWS re:Inforce 2025

Now a tradition, the blog team wrote a re:Cap post to summarize the announcements and link to some of the top blog posts.

To further summarize, several new security innovations were announced, including enhanced IAM Access Analyzer capabilities, MFA enforcement for root users, and threat intelligence integration with AWS Network Firewall. Other notable updates include exportable public SSL/TLS certificates from AWS Certificate Manager, a simplified AWS WAF console experience, and a new AWS Shield feature for proactive network security (in preview). Additionally, AWS Security Hub has been enhanced for risk prioritization (Preview), and Amazon GuardDuty now supports Amazon EKS clusters.

But my favorite announcement came from the Amazon Verified Permissions team. They released an open source package for Express.js, enabling developers to implement external fine-grained authorization for web application APIs. This simplifies authorization integration, reducing code complexity and improving application security.

The team also published a blog post that outlines how to create a Verified Permissions policy store, add Cedar and Verified Permissions authorization middleware to your app, create and deploy a Cedar schema, and create and deploy Cedar policies. The Cedar schema is generated from an OpenAPI specification and formatted for use with the AWS Command Line Interface (CLI).

Let’s look at last week’s other new announcements.

Last week’s launches
Apart from re:Inforce, here are the launches that got my attention.

Kafka customers use Avro and Protobuf formats for efficient data storage, fast serialization and deserialization, schema evolution support, and interoperability between different programming languages. They use schema registries to manage, evolve, and validate schemas before data enters processing pipelines. Previously, you had to write custom code within your Lambda function to validate, deserialize, and filter events when using these data formats. With this launch, Lambda natively supports Avro and Protobuf, as well as integration with AWS Glue Schema Registry (GSR), Confluent Cloud Schema Registry (CCSR), and self-managed Confluent Schema Registry (SCSR). This enables you to process your Kafka events using these data formats without writing custom code. Additionally, you can optimize costs through event filtering to prevent unnecessary function invocations.

  • Amazon S3 Express One Zone now supports atomic renaming of objects with a single API call – The RenameObject API simplifies data management in S3 directory buckets by transforming a multi-step rename operation into a single API call. You can now rename objects in S3 Express One Zone by specifying an existing object’s name as the source and the new name as the destination within the same S3 directory bucket. With no data movement involved, this capability accelerates applications like log file management, media processing, and data analytics, while also lowering costs. For instance, renaming a 1-terabyte log file can now complete in milliseconds instead of hours.
  • Valkey introduces GLIDE 2.0 with support for Go, OpenTelemetry, and pipeline batching – AWS, in partnership with Google and the Valkey community, announces the general availability of General Language Independent Driver for the Enterprise (GLIDE) 2.0. This is the latest release of one of AWS’s official open-source Valkey client libraries. Valkey, the most permissive open-source alternative to Redis, is stewarded by the Linux Foundation and will always remain open-source. Valkey GLIDE is a reliable, high-performance, multi-language client that supports all Valkey commands.

GLIDE 2.0 introduces new capabilities that expand developer support, improve observability, and optimize performance for high-throughput workloads. Valkey GLIDE 2.0 extends its multi-language support to Go (contributed by Google), joining Java, Python, and Node.js to provide a consistent, fully compatible API experience across all four languages. More language support is on the way. With this release, Valkey GLIDE now supports OpenTelemetry, an open-source, vendor-neutral framework that enables developers to generate, collect, and export telemetry data and critical client-side performance insights. Additionally, GLIDE 2.0 introduces batching capabilities, reducing network overhead and latency for high-frequency use cases by allowing multiple commands to be grouped and executed as a single operation.

You can discover more about Valkey GLIDE in this recent episode of the AWS Developers Podcast: Inside Valkey GLIDE: building a next-gen Valkey client library with Rust.

Podcast episode on Valkey Glide

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Some other reading
My Belgian compatriot Alexis has written the first article of a two-part series explaining how to develop an MCP Tool server with a streamable HTTP transport and deploy it on Lambda and API Gateway. This is a must-read for anyone implementing MCP servers on AWS. I’m eagerly looking forward to the second part, where Alexis will discuss authentication and authorization for remote MCP servers.

Other AWS events
Check your calendar and sign up for upcoming AWS events.

AWS GenAI Lofts are collaborative spaces and immersive experiences that showcase AWS expertise in cloud computing and AI. They provide startups and developers with hands-on access to AI products and services, exclusive sessions with industry leaders, and valuable networking opportunities with investors and peers. Find a GenAI Loft location near you and don’t forget to register.

AWS Summits are free online and in-person events that bring the cloud computing community together to connect, collaborate, and learn about AWS. Register in your nearest city: Japan (this week, June 25–26), online in India (June 26), and New York City (July 16).

Save the date for these upcoming Summits in July and August: Taipei (July 29), Jakarta (August 7), Mexico (August 8), São Paulo (August 13), and Johannesburg (August 20) (and more to come in September and October).

Browse all upcoming AWS-led in-person and virtual events here.

That’s all for this week. Check back next Monday for another Weekly Roundup!

— seb

This post is part of our Weekly Roundup series. Check back each week for a quick roundup of interesting news and announcements from AWS!

Stream data from Amazon MSK to Apache Iceberg tables in Amazon S3 and Amazon S3 Tables using Amazon Data Firehose

Post Syndicated from Pratik Patel original https://aws.amazon.com/blogs/big-data/stream-data-from-amazon-msk-to-apache-iceberg-tables-in-amazon-s3-and-amazon-s3-tables-using-amazon-data-firehose/

In today’s fast-paced, data-driven environment, real-time streaming analytics has become critical for business success. From detecting fraudulent transactions in financial services to monitoring Internet of Things (IoT) sensor data in manufacturing, or tracking user behavior in ecommerce platforms, streaming analytics enables organizations to make split-second decisions and respond to opportunities and threats as they emerge.

Increasingly, organizations are adopting Apache Iceberg, an open source table format that simplifies data processing on large datasets stored in data lakes. Iceberg brings SQL-like familiarity to big data, offering capabilities such as ACID transactions, row-level operations, partition evolution, data versioning, incremental processing, and advanced query scanning. It seamlessly integrates with popular open source big data processing frameworks Apache Spark, Apache Hive, Apache Flink, Presto, and Trino. Amazon Simple Storage Service (Amazon S3) supports Iceberg tables both directly using the Iceberg table format and in Amazon S3 Tables.

Although Amazon Managed Streaming for Apache Kafka (Amazon MSK) provides robust, scalable streaming capabilities for real-time data needs, many customers need to efficiently and seamlessly deliver their streaming data from Amazon MSK to Iceberg tables in Amazon S3 and S3 Tables. This is where Amazon Data Firehose (Firehose) comes in. With its built-in support for Iceberg tables in Amazon S3 and S3 Tables, Firehose makes it possible to seamlessly deliver streaming data from provisioned MSK clusters to Iceberg tables in Amazon S3 and S3 Tables.

As a fully managed extract, transform, and load (ETL) service, Firehose reads data from your Apache Kafka topics, transforms the records, and writes them directly to Iceberg tables in your data lake in Amazon S3. This new capability requires no code or infrastructure management on your part, allowing for continuous, efficient data loading from Amazon MSK to Iceberg in Amazon S3.

In this post, we walk through two solutions that demonstrate how to stream data from your Amazon MSK provisioned cluster to Iceberg-based data lakes in Amazon S3 using Firehose.

Solution 1 overview: Amazon MSK to Iceberg tables in Amazon S3

The following diagram illustrates the high-level architecture to deliver streaming messages from Amazon MSK to Iceberg tables in Amazon S3.

Prerequisites

To follow the tutorial in this post, you need the following prerequisites:

Verify permission

Before configuring the Firehose delivery stream, you must verify that the destination table is available in the Data Catalog.

  1. On the AWS Glue console, go to Glue Data Catalog and verify the Iceberg table is available with the required attributes.
  2. Verify your Amazon MSK provisioned cluster is up and running with IAM authentication, and multi-VPC connectivity is enabled for it.
  3. Grant Firehose access to your private MSK cluster:
    1. On the Amazon MSK console, go to the cluster and choose Properties and Security settings.
    2. Edit the cluster policy and define a policy similar to the following example:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Principal": {
        "Service": [
          "firehose.amazonaws.com"
        ]
      },
      "Effect": "Allow",
      "Action": [
        "kafka:CreateVpcConnection"
      ],
      "Resource": "<Amazon MSK cluster-arn>"
    }
  ]
}

This ensures Firehose has the necessary permissions on the source Amazon MSK provisioned cluster.
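If you prefer to apply the cluster policy programmatically rather than through the console, the following is a minimal sketch. It assumes `kafka_client` behaves like a boto3 Amazon MSK client (`boto3.client("kafka")`) exposing `put_cluster_policy`; the helper names are hypothetical, and the policy body mirrors the example above.

```python
import json


def build_cluster_policy(cluster_arn):
    """Build the cluster policy that lets Firehose create a VPC connection."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Principal": {"Service": ["firehose.amazonaws.com"]},
                "Effect": "Allow",
                "Action": ["kafka:CreateVpcConnection"],
                "Resource": cluster_arn,
            }
        ],
    }


def apply_cluster_policy(kafka_client, cluster_arn):
    """Attach the policy to the cluster; the API expects the policy as a JSON string."""
    policy = build_cluster_policy(cluster_arn)
    kafka_client.put_cluster_policy(ClusterArn=cluster_arn, Policy=json.dumps(policy))
    return policy
```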

Create a Firehose role

This section describes the permissions that grant Firehose access to ingest, process, and deliver data from source to destination. You must specify an IAM role that grants Firehose permissions to ingest source data from the specified Amazon MSK provisioned cluster. Make sure that the following trust policies are attached to that role so that Firehose can assume it:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Principal": {
        "Service": [
          "firehose.amazonaws.com"
        ]
      },
      "Effect": "Allow",
      "Action": "sts:AssumeRole"
    }
  ]
}

Make sure that this role grants Firehose the following permissions to ingest source data from the specified Amazon MSK provisioned cluster:

{
   "Version": "2012-10-17",      
   "Statement": [{
        "Effect":"Allow",
        "Action": [
           "kafka:GetBootstrapBrokers",
           "kafka:DescribeCluster",
           "kafka:DescribeClusterV2",
           "kafka-cluster:Connect"
         ],
         "Resource": "<CLUSTER-ARN>"
       },
       {
         "Effect":"Allow",
         "Action": [
           "kafka-cluster:DescribeTopic",
           "kafka-cluster:DescribeTopicDynamicConfiguration",
           "kafka-cluster:ReadData"
         ],
         "Resource": "<TOPIC-ARN>"
       }]
}
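The `<CLUSTER-ARN>` and `<TOPIC-ARN>` placeholders are related: as a sketch, and assuming the usual Amazon MSK ARN layout (`...:cluster/<name>/<uuid>` for clusters and `...:topic/<name>/<uuid>/<topic>` for topics), a topic ARN can be derived from the cluster ARN. The function names here are hypothetical.

```python
def topic_arn(cluster_arn, topic):
    """Derive a topic ARN from a cluster ARN (assumes the standard MSK ARN layout)."""
    prefix, sep, rest = cluster_arn.partition(":cluster/")
    if not sep:
        raise ValueError(f"not an MSK cluster ARN: {cluster_arn!r}")
    return f"{prefix}:topic/{rest}/{topic}"


def source_policy(cluster_arn, topics):
    """Build the source-side permissions shown above for one cluster and its topics."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "kafka:GetBootstrapBrokers",
                    "kafka:DescribeCluster",
                    "kafka:DescribeClusterV2",
                    "kafka-cluster:Connect",
                ],
                "Resource": cluster_arn,
            },
            {
                "Effect": "Allow",
                "Action": [
                    "kafka-cluster:DescribeTopic",
                    "kafka-cluster:DescribeTopicDynamicConfiguration",
                    "kafka-cluster:ReadData",
                ],
                "Resource": [topic_arn(cluster_arn, t) for t in topics],
            },
        ],
    }
```

Serializing the result with `json.dumps(source_policy(...), indent=2)` gives a document you can attach to the Firehose role.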

Make sure the Firehose role has permissions to the Glue Data Catalog and S3 bucket:

{
    "Version": "2012-10-17",  
    "Statement":
    [    
        {      
            "Effect": "Allow",      
            "Action": [
                "glue:GetTable",
                "glue:GetDatabase",
                "glue:UpdateTable"
            ],      
            "Resource": [   
                "arn:aws:glue:<region>:<aws-account-id>:catalog",
                "arn:aws:glue:<region>:<aws-account-id>:database/*",
                "arn:aws:glue:<region>:<aws-account-id>:table/*/*"             
            ]    
        },        
        {      
            "Effect": "Allow",      
            "Action": [
                "s3:AbortMultipartUpload",
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:ListBucketMultipartUploads",
                "s3:PutObject",
                "s3:DeleteObject"
            ],      
            "Resource": [   
                "arn:aws:s3:::<S3 bucket name>",
                "arn:aws:s3:::<S3 bucket name>/*"              
            ]    
        } 
    ]
}    
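The policy above uses wildcards for the database and table. If you want to scope it to a single destination table, one possible sketch is shown below; the function name and parameters are hypothetical, and the ARN shapes follow the Glue and S3 patterns already used in the policy.

```python
def destination_policy(region, account_id, database, table, bucket):
    """Build Glue and S3 permissions scoped to one Iceberg table and one bucket."""
    glue = f"arn:aws:glue:{region}:{account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["glue:GetTable", "glue:GetDatabase", "glue:UpdateTable"],
                "Resource": [
                    f"{glue}:catalog",
                    f"{glue}:database/{database}",
                    f"{glue}:table/{database}/{table}",
                ],
            },
            {
                "Effect": "Allow",
                "Action": [
                    "s3:AbortMultipartUpload",
                    "s3:GetBucketLocation",
                    "s3:GetObject",
                    "s3:ListBucket",
                    "s3:ListBucketMultipartUploads",
                    "s3:PutObject",
                    "s3:DeleteObject",
                ],
                # Bucket-level and object-level actions need both ARN forms.
                "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }
```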

For detailed policies, refer to the following resources:

Now that you have verified that your source MSK cluster and destination Iceberg table are available, you’re ready to set up Firehose to deliver streaming data to the Iceberg tables in Amazon S3.

Create a Firehose stream

Complete the following steps to create a Firehose stream:

  1. On the Firehose console, choose Create Firehose stream.
  2. Choose Amazon MSK for Source and Apache Iceberg Tables for Destination.
  3. Provide a Firehose stream name and specify the cluster configurations.
  4. Choose an MSK cluster in the current account or another account.
  5. Make sure the cluster you choose is in the active state, with IAM as one of its access control methods and multi-VPC connectivity enabled.
  6. Provide the MSK topic name from which Firehose will read the data.
  7. Enter the Firehose stream name.
  8. Enter the destination settings, where you can opt to send data in the current account or across accounts.
  9. Select the account location as Current account, choose an appropriate AWS Region, and for Catalog, choose the current account ID.

To route streaming data to different Iceberg tables and perform operations such as insert, update, and delete, you can use Firehose JQ expressions. For details, refer to the Firehose documentation on routing incoming records to Iceberg tables.

  10. Provide the unique key configuration, which makes it possible to perform update and delete actions on your data.
  11. Go to Buffer hints and configure Buffer size to 1 MiB and Buffer interval to 60 seconds. You can tune these settings according to your use case needs.
  12. Configure your backup settings by providing an S3 backup bucket.

With Firehose, you can configure backup settings by specifying an S3 backup bucket with custom prefixes like error, so failed records are automatically preserved and accessible for troubleshooting and reprocessing.

  13. Under Advanced settings, enable Amazon CloudWatch error logging.
  14. Under Service access, choose the IAM role you created earlier for Firehose.
  15. Verify your configurations and choose Create Firehose stream.

The Firehose stream will be available and it will stream data from the MSK topic to the Iceberg table in Amazon S3.
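The buffer hints chosen in the steps above (1 MiB and 60 seconds) determine how soon data lands in the table. As a rough way to reason about them, the sketch below assumes an idealized model in which a batch is delivered when either the buffer size or the buffer interval is reached, whichever comes first, at a steady ingest rate. The function name is hypothetical, and real delivery timing can differ.

```python
def seconds_until_flush(bytes_per_second, buffer_size_mib=1, buffer_interval_s=60):
    """Estimate the time to the next flush under a steady ingest rate."""
    if bytes_per_second <= 0:
        # Nothing is filling the buffer, so only the interval can trigger a flush.
        return float(buffer_interval_s)
    fill_time = buffer_size_mib * 1024 * 1024 / bytes_per_second
    return min(fill_time, float(buffer_interval_s))
```

With these defaults, the size limit is the binding constraint only above roughly 17 KiB/s (1 MiB divided by 60 seconds); below that rate, the interval decides when data is delivered.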

You can query the table with Amazon Athena to validate the streaming data.

  1. On the Athena console, open the query editor.
  2. Choose the Iceberg table and run a table preview.

You will be able to access the streaming data in the table.
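The same check can be scripted. The following sketch assumes `athena_client` behaves like a boto3 Athena client (`boto3.client("athena")`); the function name, the polling limits, and the `output_location` S3 URI for query results are hypothetical choices for illustration.

```python
import time


def preview_table(athena_client, database, table, output_location,
                  limit=10, poll_seconds=1.0, max_polls=60):
    """Run a small SELECT against the table and return the raw result rows."""
    query = f'SELECT * FROM "{database}"."{table}" LIMIT {int(limit)}'
    started = athena_client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    for _ in range(max_polls):
        execution = athena_client.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            results = athena_client.get_query_results(QueryExecutionId=query_id)
            # The first row of a SELECT result holds the column headers.
            return results["ResultSet"]["Rows"]
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"query {query_id} ended in state {state}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"query {query_id} did not finish after {max_polls} polls")
```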

Solution 2 overview: Amazon MSK to S3 Tables

S3 Tables is built on Iceberg’s open table format, providing table-like capabilities directly to Amazon S3. You can organize and query data using familiar table semantics while using Iceberg’s features for schema evolution, partition evolution, and time travel. The feature performs ACID-compliant transactions and supports INSERT, UPDATE, and DELETE operations on data in Amazon S3, making data lake management more efficient and reliable.

You can use Firehose to deliver streaming data from an Amazon MSK provisioned cluster to Iceberg tables in Amazon S3. You can create an S3 table bucket using the Amazon S3 console, which registers the bucket with AWS Lake Formation and helps you manage fine-grained access control for your Iceberg-based data lake on S3 Tables. The following diagram illustrates the solution architecture.

Prerequisites

You should have the following prerequisites:

  • An AWS account
  • An active Amazon MSK provisioned cluster with IAM access control authentication enabled and multi-VPC connectivity
  • The Firehose role mentioned earlier with the additional IAM policy:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Statement1",
            "Effect": "Allow",
            "Action": [
                "lakeformation:GetDataAccess"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}

Further, in your Firehose role, add s3tablescatalog as a resource to provide access to S3 Tables.

Create an S3 table bucket

To create an S3 table bucket on the Amazon S3 console, refer to Creating a table bucket.

When you create your first table bucket with the Enable integration option, Amazon S3 attempts to automatically integrate your table bucket with AWS analytics services. This integration makes it possible to use AWS analytics services to query all tables in the current Region, and it is an important step for the rest of the setup. If this integration is already in place, you can create the table bucket using the AWS Command Line Interface (AWS CLI) as follows:

aws s3tables create-table-bucket --region <region id> --name <bucket name>

Create a namespace

An S3 table namespace is a logical construct within an S3 table bucket. Each table belongs to a single namespace. Before creating a table, you must create a namespace to group tables under. You can create a namespace by using the Amazon S3 REST API, AWS SDK, AWS CLI, or integrated query engines.

You can use the following AWS CLI command to create a table namespace:

aws s3tables create-namespace --table-bucket-arn arn:aws:s3tables:us-east-1:111122223333:bucket/amzn-s3-demo-table-bucket --namespace example_namespace

Create a table

An S3 table is a sub-resource of a table bucket. This resource stores S3 tables in Iceberg format so you can work with them using query engines and other applications that support Iceberg. You can create a table with the following AWS CLI command:

aws s3tables create-table --cli-input-json file://mytabledefinition.json

The following code is for mytabledefinition.json:

{
    "tableBucketARN": "arn:aws:s3tables:us-east-1:111122223333:bucket/amzn-s3-demo-table-bucket",
    "namespace": "example_namespace",
    "name": "example_table",
    "format": "ICEBERG",
    "metadata": {
        "iceberg": {
            "schema": {
                "fields": [
                     {"name": "id", "type": "int", "required": true},
                     {"name": "name", "type": "string"},
                     {"name": "value", "type": "int"}
                ]
            }
        }
    }
}
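Stray whitespace or uppercase characters in a namespace or table name are easy to miss in a hand-edited JSON file. The following sketch generates the same definition and checks the names first. The naming rule it enforces (lowercase letters, digits, and underscores, beginning and ending with a letter or digit) is an assumption for illustration, so confirm it against the S3 Tables naming rules; the function name is hypothetical.

```python
import json
import re

# Assumed naming rule; verify against the S3 Tables documentation.
_NAME = re.compile(r"[a-z0-9](?:[a-z0-9_]*[a-z0-9])?")


def table_definition(table_bucket_arn, namespace, name, fields):
    """Return a create-table input document after validating the names."""
    for label, value in (("namespace", namespace), ("name", name)):
        if not _NAME.fullmatch(value):
            raise ValueError(f"invalid {label}: {value!r}")
    return {
        "tableBucketARN": table_bucket_arn,
        "namespace": namespace,
        "name": name,
        "format": "ICEBERG",
        "metadata": {"iceberg": {"schema": {"fields": list(fields)}}},
    }


def write_definition(path, definition):
    """Write the definition to a file usable with --cli-input-json file://<path>."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(definition, handle, indent=4)
```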

Now you have the required table with the relevant attributes available in Lake Formation.

Grant Lake Formation permissions on your table resources

After integration, Lake Formation manages access to your table resources. It uses its own permissions model (Lake Formation permissions) that enables fine-grained access control for Glue Data Catalog resources. To allow Firehose to write data to S3 Tables, you can grant a principal Lake Formation permission on a table in the S3 table bucket, either through the Lake Formation console or AWS CLI. Complete the following steps:

  1. Make sure you’re running AWS CLI commands as a data lake administrator. For more information, see Create a data lake administrator.
  2. Run the following command to grant Lake Formation permissions on the table in the S3 table bucket to an IAM principal (Firehose role) to access the table:
aws lakeformation grant-permissions \
--region <region e.g. us-east-1> \
--cli-input-json \
'{
    "Principal": {
        "DataLakePrincipalIdentifier": "<Amazon Data Firehose role ARN e.g. arn:aws:iam::<account-id>:role/ExampleRole>"
    },
    "Resource": {
        "Table": {
            "CatalogId": "<account-id>:s3tablescatalog/<S3 table bucket name>",
            "DatabaseName": "<S3 table bucket namespace e.g. test_namespace>",
            "Name": "<S3 table bucket table name e.g. test_table>"
        }
    },
    "Permissions": [
        "ALL"
    ]
}'
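The same grant can be issued from Python. This is a sketch under stated assumptions: `lf_client` behaves like a boto3 Lake Formation client (`boto3.client("lakeformation")`), the function name is hypothetical, and the catalog ID is assumed to take the form `<account-id>:s3tablescatalog/<table bucket name>`, with `s3tablescatalog` as a literal segment.

```python
def grant_table_access(lf_client, account_id, table_bucket, namespace, table,
                       role_arn, permissions=("ALL",)):
    """Grant Lake Formation permissions on one S3 Tables table to an IAM role."""
    request = {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {
            "Table": {
                # Assumed catalog ID layout for S3 Tables; see the lead-in.
                "CatalogId": f"{account_id}:s3tablescatalog/{table_bucket}",
                "DatabaseName": namespace,
                "Name": table,
            }
        },
        "Permissions": list(permissions),
    }
    lf_client.grant_permissions(**request)
    return request
```

Granting `ALL` matches the command above; passing a narrower tuple of permissions is an option if the Firehose role does not need full access.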

Set up a Firehose stream to S3 Tables

To set up a Firehose stream to S3 Tables using the Firehose console, complete the following steps:

  1. On the Firehose console, choose Create Firehose stream.
  2. For Source, choose Amazon MSK.
  3. For Destination, choose Apache Iceberg Tables.
  4. Enter a Firehose stream name.
  5. Configure your source settings.
  6. For Destination settings, select Current Account, choose your Region, and enter the name of the table bucket you want to stream into.
  7. Configure the database and table names using Unique Key configuration settings, JSONQuery expressions, or in an AWS Lambda function.

For more information, refer to Route incoming records to a single Iceberg table and Route incoming records to different Iceberg tables.

  8. Under Backup settings, specify an S3 backup bucket.
  9. For Existing IAM roles under Advanced settings, choose the IAM role you created for Firehose.
  10. Choose Create Firehose stream.

The Firehose stream will be available and it will stream data from the Amazon MSK topic to the Iceberg table. You can verify it by querying the Iceberg table using an Athena query.

Clean up

It’s always a good practice to clean up the resources created as part of this post to avoid additional costs. To clean up your resources, delete the MSK cluster, Firehose stream, Iceberg S3 table bucket, S3 general purpose bucket, and CloudWatch logs.

Conclusion

In this post, we demonstrated two approaches for data streaming from Amazon MSK to data lakes using Firehose: direct streaming to Iceberg tables in Amazon S3, and streaming to S3 Tables. Firehose alleviates the complexity of traditional data pipeline management by offering a fully managed, no-code approach that handles data transformation, compression, and error handling automatically. The seamless integration between Amazon MSK, Firehose, and Iceberg format in Amazon S3 demonstrates AWS’s commitment to simplifying big data architectures while maintaining the robust features of ACID compliance and advanced query capabilities that modern data lakes demand. We hope you found this post helpful and encourage you to try out this solution and simplify your streaming data pipelines to Iceberg tables.


About the authors

Pratik Patel is a Sr. Technical Account Manager and streaming analytics specialist. He works with AWS customers and provides ongoing support and technical guidance to help plan and build solutions using best practices and proactively keep customers’ AWS environments operationally healthy.

Amar is a seasoned Data Analytics specialist at AWS UK, who helps AWS customers to deliver large-scale data solutions. With deep expertise in AWS analytics and machine learning services, he enables organizations to drive data-driven transformation and innovation. He is passionate about building high-impact solutions and actively engages with the tech community to share knowledge and best practices in data analytics.

Priyanka Chaudhary is a Senior Solutions Architect and data analytics specialist. She works with AWS customers as their trusted advisor, providing technical guidance and support in building Well-Architected, innovative industry solutions.

Reduce time to access your transactional data for analytical processing using the power of Amazon SageMaker Lakehouse and zero-ETL

Post Syndicated from Avijit Goswami original https://aws.amazon.com/blogs/big-data/reduce-time-to-access-your-transactional-data-for-analytical-processing-using-the-power-of-amazon-sagemaker-lakehouse-and-zero-etl/

As the lines between analytics and AI continue to blur, organizations find themselves dealing with converging workloads and data needs. Historical analytics data is now being used to train machine learning models and power generative AI applications. This shift requires shorter time to value and tighter collaboration among data analysts, data scientists, machine learning (ML) engineers, and application developers. However, the reality of data scattered across various systems, from data lakes to data warehouses and applications, makes it difficult to access and use data efficiently.

Moreover, organizations attempting to consolidate disparate data sources into a data lakehouse have historically relied on extract, transform, and load (ETL) processes, which have become a significant bottleneck in their data analytics and machine learning initiatives. Traditional ETL processes are often complex, requiring significant time and resources to build and maintain. As data volumes grow, so do the costs associated with ETL, leading to delayed insights and increased operational overhead. Many organizations find themselves struggling to efficiently onboard transactional data into their data lakes and warehouses, hindering their ability to derive timely insights and make data-driven decisions.

In this post, we address these challenges with a two-pronged approach:

  • Unified data management: Using Amazon SageMaker Lakehouse to get unified access to all your data across multiple sources for analytics and AI initiatives with a single copy of data, regardless of how and where the data is stored. SageMaker Lakehouse is powered by AWS Glue Data Catalog and AWS Lake Formation and brings together your existing data across Amazon Simple Storage Service (Amazon S3) data lakes and Amazon Redshift data warehouses with integrated access controls. In addition, you can ingest data from operational databases and enterprise applications into the lakehouse in near real time using zero-ETL, a set of fully managed AWS integrations that eliminates or minimizes the need to build ETL data pipelines.
  • Unified development experience: Using Amazon SageMaker Unified Studio to discover your data and put it to work using familiar AWS tools for complete development workflows, including model development, generative AI application development, data processing, and SQL analytics, in a single governed environment.

In this post, we demonstrate how you can bring transactional data from AWS OLTP data stores such as Amazon Relational Database Service (Amazon RDS) and Amazon Aurora into Amazon Redshift using zero-ETL integrations, and then into a SageMaker Lakehouse federated catalog (Bring your own Amazon Redshift into SageMaker Lakehouse). With this integration, you can onboard changed data from OLTP systems to a unified lakehouse and expose it to analytical applications for consumption using Apache Iceberg APIs from the new SageMaker Unified Studio. Through this integrated environment, data analysts, data scientists, and ML engineers can use SageMaker Unified Studio to perform advanced SQL analytics on the transactional data.

Architecture patterns for a unified data management and unified development experience

In this architecture pattern, we show you how to use zero-ETL integrations to replicate transactional data from Amazon Aurora MySQL-Compatible Edition, an operational database, into the Redshift Managed Storage layer. This zero-ETL approach eliminates the need for complex data extraction, transformation, and loading processes, enabling near real-time access to operational data for analytics. The transferred data is then cataloged using a federated catalog in the SageMaker Lakehouse catalog and exposed through the Iceberg REST Catalog API, so that consumer applications can analyze it.

You then use SageMaker Unified Studio to perform advanced analytics on the transactional data, bridging the gap between operational databases and advanced analytics capabilities.

Prerequisites

Make sure that you have the following prerequisites:

Deployment steps

In this section, we share steps for deploying resources needed for Zero-ETL integration using AWS CloudFormation.

Set up resources with CloudFormation

This post provides a CloudFormation template as a general guide. You can review and customize it to suit your needs. Some of the resources that this stack deploys incur costs when in use. The CloudFormation template provisions the following components:

  1. An Aurora MySQL provisioned cluster (source).
  2. An Amazon Redshift Serverless data warehouse (target).
  3. Zero-ETL integration between the source (Aurora MySQL) and target (Amazon Redshift Serverless). See Aurora zero-ETL integrations with Amazon Redshift for more information.

Create your resources

To create resources using AWS CloudFormation, follow these steps:

  1. Sign in to the AWS Management Console.
  2. Select the us-east-1 AWS Region in which to create the stack.
  3. Open the AWS CloudFormation console.
  4. Choose Launch Stack
    https://us-east-1.console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/create/template?templateURL=https://aws-blogs-artifacts-public.s3.us-east-1.amazonaws.com/BDB-4866/aurora-zero-etl-redshift-lakehouse-cfn.yaml
  5. Choose Next.
    This automatically launches CloudFormation in your AWS account with a template. It prompts you to sign in as needed. You can view the CloudFormation template from within the console.
  6. For Stack name, enter a stack name, for example UnifiedLHBlogpost.
  7. Keep the default values for the rest of the Parameters and choose Next.
  8. On the next screen, choose Next.
  9. Review the details on the final screen and select I acknowledge that AWS CloudFormation might create IAM resources.
  10. Choose Submit.

Stack creation can take up to 30 minutes.

  1. After the stack creation is complete, go to the Outputs tab of the stack and record the values of the keys for the following components, which you will use in a later step:
    • NamespaceName
    • PortNumber
    • RDSPassword
    • RDSUsername
    • RedshiftClusterSecurityGroupName
    • RedshiftPassword
    • RedshiftUsername
    • VPC
    • Workgroupname
    • ZeroETLServicesRoleNameArn
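
If you prefer to collect these values programmatically instead of copying them from the Outputs tab, the Outputs list returned by the CloudFormation DescribeStacks API can be flattened into a dictionary. The following is a minimal sketch that assumes you already hold one stack entry from that response as a Python dictionary. The helper name collect_outputs and the sample values are hypothetical and only stand in for real stack data.

```python
# Output keys from the Outputs tab that later steps rely on.
REQUIRED_KEYS = [
    "NamespaceName", "PortNumber", "RDSPassword", "RDSUsername",
    "RedshiftClusterSecurityGroupName", "RedshiftPassword",
    "RedshiftUsername", "VPC", "Workgroupname", "ZeroETLServicesRoleNameArn",
]


def collect_outputs(stack):
    """Flatten a stack's Outputs list into {OutputKey: OutputValue}.

    `stack` is assumed to be one entry of the "Stacks" list in a
    DescribeStacks response. Raises KeyError if a required key is missing.
    """
    outputs = {o["OutputKey"]: o["OutputValue"] for o in stack.get("Outputs", [])}
    missing = [key for key in REQUIRED_KEYS if key not in outputs]
    if missing:
        raise KeyError(f"stack is missing outputs: {', '.join(missing)}")
    return outputs


# Hypothetical sample data standing in for a real DescribeStacks response.
sample_stack = {
    "StackName": "UnifiedLHBlogpost",
    "Outputs": [{"OutputKey": k, "OutputValue": f"example-{k}"} for k in REQUIRED_KEYS],
}
outputs = collect_outputs(sample_stack)
print(outputs["Workgroupname"])
```

Failing fast on a missing key makes it obvious when the stack has not finished creating or when the template was customized to drop an output.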

Implementation steps

To implement this solution, follow these steps:

Setting up zero-ETL integration

A zero-ETL integration is already created as part of the provided CloudFormation template. Use the following steps from the Zero-ETL integration post to complete setting up the integration:

  1. Create a database from integration in Amazon Redshift
  2. Populate source data in Aurora MySQL
  3. Validate the source data in your Amazon Redshift data warehouse
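
For the first step, Amazon Redshift creates the destination database with a CREATE DATABASE ... FROM INTEGRATION statement, and the integration ID can be looked up in the svv_integration system view. The following sketch only builds those SQL strings so you can review them before running them in your SQL client. The helper name create_database_sql, the validation rules, and the integration ID shown are assumptions for illustration, not values from the stack.

```python
import re

# Query that lists the zero-ETL integrations visible to the Redshift namespace.
FIND_INTEGRATION_SQL = "SELECT integration_id FROM svv_integration;"


def create_database_sql(database_name, integration_id):
    """Build the CREATE DATABASE ... FROM INTEGRATION statement as a string."""
    # Accept only simple identifiers and ID-like values to avoid SQL injection.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", database_name):
        raise ValueError(f"unsafe database name: {database_name!r}")
    if not re.fullmatch(r"[0-9a-fA-F-]+", integration_id):
        raise ValueError(f"unexpected integration id: {integration_id!r}")
    return f"CREATE DATABASE {database_name} FROM INTEGRATION '{integration_id}';"


# The integration ID below is a made-up placeholder.
print(FIND_INTEGRATION_SQL)
print(create_database_sql("aurora_zeroetl_integration",
                          "11111111-2222-3333-4444-555555555555"))
```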

Bring Amazon Redshift metadata to the SageMaker Lakehouse catalog

Now that transactional data from Aurora MySQL is replicating into Redshift tables through zero-ETL integration, you next bring the data into SageMaker Lakehouse, so that operational data can co-exist and be accessed and governed together with other data sources in the data lake. You do this by registering an existing Amazon Redshift Serverless namespace that has Zero-ETL tables as a federated catalog in SageMaker Lakehouse.

Before starting the next steps, you need to configure data lake administrators in AWS Lake Formation.

  1. Go to the Lake Formation console and, in the navigation pane, choose Administrative roles and tasks under Administration. Under Data lake administrators, choose Add.
  2. In the Add administrators page, under Access type, select Data Lake administrator.
  3. Under IAM users and roles, select Admin. Choose Confirm.


  1. On the Add administrators page, for Access type, select Read-only administrators. Under IAM users and roles, select AWSServiceRoleForRedshift and choose Confirm. This step enables Amazon Redshift to discover and access catalog objects in AWS Glue Data Catalog.


With the data lake administrators configured, you’re ready to bring your existing Amazon Redshift metadata to SageMaker Lakehouse catalog:

  1. From the Amazon Redshift Serverless console, choose Namespace configuration in the navigation pane.
  2. Under Actions, choose Register with AWS Glue Data Catalog. You can find more details on registering a federated Amazon Redshift catalog in Registering namespaces to the AWS Glue Data Catalog.

  1. Choose Register. This registers the namespace to AWS Glue Data Catalog.

  1. After registration is complete, the Namespace register status will change to Registered to AWS Glue Data Catalog.
  2. Navigate to the Lake Formation console and choose Catalogs under Data Catalog in the navigation pane. Here you can see that a pending catalog invitation is available for the Amazon Redshift namespace registered in Data Catalog.

  1. Select the pending invitation and choose Approve and create catalog. For more information, see Creating Amazon Redshift federated catalogs.

  1. Enter the Name, Description, and IAM role (created by the CloudFormation template). Choose Next.

  1. Grant permissions using a principal that is eligible to provide all permissions (an admin user).
    • Select IAM users and roles and choose Admin.
    • Under Catalog permissions, select Super user to grant super user permissions.

  1. Assigning super user permissions grants the user unrestricted permissions to the resources (databases, tables, views) within this catalog. As a security best practice, follow the principle of least privilege and grant users only the permissions required to perform a task wherever applicable.

  1. As a final step, review all settings and choose Create Catalog.

After the catalog is created, you will see two objects under Catalogs. dev refers to the local dev database inside Amazon Redshift, and aurora_zeroetl_integration is the database created for the Aurora to Amazon Redshift zero-ETL tables.

Fine-grained access control

To set up fine-grained access control, follow these steps:

  1. To grant permission to individual objects, choose Action and then select Grant.

  1. On the Principals page, grant access to individual objects or more than one object to different principals under the federated catalog.

Access lakehouse data using SageMaker Unified Studio

SageMaker Unified Studio provides an integrated experience outside the console to use all your data for analytics and AI applications. In this post, we show you how to use the new experience through the Amazon SageMaker management console to create a SageMaker platform domain using the quick setup method. To do this, you set up IAM Identity Center, a SageMaker Unified Studio domain, and then access data through SageMaker Unified Studio.

Set up IAM Identity Center

Before creating the domain, make sure that your data admins and data workers are ready to use the Unified Studio experience by enabling IAM Identity Center for single sign-on, following the steps in Setting up Amazon SageMaker Unified Studio. You can use Identity Center to set up single sign-on for individual accounts and for accounts managed through AWS Organizations. Add users or groups to the IAM Identity Center instance as appropriate. The following screenshot shows an example email sent to a user through which they can activate their account in IAM Identity Center.

Set up SageMaker Unified Studio domain

Follow the steps in Create an Amazon SageMaker Unified Studio domain – quick setup to set up a SageMaker Unified Studio domain. You need to choose the VPC that was created by the CloudFormation stack earlier.

The quick setup method also has a Create VPC option that sets up a new VPC, subnets, NAT Gateway, VPC endpoints, and so on, and is meant for testing purposes. There are charges associated with this, so delete the domain after testing.

If you see the No models accessible message, you can use the Grant model access button to grant access to Amazon Bedrock serverless models for use in SageMaker Unified Studio for AI/ML use cases.

  1. Fill in the sections for Domain Name. For example, MyOLTPDomain. In the VPC section, select the VPC that was provisioned by the CloudFormation stack, for example UnifiedLHBlogpost-VPC. Select subnets and choose Continue.

  1. In the IAM Identity Center User section, look up the newly created user (for example, Data User1) and add them to the domain. Choose Create Domain. You should see the new domain along with a link to open Unified Studio.

Access data using SageMaker Unified Studio

To access and analyze your data in SageMaker Unified Studio, follow these steps:

    1. Select the URL for SageMaker Unified Studio. Choose Sign in with SSO and sign in using the IAM Identity Center user, for example datauser1. You will be prompted to select a multi-factor authentication (MFA) method.
    2. Select Authenticator App and proceed with the next steps. For more information about SSO setup, see Managing users in Amazon SageMaker Unified Studio.
    3. After you have signed in to the Unified Studio domain, set up a new project. For this illustration, we created a new sample project called MyOLTPDataProject using the project profile for SQL Analytics. A project profile is a template for a project that defines what blueprints are applied to the project, including underlying AWS compute and data resources. Wait for the new project to be set up, and when its status is Active, open the project in Unified Studio.
    4. By default, the project has access to the default Data Catalog (AWSDataCatalog). For the federated Redshift catalog redshift-consumer-catalog to be visible, you need to grant permissions to the project IAM role using Lake Formation. For this example, we used the Lake Formation console to grant the Unified Studio project IAM role access to the demodb database that is part of the zero-ETL catalog. Follow the steps in Adding existing databases and catalogs using AWS Lake Formation permissions.
    5. In the Data section of your SageMaker Unified Studio project, connect to the Lakehouse federated catalog that you created and registered earlier (for example, redshift-zetl-auroramysql-catalog/aurora_zeroetl_integration). Select the objects that you want to query and run your queries using the Redshift Query Editor integrated with SageMaker Unified Studio.
    6. If you select Redshift, you will be transferred to the Query editor, where you can run the SQL and see the results as shown in the following figure.

With this integration of Amazon Redshift metadata with the SageMaker Lakehouse federated catalog, you can access your existing Redshift data warehouse objects in your organization’s centralized catalog managed by SageMaker Lakehouse and join the existing Redshift data seamlessly with the data stored in your Amazon S3 data lake. This solution helps you avoid unnecessary ETL processes to copy data between the data lake and the data warehouse and minimize data redundancy.

You can further integrate more data sources serving transactional workloads, such as Amazon DynamoDB, and enterprise applications such as Salesforce and ServiceNow. The architecture shared in this post for accelerated analytical processing using zero-ETL and SageMaker Lakehouse can be expanded by adding zero-ETL integrations for DynamoDB using DynamoDB zero-ETL integration with Amazon SageMaker Lakehouse, and for enterprise applications by following the instructions in Simplify data integration with AWS Glue and zero-ETL to Amazon SageMaker Lakehouse.

Clean up

When you’re finished, delete the CloudFormation stack, because some of the AWS resources used in this walkthrough incur a cost. Complete the following steps:

  1. On the CloudFormation console, choose Stacks.
  2. Choose the stack you launched in this walkthrough. The stack must be currently running.
  3. In the stack details pane, choose Delete.
  4. Choose Delete stack.
  5. On the SageMaker console, choose Domains and delete the domain created for testing.

Summary

In this post, you’ve learned how to bring data from operational databases and applications into your lakehouse in near real time through zero-ETL integrations. You’ve also learned about a unified development experience in which you create a project, bring the operational data into the lakehouse, access it through SageMaker Unified Studio, and query it using the integration with Amazon Redshift Query Editor. You can use the following resources in addition to this post to quickly start your journey to make your transactional data available for analytical processing.

  1. AWS zero-ETL
  2. SageMaker Unified Studio
  3. SageMaker Lakehouse
  4. Getting started with Amazon SageMaker Lakehouse


About the authors

Avijit Goswami is a Principal Data Solutions Architect at AWS specialized in data and analytics. He supports AWS strategic customers in building high-performing, secure, and scalable data lake solutions on AWS using AWS managed services and open-source solutions. Outside of his work, Avijit likes to travel, hike in the San Francisco Bay Area trails, watch sports, and listen to music.

Saman Irfan is a Senior Specialist Solutions Architect focusing on Data Analytics at Amazon Web Services. She focuses on helping customers across various industries build scalable and high-performant analytics solutions. Outside of work, she enjoys spending time with her family, watching TV series, and learning new technologies.

Sudarshan Narasimhan is a Principal Solutions Architect at AWS specialized in data, analytics and databases. With over 19 years of experience in Data roles, he is currently helping AWS Partners & customers build modern data architectures. As a specialist & trusted advisor he helps partners build & GTM with scalable, secure and high performing data solutions on AWS. In his spare time, he enjoys spending time with his family, travelling, avidly consuming podcasts and being heartbroken about Man United’s current state.

Simplify enterprise data access using the Amazon Redshift integration with Amazon S3 Access Grants

Post Syndicated from Maneesh Sharma original https://aws.amazon.com/blogs/big-data/simplify-enterprise-data-access-using-the-amazon-redshift-integration-with-amazon-s3-access-grants/

Scaling data access securely while maintaining operational efficiency is a critical challenge for organizations. Access rights are often fragmented across various AWS services, as different business units own and manage different data stores, such as Amazon Simple Storage Service (Amazon S3) and Amazon Redshift. As data grows, modeling access in AWS Identity and Access Management (IAM) policies becomes challenging for data owners, as they try to manage access for different groups and users across accounts in the organization. Managing these distributed access rights requires substantial overhead, because security teams and data owners must collaborate to update and monitor permissions to make sure data is only accessible to authorized users.

Recognizing this challenge, the Amazon S3 Access Grants integration with Amazon Redshift allows centralized user authentication through AWS IAM Identity Center, providing unified identity across the organization. S3 Access Grants allows specific IAM Identity Center users or groups to access registered Amazon S3 locations through a grant. Creating a grant with a group as grantee lets the group members access only the S3 bucket, prefix, or object within the grant’s scope. This means that access can be managed by simply creating a grant for a group and adding or removing the user from the group, reducing administrative overhead.

In this post, we show how to grant Amazon S3 permissions to IAM Identity Center users and groups using S3 Access Grants. We also test the integration using an IAM Identity Center federated user to unload data from Amazon Redshift to Amazon S3 and load data from Amazon S3 to Amazon Redshift.

Solution overview

This post covers a use case where a large organization manages thousands of corporate users across multiple business units through their identity provider (IdP). These users regularly interact with vast amounts of data stored across numerous S3 buckets, frequently performing extract, transform, and load (ETL) operations through Amazon Redshift. Their goal is to have a simpler ETL process of data loading and unloading operations in Amazon Redshift without managing multiple IAM roles and policies for Amazon S3 access. Also, they want a centralized access management solution that seamlessly integrates their corporate identities from existing IdP with AWS services.

For this solution, AWS Organizations is enabled and IAM Identity Center is configured in the delegated administration account. The organization has two member accounts: Member Account 1 runs analytical workloads on Amazon Redshift, with all the services enabled with trusted identity propagation, and Member Account 2 manages data stored in Amazon S3; here you will set up S3 Access Grants. Amazon Redshift will load the user-specific data from Amazon S3 stored in Member Account 2 using access control based on IAM Identity Center users and groups. This improves the user experience by maintaining a single authentication mechanism within an organization while retaining access control and resource separation, using AWS accounts as a boundary per business unit.

The following diagram illustrates the solution architecture.

Figure 1: Architecture showing the solution


To run this solution in a single account, configure Amazon Redshift and S3 Access Grants with account instances of IAM Identity Center. Review When to use account instances for more information.

The solution workflow includes the following steps:

  1. The user configures and connects with their respective clients (such as Amazon Redshift Query Editor v2 or a SQL client) to access Amazon Redshift using IAM Identity Center.
  2. A new browser window opens and is redirected to the login page of the IdP.
  3. The user logs in with their IdP user name and password.
  4. After the login is successful, the user is redirected to the client application, such as the Amazon Redshift Query Editor.
  5. When the user tries to access data in Amazon S3 using the COPY or UNLOAD SQL command, Amazon Redshift in Member Account 1 requests credentials from the S3 Access Grants instance in Member Account 2, where the Amazon S3 data is stored. This request contains the user context.
  6. S3 Access Grants will then evaluate the request against the grants it has, matching the identity specified in the grant with the one received in the request. If there is a match, the requestor will receive temporary access to the Amazon S3 locations specified in the grant’s scope.
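
The matching described in the last step can be pictured with a small model. The following sketch is a simplified illustration of the idea, not the service's actual implementation: it assumes a grant is just a grantee ID, an S3 prefix, and a permission level, and that a request matches when the identity, the path, and the requested permission all line up. The class and function names, the identity IDs, and the folder names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    grantee_id: str   # IAM Identity Center user or group ID
    scope: str        # S3 prefix covered by the grant
    permission: str   # "READ", "WRITE", or "READWRITE"


def find_matching_grant(grants, identity_ids, s3_path, requested):
    """Return the first grant that covers the request, or None.

    identity_ids: the user's own ID plus the IDs of the groups they belong to.
    requested: "READ" or "WRITE".
    """
    for grant in grants:
        if grant.grantee_id not in identity_ids:
            continue  # grant belongs to a different user or group
        if not s3_path.startswith(grant.scope):
            continue  # requested path is outside the grant's scope
        if grant.permission in ("READWRITE", requested):
            return grant
    return None


grants = [
    Grant("group-sales", "s3://amzn-s3-demo-bucket/awssso-sales/", "READWRITE"),
    Grant("group-finance", "s3://amzn-s3-demo-bucket/awssso-finance/", "READ"),
]

# A sales user unloading into the sales folder is covered by the first grant.
match = find_matching_grant(
    grants, {"user-1", "group-sales"},
    "s3://amzn-s3-demo-bucket/awssso-sales/unload/part-0000", "WRITE")
print(match.scope if match else "access denied")

# The same user has no grant on the finance folder.
denied = find_matching_grant(
    grants, {"user-1", "group-sales"},
    "s3://amzn-s3-demo-bucket/awssso-finance/report.csv", "READ")
print(denied)
```

In the real service, a successful match results in temporary credentials scoped to the grant rather than a returned object.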

To implement the solution, we walk you through the following steps:

  1. Enable S3 Access Grants in your Amazon Redshift managed application.
  2. Update IAM role permissions used in the application.
  3. Create a bucket for S3 Access Grants.
  4. Create an IAM policy and role for S3 Access Grants.
  5. Set up S3 Access Grants.
  6. Allow cross-account access of resources.
  7. Create Redshift tables.
  8. Unload and load data in Amazon Redshift.

Prerequisites

You should have the following prerequisites already set up:

Enable S3 Access Grants from the Amazon Redshift managed application

After you have created your Redshift application in IAM Identity Center, you need to perform the following steps to enable S3 Access Grants in the account where Amazon Redshift exists. For this post, we use Member Account 1:

  1. Log in to the AWS Management Console as admin.
  2. On the Amazon Redshift console, choose IAM Identity Center connection in the navigation pane.
  3. Select the managed Redshift application and choose Edit.
  4. Choose Amazon S3 access grants in Trusted identity propagation.
  5. Choose Save changes.

The following screenshot shows the updated configuration.

Figure 2: Redshift managed application


Update the IAM role permission attached to the Amazon Redshift managed application

The Amazon Redshift managed application has an IAM role attached (in the preceding screenshot, you can see the role called IAMIDCRedshiftRole under IAM role for IAM Identity Center access). We now need to modify the policy on this role and add permissions to allow interaction with Amazon S3. Edit the role and add s3:GetAccessGrantsInstanceForPrefix and s3:GetDataAccess as shown in the following policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowGetRedshiftInformation",
            "Effect": "Allow",
            "Action": [
                "redshift-serverless:ListNamespaces",
                "redshift-serverless:ListWorkgroups",
                "redshift:DescribeQev2IdcApplications",
                "redshift-serverless:GetWorkgroup"
            ],
            "Resource": "*"
        },
        {
            "Sid": "AllowDescribeIdentityCenter",
            "Effect": "Allow",
            "Action": [
                "sso:DescribeApplication",
                "sso:DescribeInstance"
            ],
            "Resource": [
                "arn:aws:sso:::instance/<IAM Identity Center Instance ID>",
                "arn:aws:sso::<Delegated Adminstration AWS Account ID>:application/<IAM Identity Center Instance ID>/*"
            ]
        },
        {
            "Sid": "RetrieveAGinstanceforParticularPrefix",
            "Effect": "Allow",
            "Action": [
                "s3:GetAccessGrantsInstanceForPrefix"
            ],
            "Resource": "*"
        },
        {
            "Sid": "CrossAccountAccessGrantsPolicy",
            "Effect": "Allow",
            "Action": [
                "s3:GetDataAccess"
            ],
            "Resource": "arn:aws:s3:<region>:<AWS Account of S3 Access Grant>:access-grants/default"
        }
    ]
}

Replace <IAM Identity Center Instance ID> with your IAM Identity Center instance ID and <Delegated Adminstration AWS Account ID> with the account ID where IAM Identity Center is set up. You also need to replace the resource in the CrossAccountAccessGrantsPolicy statement with your S3 Access Grants instance information.
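
To avoid hand-editing the placeholders, the two statements added for S3 Access Grants can be generated from the Region and the account that hosts the S3 Access Grants instance. The following is a minimal sketch under that assumption. The function name access_grants_statements is hypothetical, and the Region and account ID are example values to substitute with your own.

```python
import json


def access_grants_statements(region, grants_account_id):
    """Return the two S3 Access Grants statements added to the role policy."""
    instance_arn = f"arn:aws:s3:{region}:{grants_account_id}:access-grants/default"
    return [
        {
            "Sid": "RetrieveAGinstanceforParticularPrefix",
            "Effect": "Allow",
            "Action": ["s3:GetAccessGrantsInstanceForPrefix"],
            "Resource": "*",
        },
        {
            "Sid": "CrossAccountAccessGrantsPolicy",
            "Effect": "Allow",
            "Action": ["s3:GetDataAccess"],
            "Resource": instance_arn,
        },
    ]


# Example Region and account ID; replace them with your own values.
statements = access_grants_statements("us-east-1", "111122223333")
print(json.dumps(statements, indent=4))
```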

Create an S3 bucket for S3 Access Grants

In this step, you create an S3 bucket that you want to grant access to, or use an existing bucket. For this post, we create a bucket called amzn-s3-demo-bucket. You can choose another appropriate name. For more information, see Creating a general purpose bucket.

The bucket must be located in the same AWS Region as your S3 Access Grants instance and IAM Identity Center.

Next, create two folders in the newly created S3 bucket. If you’re using an existing S3 bucket, identify two folders to use for this walkthrough. For this blog post, we create two folders: awssso-sales and awssso-finance, under a bucket named amzn-s3-demo-bucket. The purpose of creating two folders is so that users from different groups have access only to their respective folder.

Create an IAM policy and role for S3 Access Grants

Complete the following steps to create an IAM policy to scope the permissions for a specific access grant:

  1. Create an IAM policy with the following permissions. For more information on creating IAM policy, see Create IAM policies. To get additional information on the following specific policy, refer to Register a location.
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ObjectLevelReadPermissions",
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:GetObjectVersion",
                    "s3:GetObjectAcl",
                    "s3:GetObjectVersionAcl",
                    "s3:ListMultipartUploadParts"
                ],
                "Resource": "arn:aws:s3:::<bucket-name>/*",
                "Condition": {
                    "StringEquals": {
                        "aws:ResourceAccount": "<AWS Account of S3 Access Grant>"
                    },
                    "ArnEquals": {
                        "s3:AccessGrantsInstanceArn": [
                            "arn:aws:s3:<region>:<AWS Account of S3 Access Grant>:access-grants/default"
                        ]
                    }
                }
            },
            {
                "Sid": "ObjectLevelWritePermissions",
                "Effect": "Allow",
                "Action": [
                    "s3:PutObject",
                    "s3:PutObjectAcl",
                    "s3:PutObjectVersionAcl",
                    "s3:DeleteObject",
                    "s3:DeleteObjectVersion",
                    "s3:AbortMultipartUpload"
                ],
                "Resource": "arn:aws:s3:::<bucket-name>/*",
                "Condition": {
                    "StringEquals": {
                        "aws:ResourceAccount": "<AWS Account of S3 Access Grant>"
                    },
                    "ArnEquals": {
                        "s3:AccessGrantsInstanceArn": "arn:aws:s3:<region>:<AWS Account of S3 Access Grant>:access-grants/default"
                    }
                }
            },
            {
                "Sid": "BucketLevelReadPermissions",
                "Effect": "Allow",
                "Action": [
                    "s3:ListBucket"
                ],
                "Resource": "arn:aws:s3:::<bucket-name>",
                "Condition": {
                    "StringEquals": {
                        "aws:ResourceAccount": "<AWS Account of S3 Access Grant>"
                    },
                    "ArnEquals": {
                        "s3:AccessGrantsInstanceArn": "arn:aws:s3:<region>:<AWS Account of S3 Access Grant>:access-grants/default"
                    }
                }
            }
        ]
    }

  2. Create an IAM role that has permission to access your S3 data in the Region. For more information, see IAM role creation. In this example, we create an IAM role called iamidcs3accessgrant. You need to attach the preceding policy to the IAM role.
  3. Use the following trust policy for the IAM role:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ForAccessGrants",
                "Effect": "Allow",
                "Principal": {
                    "Service": "access-grants.s3.amazonaws.com"
                },
                "Action": [
                    "sts:AssumeRole",
                    "sts:SetContext",
                    "sts:SetSourceIdentity"
                ],
                "Condition": {
                    "StringEquals": {
                        "aws:SourceAccount": "<accountId>",
                        "aws:SourceArn": "arn:aws:s3:<region>:<accountId>:access-grants/default"
                    }
                }
            }
        ]
    }
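
The trust policy above differs between environments only in the account ID and Region, so it can be generated rather than edited by hand. The following sketch assumes the default S3 Access Grants instance ARN format shown in the policy. The function name access_grants_trust_policy is hypothetical, and the sample Region and account ID are placeholders.

```python
import json


def access_grants_trust_policy(region, account_id):
    """Build the trust policy for the IAM role used by the registered location."""
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("AWS account IDs are 12 digits")
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "ForAccessGrants",
            "Effect": "Allow",
            "Principal": {"Service": "access-grants.s3.amazonaws.com"},
            "Action": ["sts:AssumeRole", "sts:SetContext", "sts:SetSourceIdentity"],
            "Condition": {"StringEquals": {
                "aws:SourceAccount": account_id,
                "aws:SourceArn": f"arn:aws:s3:{region}:{account_id}:access-grants/default",
            }},
        }],
    }


# Example values; replace them with your own Region and account ID.
trust_policy = access_grants_trust_policy("us-east-1", "111122223333")
print(json.dumps(trust_policy, indent=4))
```

The resulting JSON string can be pasted into the console, or supplied as the role's trust policy document if you create the role through the IAM API.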

Set up S3 Access Grants

The S3 Access Grants instance serves as the container for your S3 Access Grants resources, which include registered locations and grants. You can create only one S3 Access Grants instance per Region per account. You can associate this S3 Access Grants instance to your corporate directory with your IAM Identity Center instance. After you’ve done so, you can create grants for your corporate users and groups. S3 Access Grants requires registering a location to map an S3 bucket or prefix to an IAM role, enabling secure access by providing temporary credentials to grantees for that specific location.

Complete the following steps to set up S3 Access Grants:

  1. On the Amazon S3 console, choose your preferred Region.
  2. In the navigation pane, choose Access Grants.
  3. Choose Create S3 Access Grants instance.
  4. Select Add IAM Identity Center instance in <region> and enter the IAM Identity Center instance Amazon Resource Name (ARN). For this post, we use the delegated administration account IAM Identity Center ARN.
  5. Choose Next.

    Figure 3: S3 Access Grants instance


  6. After you create an Amazon S3 Access Grants instance in a Region in your account, you register an Amazon S3 location in that instance. For Location scope, choose Browse S3 or enter the S3 URI path to the location that you want to register. After you enter a URI, you can choose View to browse the location. In this example, we provide the scope as s3://amzn-s3-demo-bucket.
  7. For IAM role, select Choose from existing IAM roles and choose the IAM role you previously created (iamidcs3accessgrant).
  8. Choose Next.

This will register a location in your S3 Access Grants instance.

Figure 4: S3 Access Grants instance location scope


  1. You will now create a grant.
    1. If you selected the default Amazon S3 location, use the Subprefix box to narrow the scope of the access grant. For more information, see Working with grants in S3 Access Grants.
    2. If you’re granting access only to an object, select Grant scope is an object. In our example, we register the location as s3://amzn-s3-demo-bucket and then for the subprefix, we specify the folder name followed by an asterisk (awssso-sales/*).
  2. Under Permissions and access, select the Permission level, either Read, Write, or both. In this example, we select both because we will first unload data from Amazon Redshift to Amazon S3 and then copy it from the same bucket back into Amazon Redshift.
  3. For Grantee type, choose Directory identity from IAM Identity Center.
  4. For Directory identity type, you can choose either User or Group. In this example, we choose Group.
  5. For IAM Identity Center group ID, enter the ID of the IAM Identity Center group that you want to grant access to.

To get this value, open the IAM Identity Center console and choose Groups in the navigation pane, then choose one of the groups you want to provide access and copy the value under Group ID. In the following example, we collect the group ID information from the delegated administration account.

Figure 5: IAM Identity Center group information

  1. Choose Next.

    Figure 6: S3 Access Grants instance permissions and access

  2. Choose Finish.

    Figure 7: S3 Access Grants instance review information page

You can view the details of the access grant on the Amazon S3 console, as shown in the following screenshot. For more information, see View a grant.

Figure 8: S3 Access Grants grants

Similarly, you can get the details of a location that’s registered in your S3 Access Grants instance. For more information, see View the details of a registered location.

Figure 9: S3 Access Grants locations

Allow cross-account access of resources and create initial tables

Now we want to share resources to make our cross-account scenario work. This step is only needed if your Amazon Redshift and Amazon S3 resources are in different accounts. This should be done in the account where Amazon S3 is set up. Complete the following steps:

  1. On the AWS RAM console, in the navigation pane, choose Resource shares.
  2. Choose Create resource share.
  3. For Name, enter a descriptive name for the resource share (for example, s3accessgrant).
  4. For Resources – optional, choose S3 Access Grants. The S3 Access Grants instance you created will be shown; select the default S3 Access Grants instance ARN.
  5. Choose Next.
  6. Under Managed permission for s3:AccessGrants, you can choose to associate a managed permission created by AWS with the resource type, choose an existing customer managed permission, or create your own customer managed permission for supported resource types. In this post, we choose the existing permission named AWSRAMPermissionAccessGrantsData.
  7. Choose Next.
  8. For Grant access to principals, choose Allow sharing only within your organization and enter the account ID where the Redshift instance exists.
  9. Choose Add.
  10. Choose Next.
  11. Choose Create resource share.

The following screenshot shows the new resource share details.

Figure 10: AWS RAM – create resource share wizard

Create tables in Amazon Redshift

As an Amazon Redshift admin user, you first need to create the tables used in this walkthrough. In the following code, we create a new store_sales_s3access table, which we load data into later with the COPY command:

CREATE TABLE IF NOT EXISTS 
sales_schema.store_sales_s3access ( 
ID INTEGER ENCODE az64, 
Product varchar(20), 
Sales_Amount INTEGER ENCODE az64 
) 
DISTSTYLE AUTO ;

Also make sure the following permissions are applied on the respective IAM Identity Center group; this group is represented in Amazon Redshift as a Redshift role. For this post, we grant permissions to the awssso-sales group:

grant usage on schema sales_schema to role "awsidc:awssso-sales";
grant select,insert  for tables in schema sales_schema to role "awsidc:awssso-sales";

As an Amazon Redshift admin user, you have created a Redshift table and assigned relevant permissions to the Redshift database role awsidc:awssso-sales. Now when an authenticated user that belongs to the group awssso-sales runs a query in Amazon Redshift to access Amazon S3 (such as a COPY, UNLOAD, or Amazon Redshift Spectrum operation), Amazon Redshift retrieves temporary Amazon S3 access credentials scoped to that IAM Identity Center user from S3 Access Grants. Amazon Redshift then uses the retrieved temporary credentials to access the authorized Amazon S3 locations for that query.

Unload and load data in Amazon Redshift

In this step, we log in to the Amazon Redshift Query Editor using IAM Identity Center authentication and run an UNLOAD command to unload data from a Redshift table into the S3 bucket. After that, we run the COPY command to load data back into Amazon Redshift from the same S3 location we unloaded it to.

Complete the following steps to access the Amazon Redshift Query Editor with an IAM Identity Center user:

  1. On the Amazon Redshift console, open the Amazon Redshift Query Editor.
  2. Choose (right-click) your Redshift instance and choose Create connection.
  3. Choose IAM Identity Center as your authentication method.
  4. A pop-up will appear. Because your IdP credentials are already cached, it uses the same credentials and connects to the Amazon Redshift Query Editor using IAM Identity Center authentication.

Now you’re ready to run the SQL queries in Amazon Redshift.

Unload data

As a federated IAM Identity Center user (Ethan), you first run an UNLOAD command to unload data from the Redshift table store_sales to the S3 location s3://amzn-s3-demo-bucket/awssso-sales/. Replace the S3 bucket name with the one you created.

UNLOAD ('SELECT * FROM "dev"."sales_schema"."store_sales"')
TO 's3://amzn-s3-demo-bucket/awssso-sales/';

The preceding command doesn’t include an IAM role ARN. This simplified syntax not only makes your code more readable, but also reduces the potential for configuration errors. The underlying permissions are handled automatically through S3 Access Grants and trusted identity propagation, maintaining robust security while simplifying permissions management.

Load data

Now we demonstrate a common data workflow using the same federated IAM Identity Center user (Ethan), who runs the COPY command against the same Amazon S3 location where we previously unloaded our data. Use the following command to load data into a separate table called store_sales_s3access:

copy dev.sales_schema.store_sales_s3access 
from 's3://amzn-s3-demo-bucket/awssso-sales/' delimiter '|';

If user Ethan tries to unload the "sales_schema"."store_sales" table to a different folder in the S3 bucket (awssso-finance), they get a permission denied error. This is because access is controlled by S3 Access Grants, and this user doesn’t have a grant to the awssso-finance folder. Use the following command to test the access denied use case:

UNLOAD ('SELECT * FROM "dev"."sales_schema"."store_sales"')
TO 's3://amzn-s3-demo-bucket/awssso-finance/';

Figure 11: QEv2 query result error
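
To make the allow and deny outcomes above concrete, here is a deliberately simplified sketch of a prefix-style scope check. It is not how S3 Access Grants is implemented; the service evaluates grants on the server side. The function name is_within_grant and the matching rule (a trailing asterisk means everything under the prefix) are assumptions for illustration only.

```python
def is_within_grant(target_uri: str, grant_scope: str) -> bool:
    """Simplified check: does target_uri fall under grant_scope?"""
    if grant_scope.endswith("*"):
        # A trailing asterisk is treated as "everything under this prefix".
        return target_uri.startswith(grant_scope[:-1])
    # Without a wildcard, require an exact match.
    return target_uri == grant_scope


# Scope used in this post: registered location plus the awssso-sales/* subprefix.
SALES_GRANT = "s3://amzn-s3-demo-bucket/awssso-sales/*"

for target in (
    "s3://amzn-s3-demo-bucket/awssso-sales/",
    "s3://amzn-s3-demo-bucket/awssso-finance/",
):
    verdict = "allowed" if is_within_grant(target, SALES_GRANT) else "denied"
    print(f"{target}: {verdict}")
```

Under these assumptions, the awssso-sales target falls inside the grant while the awssso-finance target does not, which mirrors the error Ethan receives.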

IAM Identity Center related operations are automatically captured and logged in AWS CloudTrail, offering enhanced visibility and comprehensive audit capabilities. To view detailed error information on the CloudTrail console, choose Event history in the navigation pane, then specify s3.amazonaws.com as the event source and open GetDataAccess.

The following screenshot shows a snippet from the CloudTrail logs indicating that user access is denied.

Figure 12: Amazon CloudTrail

Clean up

Complete the following steps to clean up your resources:

  1. Delete the IdP applications that you created to integrate with IAM Identity Center.
  2. Delete the IAM Identity Center configuration.
  3. Delete the Redshift application and the Amazon Redshift provisioned cluster or serverless instance that you created for testing.
  4. Delete the IAM role and IAM policies that you created in this post.
  5. Delete the permission set from IAM Identity Center that you created for the Amazon Redshift Query Editor in the management account.
  6. Delete the S3 bucket and associated S3 Access Grants instance.

Conclusion

In this post, we explored how to integrate Amazon Redshift with S3 Access Grants using IAM Identity Center. We established cross-account access to enable centralized user authentication through IAM Identity Center in the delegated administrator account, while keeping Amazon Redshift and Amazon S3 isolated by business unit in separate member accounts. We also showed simplified versions of running COPY and UNLOAD commands as a federated IAM Identity Center user without using an IAM role ARN. This setup creates a robust and secure analytics environment that streamlines data access for business users.



About the Authors

Maneesh Sharma is a Senior Database Engineer at AWS with more than a decade of experience designing and implementing large-scale data warehouse and analytics solutions. He collaborates with various Amazon Redshift Partners and customers to drive better integration.

Laura is an Identity Solutions Architect at AWS, where she thrives on helping customers overcome security and identity challenges. In her free time, she enjoys wreck diving and traveling around the world.

Praveen Kumar Ramakrishnan is a Senior Software Engineer at AWS. He has nearly 20 years of experience spanning various domains including filesystems, storage virtualization and network security. At AWS, he focuses on enhancing the Redshift data security.

Yanzhu Ji is a Product Manager in the Amazon Redshift team. She has experience in product vision and strategy in industry-leading data products and platforms. She has outstanding skill in building substantial software products using web development, system design, database, and distributed programming techniques. In her personal life, Yanzhu likes painting, photography, and playing tennis.

Securing Amazon S3 presigned URLs for serverless applications

Post Syndicated from Raaga N.G original https://aws.amazon.com/blogs/compute/securing-amazon-s3-presigned-urls-for-serverless-applications/

Modern serverless applications must be capable of seamlessly handling large file uploads. This post demonstrates how to use Amazon Simple Storage Service (Amazon S3) presigned URLs to let your users securely upload files to S3 without requiring explicit permissions in the AWS account. It focuses on the security ramifications of using S3 presigned URLs and explains mitigation steps that serverless developers can take to improve the security of their systems. It also walks through an AWS Lambda function that adheres to the provided recommendations. For more information on S3 presigned URLs, see Working with presigned URLs.

Presigned URL Workflow for Serverless Applications

The following architecture diagram illustrates a serverless application that generates an S3 presigned URL. By using S3 presigned URLs, serverless applications can offload to S3 the computation required to receive files. The diagram captures a seven-step process between the client, Amazon API Gateway, the Lambda function, and S3.

A typical workflow to upload a file to S3 through a serverless application includes the following steps:

  1. Client submits a request to upload a file.
  2. API Gateway receives the client request and invokes a Lambda function that then generates the S3 presigned URL.
  3. The Lambda function makes a getSignedUrl API call to S3.
  4. S3 returns the presigned URL for the object to be uploaded.
  5. The Lambda function returns a presigned URL to the API.
  6. Client receives the S3 presigned URL to upload the file.
  7. Client uploads the file directly to S3 using the presigned URL.

How to Secure Presigned URLs

When designing a serverless application that uses S3 presigned URLs to store data in S3, a developer must consider several primary security aspects. S3 presigned URLs are public resources that do not authenticate users, and anyone in possession of a valid S3 presigned URL can access the associated resource. Consequently, it is important to implement additional security measures to ensure that these URLs are not misused or accessed by unauthorized parties. The following sections describe techniques you can use to make your presigned URLs more secure.

1. Add a Content-MD5 checksum using the X-Amz-SignedHeaders parameter

When you upload an object to S3, you can include a precalculated checksum of the object as part of your request. S3 will perform an integrity check and verify if the object sent is the same as the object received. S3 supports the use of MD5 checksums to verify the integrity of objects uploaded. You provide the MD5 digest by including a Content-MD5 header in the initial PUT request. Upon receiving the object, S3 will calculate the MD5 digest and compare it with the one you originally provided. The upload operation succeeds only if both MD5 digests match, ensuring end-to-end data integrity. If an unintended party gets their hands on the S3 presigned URL, then they will not be able to use it without possessing the same object. This provides protection against arbitrary file uploads.

The key element for a developer to remember is that when the client uploads the file to the S3 presigned URL, it must supply the correct MD5 in Base64 using the Content-MD5 header. Developers can see a sample serverless application with client-side code to extract the MD5 digest, request an S3 presigned URL, and upload a file in this GitHub repository. This sample application uses NodeJS v20 in the Lambda function.
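
As a minimal sketch of the value the client must send, the following Python snippet computes the Base64-encoded MD5 digest for the Content-MD5 header. The function name content_md5 is illustrative and is not taken from the sample application, which computes the digest in browser JavaScript.

```python
import base64
import hashlib


def content_md5(data: bytes) -> str:
    """Return the Base64-encoded MD5 digest expected in a Content-MD5 header."""
    # The header carries the raw 16-byte digest in Base64, not the hex string.
    digest = hashlib.md5(data).digest()
    return base64.b64encode(digest).decode("ascii")


if __name__ == "__main__":
    # The same value must be used when requesting the presigned URL
    # and when sending the PUT request, or S3 rejects the upload.
    print(content_md5(b"hello"))
```

A common mistake is to send the hexadecimal digest instead; the header expects the Base64 encoding of the raw digest bytes.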

2. Expire the S3 presigned URLs 

An S3 presigned URL remains valid for the period of time specified when the URL is generated. It is important to ensure that the S3 presigned URL does not remain accessible for longer than required as it can be reused when still valid. You can define the expiration time of the S3 presigned URL by either passing X-Amz-Expires as a query parameter or by setting the expiresIn parameter when using the AWS SDK for JavaScript.

S3 validates the expiration time and date at the time of the initial HTTP request. However, to support situations where the connection drops and the client needs to restart uploading a file, you may want your S3 presigned URL to remain valid for the entire anticipated time needed to upload the file to S3. The challenge is to generate an S3 presigned URL that is valid long enough to accommodate the file’s upload, yet still short enough that you prevent reuse.

A solution we propose to overcome these challenges is to dynamically set the S3 presigned URL’s expiration time by using the browser Network Information API. Using this new API, when the client browser places the initial request for an S3 presigned URL, the client also transmits the file’s size and the network type, so the Lambda function can calculate the anticipated transfer time.

Within the Lambda function, we can now estimate the transfer time for this size of file on this type of network, using sample code as featured in this GitHub repository.

With the estimated transfer time calculated, the Lambda function can now request the S3 presigned URL and set the expiresIn parameter to the transfer time, resulting in an S3 presigned URL that is only available for the time needed to upload that size of file on this type of network.
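
The calculation described above can be sketched roughly as follows. This is not the code from the GitHub repository. It assumes the client reports the Network Information API effectiveType value, and the throughput table, safety factor, and minimum and maximum bounds are hypothetical numbers chosen for illustration.

```python
import math

# Assumed conservative throughput per effectiveType, in bits per second.
# These figures are illustrative placeholders, not measured values.
ASSUMED_THROUGHPUT_BPS = {
    "slow-2g": 50_000,
    "2g": 70_000,
    "3g": 700_000,
    "4g": 5_000_000,
}

MIN_EXPIRY_SECONDS = 60     # hypothetical floor so small files still get a usable URL
MAX_EXPIRY_SECONDS = 3600   # hypothetical ceiling on URL lifetime
SAFETY_FACTOR = 1.5         # hypothetical headroom for retries and slow starts


def estimate_expiry_seconds(file_size_bytes: int, effective_type: str) -> int:
    """Estimate a presigned URL lifetime from file size and network type."""
    if file_size_bytes < 0:
        raise ValueError("file_size_bytes must not be negative")
    # Unknown network types fall back to the slowest assumed rate.
    slowest = min(ASSUMED_THROUGHPUT_BPS.values())
    throughput = ASSUMED_THROUGHPUT_BPS.get(effective_type, slowest)
    transfer_seconds = file_size_bytes * 8 / throughput
    padded = math.ceil(transfer_seconds * SAFETY_FACTOR)
    # Clamp the result so the URL is neither unusably short nor overly long.
    return max(MIN_EXPIRY_SECONDS, min(MAX_EXPIRY_SECONDS, padded))
```

The returned value would then be passed as the expiration when the URL is generated. Because the file size and network type come from the client, treat them as untrusted input; the upper bound keeps a misreported value from producing a long-lived URL.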

If you are using the AWS SDK, you may also be using AWS Signature Version 4 (SigV4) to sign your requests. To create a defense in depth approach, which will place a ceiling on total expiration time, you can utilize condition keys in bucket policies. For an example policy, see Limiting presigned URL capabilities.

3. Generating a UUID to replace the uploaded filename

When an application allows a user to upload files, the application exposes itself to various security threats, such as path traversal attacks. Path traversal vulnerabilities allow attackers to access files that are not meant to be accessed or to overwrite files outside the intended directory structure. In order to secure your applications against such vulnerabilities, the most effective approach is to incorporate user input validation and sanitization. You can sanitize the filename by replacing it with a generated UUID (Universally Unique Identifier).

You can see an example function in the server-side code for Lambda in this GitHub repository.
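
A minimal sketch of the idea, independent of the repository code: discard the client-supplied name entirely and keep at most a file extension from an allow list. The function name build_object_key, the uploads/ prefix, and the extension allow list are assumptions for illustration.

```python
import uuid
from pathlib import PurePosixPath

# Hypothetical allow list; adjust to the file types your application accepts.
ALLOWED_EXTENSIONS = {".jpg", ".png", ".pdf", ".txt"}


def build_object_key(original_filename: str, prefix: str = "uploads/") -> str:
    """Build an S3 object key that contains no user-controlled path segments."""
    # Normalize Windows-style separators, then keep only the extension.
    normalized = original_filename.replace("\\", "/")
    suffix = PurePosixPath(normalized).suffix.lower()
    if suffix not in ALLOWED_EXTENSIONS:
        suffix = ""
    # The UUID replaces the user-supplied name, so sequences such as "../"
    # never reach the object key.
    return f"{prefix}{uuid.uuid4()}{suffix}"
```

If the original filename matters to users, it can be stored separately (for example, as object metadata or in a database record) rather than in the key itself.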

4. Applying the Principle of Least Privilege and using a separate Lambda function to create S3 presigned URLs

The capabilities of an S3 presigned URL are constrained by the permissions of the principal that created it. To offer fine-grained access, the very first step in limiting use of an S3 presigned URL should be building a specific Lambda function that generates these URLs. By having a Lambda function dedicated to this purpose, you do not risk an overly permissive Lambda function. The second step is to limit your specific Lambda function’s access to S3.

Adhering to the Principle of Least Privilege, it’s important to restrict the Lambda function’s permissions to only the required prefixes in the bucket and allow it to perform only the required actions on the bucket, instead of granting full bucket access. This minimizes the potential attack surface and mitigates the risk of unintended data exposure or modification. It is important to limit the permissions to the minimum required set of actions and resources.

This example AWS Identity and Access Management (IAM) policy demonstrates how to grant the Lambda function read access (GET) to objects within the "Example-Prefix" prefix of a specific S3 bucket. The IAM policy is attached to the Lambda function via an execution role, which together establish what actions the Lambda function can perform.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadStatement",
      "Action": [
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::EXAMPLE-BUCKET/Example-Prefix/",
        "arn:aws:s3:::EXAMPLE-BUCKET/Example-Prefix/*"
      ],
      "Effect": "Allow"
    }
  ]
}

This example IAM policy demonstrates how to grant the Lambda function permissions to upload (PUT) objects within the "Example-Prefix" prefix of a specific S3 bucket.

{   
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "UploadStatement",
            "Action": [
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::EXAMPLE-BUCKET/Example-Prefix/",
                "arn:aws:s3:::EXAMPLE-BUCKET/Example-Prefix/*"
            ],
            "Effect": "Allow"
        }
    ]
}

This approach will ensure that your Lambda function possesses the minimum required permissions to perform its intended tasks and reduces the risk of unintended data access or modification.

If you want to restrict the use of S3 presigned URLs and all S3 access to a particular network path, you can also define a network-path restriction policy on the S3 bucket. This restriction requires that all requests to the bucket originate from a specified network. According to AWS Prescriptive Guidance, an extension of least privilege is to maintain a data perimeter that’s consistent with your organization’s needs. The goal of a data perimeter is to ensure that access is allowed only if the request comes from a trusted entity, for trusted resources, from a trusted network. These data perimeters apply to S3 presigned URLs as well.

5. Creating one-time use S3 presigned URLs

Serverless applications developers may want each S3 presigned URL to only be used once. Developers can incorporate a token-based mechanism to facilitate secure one-time use of an S3 presigned URL. This involves generating unique tokens for each authorized user or client and associating these tokens with the S3 presigned URLs. When a client attempts to access the resource using the S3 presigned URL, they must provide the corresponding token for validation. This additional layer of security ensures that only authorized entities can access the S3 presigned URLs and the associated resources. Furthermore, you can leverage a database to track the issued tokens and expire them after each use. A solution to implement such a mechanism has been discussed in detail in How to securely transfer files with presigned URLs.
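
The token lifecycle described above can be sketched as follows. This is an in-memory stand-in for the database mentioned in the text, not the solution from the referenced post; the class name OneTimeTokenStore and the default lifetime are hypothetical.

```python
import secrets
import time


class OneTimeTokenStore:
    """In-memory stand-in for the database that tracks issued tokens."""

    def __init__(self, ttl_seconds: int = 300) -> None:
        self._ttl = ttl_seconds
        # Maps token -> (object key it was issued for, expiry timestamp).
        self._tokens: dict[str, tuple[str, float]] = {}

    def issue(self, object_key: str) -> str:
        """Create a random token bound to one object key."""
        token = secrets.token_urlsafe(32)
        self._tokens[token] = (object_key, time.monotonic() + self._ttl)
        return token

    def redeem(self, token: str, object_key: str) -> bool:
        """Validate a token and consume it so it cannot be used again."""
        # pop() removes the token whether or not validation succeeds,
        # so a token can never be presented twice.
        entry = self._tokens.pop(token, None)
        if entry is None:
            return False
        stored_key, expires_at = entry
        return stored_key == object_key and time.monotonic() < expires_at
```

In a deployed system, the lookup and removal would need to be a single atomic operation in a shared data store, because Lambda invocations do not share memory.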

Cleaning up

You may clean up the sample application by deleting the API Gateway, Lambda function, and S3 bucket. In addition, please do not forget to delete any IAM execution roles you created for the Lambda function.

Conclusion

In this blog we have discussed various considerations that a developer must make when designing an application that leverages S3 presigned URLs. By incorporating robust security measures, such as proper access control, input sanitization, expiration handling and integrity checks, developers can mitigate potential risks when using S3 presigned URLs.

AWS Weekly Roundup: Amazon Nova Premier, Amazon Q Developer, Amazon Q CLI, Amazon CloudFront, AWS Outposts, and more (May 5, 2025)

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-amazon-nova-premier-amazon-q-developer-amazon-q-cli-amazon-cloudfront-aws-outposts-and-more-may-5-2025/

Last week I went to Thailand to attend the AWS Summit Bangkok. It was an energizing and exciting event. We hosted the Developer Lounge, where developers could meet, discuss ideas, enjoy lightning talks, win swag at the AWS Builder ID Prize Wheel, take on the Amazon Q Developer Coding Challenge, or learn about generative AI at the Learn Amazon Bedrock booth.


Thank you to AWS Heroes, AWS Community Builders, AWS User Group leaders and developers for your collaboration.

Coming up next in ASEAN is AWS Summit Singapore—make sure you don’t miss it by registering now.

Last Week’s Launches
Here are some launches last week that caught my attention:

  • Amazon Nova Premier Now Generally Available — Amazon Nova Premier, our most capable model for complex tasks and teacher for model distillation, is now generally available in Amazon Bedrock. It excels at complex tasks requiring deep context understanding and multistep planning, while processing text, images, and videos with a 1M token context length. With Nova Premier and Amazon Bedrock Model Distillation, you can create highly capable, cost-effective, and low-latency versions of Nova Pro, Lite, and Micro, for your specific needs.

  • Amazon Q Developer elevates the IDE experience with new agentic coding experience — This new interactive, agentic coding experience for Visual Studio Code allows Q Developer to intelligently take actions on behalf of the developer. Amazon Q Developer introduces an interactive coding experience in Visual Studio Code, offering real-time collaboration for coding, documentation, and testing. It provides transparent reasoning, and supports automated or step-by-step changes in multiple languages.

  • New Foundation Models in Amazon Bedrock — Amazon Bedrock expands its model offerings with two significant additions:
    • Writer’s Palmyra X5 and X4 models feature extensive context windows (1M and 128K tokens respectively) and excel in complex reasoning for enterprise applications. They support multistep tool-calling and adaptive thinking with high reliability standards.
    • Meta’s Llama 4 Scout 17B and Maverick 17B models offer natively multimodal capabilities using mixture-of-experts architecture for enhanced reasoning and image understanding. They support multiple languages and extended context processing, with simplified integration through the Bedrock Converse API.
  • Second-Generation AWS Outposts Racks Released – AWS announces the general availability of second-generation Outposts racks with significant enhancements including the latest x86 EC2 instances, simplified networking, and accelerated networking options. These improvements deliver doubled vCPU, memory, and network bandwidth, 40% better performance, and support for ultra-low latency workloads, making them ideal for demanding on-premises deployments.

  • Amazon CloudFront SaaS Manager Launches — Amazon CloudFront SaaS Manager helps SaaS providers and web hosting platforms efficiently manage content delivery across multiple customer domains. The service dramatically reduces operational complexity while providing high-performance content delivery and enterprise-grade security for every customer domain.

  • Amazon Aurora Now Supports PostgreSQL 17 — Amazon Aurora now supports PostgreSQL 17.4, offering community improvements and Aurora-specific enhancements like optimized memory management and faster failovers. The release includes new features for Babelfish, security fixes, and updated extensions, available in all AWS Regions.
  • CloudWatch Introduces Tiered Pricing for Lambda Logs — Amazon CloudWatch launches tiered pricing for AWS Lambda logs and new delivery destinations. Pricing in US East starts at $0.50/GB for CloudWatch and $0.25/GB for S3 and Firehose, both tiering down to $0.05/GB. This update enhances flexibility in log management across all supporting Regions.
  • RDS for MySQL Updates Minor Versions – Amazon RDS for MySQL now supports minor versions 8.0.42 and 8.4.5, delivering security fixes, bug fixes, and performance improvements. Users can upgrade automatically during maintenance windows or use Blue/Green deployments for safer updates.
  • Amazon Bedrock Model Distillation Generally Available – Amazon Bedrock Model Distillation is now generally available, supporting new models like Amazon Nova and Claude 3.5. It enables smaller models to accurately predict function calling for Agents, delivering up to 500% faster responses and 75% lower costs with minimal accuracy loss for RAG use cases. The service includes automated workflows for data synthesis and student model training.
  • AI Search Flow Builder for Amazon OpenSearch Service – Amazon OpenSearch Service now offers an AI search flow builder for OpenSearch 2.19+ domains. This low-code designer enables creation of sophisticated AI-enhanced search flows using AWS and third-party services, supporting use cases like RAG, query rewriting, and semantic encoding.


Upcoming AWS events
Check your calendars and sign up for these upcoming AWS events:

  • AWS Summit – Join free online and in-person events that bring the cloud computing community together to connect, collaborate, and learn about AWS. Register in your nearest city: Poland (May 6), Bengaluru (May 7–8), Hong Kong (May 8), Seoul (May 14–15), Singapore (May 29), and Sydney (June 4–5).
  • AWS re:Inforce – Mark your calendars for AWS re:Inforce (June 16–18) in Philadelphia, PA. AWS re:Inforce is a learning conference focused on AWS security solutions, cloud security, compliance, and identity. You can subscribe for event updates now!
  • AWS Partners Events – You’ll find a variety of AWS Partner events that will inspire and educate you, whether you are just getting started on your cloud journey or you are looking to solve new business challenges.
  • AWS Community Days – Join community-led conferences that feature technical discussions, workshops, and hands-on labs led by expert AWS users and industry leaders from around the world: Yerevan, Armenia (May 24), Zurich, Switzerland (May 25), and Bengaluru, India (May 25).

You can browse all upcoming in-person and virtual events.

That’s all for this week. Check back next Monday for another Weekly Roundup!



How to use AWS Transfer Family and GuardDuty for malware protection

Post Syndicated from James Abbott original https://aws.amazon.com/blogs/security/how-to-use-aws-transfer-family-and-guardduty-for-malware-protection/

Organizations often need to securely share files with external parties over the internet. Allowing public access to a file transfer server exposes the organization to potential threats, such as malware-infected files uploaded by threat actors or inadvertently by genuine users. To mitigate this risk, companies can take steps to help make sure that files received through public channels are scanned for malware before processing.

This post demonstrates how to use AWS Transfer Family and Amazon GuardDuty to scan files uploaded through an SFTP (SSH File Transfer Protocol) server for malware as part of an overall transfer workflow. For readers who have read an earlier blog post on this topic, the key difference is that this solution is fully managed and doesn’t require the deployment of compute resources. Instead of using a container image for scanning, the solution relies on GuardDuty, which automatically updates malware signatures every 15 minutes, avoiding the need for manual patching to keep the signatures up to date.

Prerequisites

To deploy the solution in this post, you will need:

  • An AWS account: You need access to AWS to deploy this solution. If you don’t have an account that you can use, see Start building on AWS today.
  • AWS CLI: Install and configure the AWS Command Line Interface (AWS CLI) to be authenticated to your AWS account. Set up the environment variables for your AWS account using the access key ID and secret access key for your environment.
  • Git: You will use Git to pull down the example code from GitHub.
  • Terraform: You’ll use Terraform to run the automation. Follow the Terraform installation instructions to download and set up Terraform.

Solution overview

This solution uses Transfer Family and GuardDuty. Transfer Family provides a secure file transfer service that you can use to set up an SFTP server, and GuardDuty is an intelligent threat detection service. GuardDuty monitors for malicious activity and anomalous behavior to protect AWS accounts, workloads, and data. At a high level, the solution uses the following steps:

  • A user uploads a file through a Transfer Family SFTP server.
  • A Transfer Family managed workflow invokes AWS Lambda to execute an AWS Step Functions workflow.
    • The workflow begins only after a successful file upload.
    • Partial uploads to the SFTP server will invoke an error handling Lambda function to report a partial upload error.
  • A Step Functions state machine invokes a Lambda function to move uploaded files to an Amazon Simple Storage Service (Amazon S3) bucket for processing and then starts scanning using GuardDuty.
  • The GuardDuty scan result is sent as a callback to the state machine.
  • Infected files are moved or cleaned.
  • The workflow sends the user the results through an Amazon Simple Notification Service (Amazon SNS) topic. This can be a notification of an error or malicious upload during the scan or notification of a successful upload and a clean scan for further processing.
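
The decision taken once the scan result arrives can be sketched as a small routing function. This is an illustration, not the code deployed by the solution. The event field names (detail.scanResultDetails.scanResultStatus and detail.s3ObjectDetails) reflect my understanding of the GuardDuty Malware Protection for S3 scan result event and should be checked against the GuardDuty documentation; the bucket names and the function name route_scan_result are hypothetical.

```python
CLEAN_BUCKET = "example-clean-bucket"            # hypothetical bucket name
QUARANTINE_BUCKET = "example-quarantine-bucket"  # hypothetical bucket name


def route_scan_result(event: dict) -> dict:
    """Decide where a scanned object should go and which topic to notify."""
    detail = event.get("detail", {})
    status = detail.get("scanResultDetails", {}).get("scanResultStatus")
    s3_object = detail.get("s3ObjectDetails", {})

    if status == "NO_THREATS_FOUND":
        destination, topic = CLEAN_BUCKET, "success"
    else:
        # Treat detected threats and any inconclusive status as unsafe.
        destination, topic = QUARANTINE_BUCKET, "error"

    return {
        "source_bucket": s3_object.get("bucketName"),
        "object_key": s3_object.get("objectKey"),
        "destination_bucket": destination,
        "notify_topic": topic,
        "scan_status": status,
    }
```

Defaulting to quarantine for anything other than an explicit clean result is a design choice in this sketch: a failed or unsupported scan is handled as unsafe rather than silently passed through.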

Solution architecture and walkthrough

The solution uses GuardDuty Malware Protection for S3 to scan newly uploaded objects to the S3 bucket. You can use this feature of GuardDuty to set up a malware protection plan for an S3 bucket at the bucket level or to watch for specific object prefixes.

Figure 1: Solution architecture

The following steps (shown in Figure 1) describe the workflow for this solution starting from the point the file is uploaded until it’s scanned and marked as safe or as infected, leading to subsequent steps that can be customized based on your use case.

  1. A file is uploaded using the SFTP protocol through Transfer Family.
  2. If the file is successfully uploaded, Transfer Family uploads the file to the S3 bucket called Unscanned and the Managed Workflow Complete workflow is triggered. This is the workflow used to handle successful uploads and invokes the Step Function Invoker Lambda function.
  3. The Step Function Invoker starts the state machine and kicks off the first step in the process by invoking the GuardDuty – Scan Lambda function.
  4. The GuardDuty – Scan function moves the file to the Processing bucket. This is the bucket from which the files will be scanned.
  5. When an object upload activity is detected, GuardDuty automatically scans the object. In this implementation, a malware protection plan is created for the Processing bucket.
  6. When a scan completes, GuardDuty publishes the scan result to Amazon EventBridge.
  7. An EventBridge rule has been created to invoke a Lambda Callback function whenever a scan event has completed. EventBridge will invoke the function with an event that contains the scan results. See Monitoring S3 object scans with Amazon EventBridge for an example.
  8. The Lambda Callback function notifies the GuardDuty – Scan task using the callback task integration pattern. The results of the GuardDuty scan are returned to the GuardDuty – Scan function and these results are passed to the Move File task.
  9. If the result is a clean scan with no threats detected, the Move File task will place the file in the Clean S3 bucket, indicating that the file is successfully scanned and safe for further processing.
  10. At this point, the Move File function publishes a notification to the Success SNS topic to notify the subscribers.
  11. If the result indicates that the file is malicious, the Move File function will instead move the file to the Quarantine S3 bucket for further investigation. The function will also delete the file from the Processing bucket and publish a notification in the Error topic in SNS to notify the user of a potential malicious file being uploaded.
  12. If the file upload is unsuccessful and the file isn’t fully uploaded, then Transfer Family will trigger the Managed Workflow Partial workflow.
  13. Managed Workflow Partial is an error handling workflow and invokes the Error Publisher function, which is used for reporting errors that occur anywhere in the workflow.
  14. The Error Publisher function identifies the type of error—whether it’s because of the partial upload or an issue elsewhere in the workflow—and sets the error status accordingly. It will then publish an error message to Error Topic in SNS.
  15. The GuardDuty – Scan task has a timeout to make sure that an event is published to Error Topic to prompt a manual intervention to investigate further if the file isn’t successfully scanned. If the GuardDuty – Scan task fails, the Error clean up Lambda function is invoked.
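Steps 7 through 11 can be sketched in Python, the language the sample Lambda functions use. This is a minimal sketch under stated assumptions, not the repository's code: the bucket and topic names are hypothetical, the task token is assumed to have been saved by the GuardDuty – Scan task where the callback function can find it, and the event fields follow the published shape of GuardDuty Malware Protection for S3 scan result events (`scanResultDetails.scanResultStatus` and `s3ObjectDetails.objectKey`). The Step Functions client is passed in as a parameter so the logic can be exercised without AWS access.

```python
import json

# Hypothetical names; the sample repository defines its own in locals.tf.
CLEAN_BUCKET = "example-clean"
QUARANTINE_BUCKET = "example-quarantine"
SUCCESS_TOPIC = "example-success-topic"
ERROR_TOPIC = "example-error-topic"


def route_scan_result(event):
    """Map a GuardDuty scan result event to a destination bucket and topic."""
    detail = event["detail"]
    status = detail["scanResultDetails"]["scanResultStatus"]
    key = detail["s3ObjectDetails"]["objectKey"]
    if status == "NO_THREATS_FOUND":
        return {"key": key, "bucket": CLEAN_BUCKET, "topic": SUCCESS_TOPIC}
    if status == "THREATS_FOUND":
        return {"key": key, "bucket": QUARANTINE_BUCKET, "topic": ERROR_TOPIC}
    # Any other status (for example UNSUPPORTED or FAILED) is treated as
    # "not scanned": no move, and an error notification.
    return {"key": key, "bucket": None, "topic": ERROR_TOPIC}


def notify_state_machine(sfn_client, task_token, event):
    """Resume the paused task using the callback task integration pattern."""
    decision = route_scan_result(event)
    if decision["bucket"] is None:
        sfn_client.send_task_failure(
            taskToken=task_token,
            error="ScanNotCompleted",  # hypothetical error name
            cause=json.dumps(event["detail"]["scanResultDetails"]),
        )
    else:
        sfn_client.send_task_success(
            taskToken=task_token,
            output=json.dumps(decision),
        )
    return decision
```

In a Lambda handler, `sfn_client` would be `boto3.client("stepfunctions")`. Sending a task failure for incomplete scans is one way to reach the error clean-up path described in step 15; the sample implementation may handle this differently.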

Finally, there’s an S3 Lifecycle policy attached to the Processing bucket. This is to make sure that no file is left in the Processing bucket for more than one day.
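The sample repository defines this policy in Terraform. For readers who prefer to see the shape of the rule itself, the following Python sketch builds an equivalent lifecycle configuration; the rule ID is a hypothetical name, and the S3 client is passed in rather than created here.

```python
def one_day_expiration_rule(rule_id="expire-unprocessed-objects"):
    """Build a lifecycle configuration that expires every object after one day."""
    return {
        "Rules": [
            {
                "ID": rule_id,              # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": ""},   # empty prefix matches all objects
                "Expiration": {"Days": 1},
            }
        ]
    }


def apply_lifecycle(s3_client, bucket_name):
    """Attach the one-day expiration rule to the given bucket."""
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket_name,
        LifecycleConfiguration=one_day_expiration_rule(),
    )
```

With `boto3`, the call would look like `apply_lifecycle(boto3.client("s3"), "<your-processing-bucket>")`. Lifecycle expiration runs asynchronously, so an object can remain visible for some time after it becomes eligible.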

Code repository

The GitHub AWS-samples repository has a sample implementation developed using Terraform and Python-based Lambda functions to implement this solution. The same solution can also be implemented using AWS CloudFormation. The code has the components needed to deploy the entire workflow to demonstrate the abilities of Transfer Family and the GuardDuty malware protection plan.

Install the solution

Use the following steps to deploy this solution to your test environment.

  1. Clone the repository to your working directory using Git.
  2. Navigate to the root directory of your cloned project directory.
  3. Update the terraform locals.tf file with the values of your choice for the S3 bucket names, SFTP server names, and other variables.
  4. Run terraform plan.
  5. If everything looks good, run a terraform apply and enter yes to create the resources.

Clean up

After testing and exploring the solution, it’s important to clean up the resources you created to avoid incurring unnecessary costs. To delete the resources created by this solution, navigate to the root directory of your cloned project and run the following command:

terraform destroy

This command will remove the resources created by Terraform, including the SFTP server, S3 buckets, Lambda functions, and other components. Confirm the deletion by entering yes when prompted.

Conclusion

By using the approach outlined in the post, you can make sure that the files received over SFTP and uploaded to your S3 bucket are scanned for threats and are safe for further processing. The solution reduces the exposure surface by making sure that public uploads are scanned in a safe environment before they’re sent to other components of your system.

If you have feedback about this post, submit comments in the Comments section below.

James Abbott

James is a Principal Solutions Architect at AWS, working in Global Financial Services. When not in the office he enjoys mountain biking in North Carolina.

Santhosh Srinivasan

Santhosh is a Sr. Cloud Application Architect with the Professional Services team at AWS. He specializes in building and modernizing large scale enterprise applications in the cloud with a focus on the financial services industry.

Suhas Pasricha

Suhas is a Cloud Infrastructure Architect in the AWS Professional Services team. He has a background in web development and infrastructure automation. At Amazon, he has been helping customers set up and operate an enterprise-wide landing zone and cloud environment. In his spare time, he likes to read and play video games.

Accelerate your analytics with Amazon S3 Tables and Amazon SageMaker Lakehouse

Post Syndicated from Sandeep Adwankar original https://aws.amazon.com/blogs/big-data/accelerate-your-analytics-with-amazon-s3-tables-and-amazon-sagemaker-lakehouse/

Amazon SageMaker Lakehouse is a unified, open, and secure data lakehouse that now seamlessly integrates with Amazon S3 Tables, the first cloud object store with built-in Apache Iceberg support. With this integration, SageMaker Lakehouse provides unified access to S3 Tables, general purpose Amazon S3 buckets, Amazon Redshift data warehouses, and data sources such as Amazon DynamoDB or PostgreSQL. You can then query, analyze, and join the data using Redshift, Amazon Athena, Amazon EMR, and AWS Glue. In addition to your familiar AWS services, you can access and query your data in place with your choice of Iceberg-compatible tools and engines, giving you the flexibility to use SQL or Spark-based tools and collaborate on this data the way you like. You can secure and centrally manage your data in the lakehouse by defining fine-grained permissions with AWS Lake Formation that are consistently applied across all analytics and machine learning (ML) tools and engines.

Organizations are becoming increasingly data driven, and as data becomes a differentiator in business, organizations need faster access to all their data in all locations, using preferred engines to support rapidly expanding analytics and AI/ML use cases. Let’s take an example of a retail company that started by storing their customer sales and churn data in their data warehouse for business intelligence reports. With massive growth in business, they need to manage a variety of data sources as well as exponential growth in data volume. The company builds a data lake using Apache Iceberg to store new data such as customer reviews and social media interactions.

This enables them to cater to their end customers with new personalized marketing campaigns and understand their impact on sales and churn. However, data distributed across data lakes and warehouses limits their ability to move quickly, because it may require them to set up specialized connectors, manage multiple access policies, and often resort to copying data, which can increase the cost of both managing the separate datasets and storing redundant data. SageMaker Lakehouse addresses these challenges by providing secure and centralized management of data in data lakes, data warehouses, and data sources such as MySQL and SQL Server by defining fine-grained permissions that are consistently applied across data in all analytics engines.

In this post, we show you how to use various analytics services through the integration of SageMaker Lakehouse with S3 Tables. We begin by enabling integration of S3 Tables with AWS analytics services. We create S3 Tables and Redshift tables and populate them with data. We then set up SageMaker Unified Studio by creating a company-specific domain, a new project with users, and fine-grained permissions. This lets us unify data lakes and data warehouses and use them with analytics services such as Athena, Redshift, Glue, and EMR.

Solution overview

To illustrate the solution, we consider a fictional company called Example Retail Corp. Example Retail’s leadership is interested in understanding customer and business insights across thousands of customer touchpoints for millions of their customers that will help them build sales, marketing, and investment plans. Leadership wants to conduct an analysis across all their data to identify at-risk customers, understand the impact of personalized marketing campaigns on customer churn, and develop targeted retention and sales strategies.

Alice is a data administrator in Example Retail Corp who has embarked on an initiative to consolidate customer information from multiple touchpoints, including social media, sales, and support requests. She decides to use S3 Tables with Iceberg transactional capability to achieve scalability as updates are streamed across billions of customer interactions, while providing the same durability, availability, and performance characteristics that S3 is known for. Alice has already built a large warehouse with Redshift, which contains historical and current data about sales, customer prospects, and churn information.

Alice supports an extended team of developers, engineers, and data scientists who require access to the data environment to develop business insights, dashboards, ML models, and knowledge bases. This team includes:

Bob, a data analyst who needs access to S3 Tables and warehouse data to automate the daily reports sent to leadership on customer interaction growth and churn across various customer touchpoints.

Charlie, a business intelligence (BI) analyst who is tasked with building interactive dashboards for the funnel of customer prospects and their conversions across multiple touchpoints and making those available to thousands of Sales team members.

Doug, a data engineer responsible for building ML forecasting models for sales growth using the pipeline and/or customer conversions across multiple touchpoints and making those available to finance and planning teams.

Alice decides to use SageMaker Lakehouse to unify data across S3 Tables and Redshift data warehouse. Bob is excited about this decision as he can now build daily reports using his expertise with Athena. Charlie now knows that he can quickly build Amazon QuickSight dashboards with queries that are optimized using Redshift’s cost-based optimizer. Doug, being an open source Apache Spark contributor, is excited that he can build Spark based processing with AWS Glue or Amazon EMR to build ML forecasting models.

The following diagram illustrates the solution architecture.

Implementing this solution consists of the following high-level steps. For Example Retail, Alice, as the data administrator, performs these steps:

  1. Create a table bucket. S3 Tables stores Apache Iceberg tables as S3 resources, and customer details are managed in S3 Tables. You can then enable integration with AWS analytics services, which automatically sets up the SageMaker Lakehouse integration so that the table bucket is shown as a child catalog under the federated s3tablescatalog in the AWS Glue Data Catalog and is registered with AWS Lake Formation for access control. Next, you create a table namespace (or database), which is a logical construct that you group tables under, and create a table using the Athena SQL CREATE TABLE statement.
  2. Publish your data warehouse to Glue Data Catalog. Churn data is managed in a Redshift data warehouse, which is published to the Data Catalog as a federated catalog and is available in SageMaker Lakehouse.
  3. Create a SageMaker Unified Studio project. SageMaker Unified Studio integrates with SageMaker Lakehouse and simplifies analytics and AI with a unified experience. Start by creating a domain and adding all users (Bob, Charlie, Doug). Then create a project in the domain, choosing a project profile that provisions various resources and the project AWS Identity and Access Management (IAM) role that manages resource access. Alice adds Bob, Charlie, and Doug to the project as members.
  4. Onboard S3 Tables and Redshift tables to SageMaker Unified Studio. To onboard the S3 Tables to the project, in Lake Formation, you grant permission on the resource to the SageMaker Unified Studio project role. This enables the catalog to be discoverable within the lakehouse data explorer for users (Bob, Charlie, and Doug) to start querying tables. SageMaker Lakehouse resources can now be accessed from compute engines such as Athena, Redshift, and Apache Spark-based engines such as Glue to derive churn analysis insights, with Lake Formation managing the data permissions.

Prerequisites

To follow the steps in this post, you must complete the following prerequisites:

  1. AWS account with access to the following AWS services:
    • Amazon S3 including S3 Tables
    • Amazon Redshift
    • AWS Identity and Access Management (IAM)
    • Amazon SageMaker Unified Studio
    • AWS Lake Formation and AWS Glue Data Catalog
    • AWS Glue
  2. Create a user with administrative access.
  3. Have access to an IAM role that is a Lake Formation data lake administrator. For instructions, refer to Create a data lake administrator.
  4. Enable AWS IAM Identity Center in the same AWS Region where you want to create your SageMaker Unified Studio domain. Set up your identity provider (IdP) and synchronize identities and groups with AWS IAM Identity Center. For more information, refer to IAM Identity Center Identity source tutorials.
  5. Create a read-only administrator role to discover the Amazon Redshift federated catalogs in the Data Catalog. For instructions, refer to Prerequisites for managing Amazon Redshift namespaces in the AWS Glue Data Catalog.
  6. Create an IAM role named DataTransferRole. For instructions, refer to Prerequisites for managing Amazon Redshift namespaces in the AWS Glue Data Catalog.
  7. Create an Amazon Redshift Serverless namespace called churnwg. For more information, see Get started with Amazon Redshift Serverless data warehouses.

Create a table bucket and enable integration with analytics services

Alice completes the following steps to create the S3 table bucket for the new data she plans to add or import into S3 Tables.

Follow these steps to create a table bucket and enable integration with SageMaker Lakehouse:

  1. Sign in to the S3 console as the user created in prerequisite step 2.
  2. Choose Table buckets in the navigation pane and choose Enable integration.
  3. Choose Table buckets in the navigation pane and choose Create table bucket.
  4. For Table bucket name, enter a name such as blog-customer-bucket.
  5. Choose Create table bucket.
  6. Choose Create table with Athena.
  7. Select Create a namespace and provide a namespace (for example, customernamespace).
  8. Choose Create namespace.
  9. Choose Create table with Athena.
  10. On the Athena console, run the following SQL script to create a table:
    CREATE TABLE customer (
      `c_salutation` string, 
      `c_preferred_cust_flag` string, 
      `c_first_sales_date_sk` int, 
      `c_customer_sk` int, 
      `c_login` string, 
      `c_current_cdemo_sk` int, 
      `c_first_name` string, 
      `c_current_hdemo_sk` int, 
      `c_current_addr_sk` int, 
      `c_last_name` string, 
      `c_customer_id` string, 
      `c_last_review_date_sk` int, 
      `c_birth_month` int, 
      `c_birth_country` string, 
      `c_birth_year` int, 
      `c_birth_day` int, 
      `c_first_shipto_date_sk` int, 
      `c_email_address` string)
      TBLPROPERTIES ('table_type' = 'iceberg');

    INSERT INTO customer VALUES
    ('Dr.','N',2452077,13251813,'Y',1381546,'Joyce',2645,2255449,'Deaton','AAAAAAAAFOEDKMAA',2452543,1,'GREECE',1987,29,2250667,'[email protected]'),
    ('Dr.','N',2450637,12755125,'Y',1581546,'Daniel',9745,4922716,'Dow','AAAAAAAAFLAKCMAA',2432545,1,'INDIA',1952,3,2450667,'[email protected]'),
    ('Dr.','N',2452342,26009249,'Y',1581536,'Marie',8734,1331639,'Lange','AAAAAAAABKONMIBA',2455549,1,'CANADA',1934,5,2472372,'[email protected]'),
    ('Dr.','N',2452342,3270685,'Y',1827661,'Wesley',1548,11108235,'Harris','AAAAAAAANBIOBDAA',2452548,1,'ROME',1986,13,2450667,'[email protected]'),
    ('Dr.','N',2452342,29033279,'Y',1581536,'Alexandar',8262,8059919,'Salyer','AAAAAAAAPDDALLBA',2952543,1,'SWISS',1980,6,2650667,'[email protected]'),
    ('Miss','N',2452342,6520539,'Y',3581536,'Jerry',1874,36370,'Tracy','AAAAAAAALNOHDGAA',2452385,1,'ITALY',1957,8,2450667,'[email protected]');

This is just an example of adding a few rows to the table, but generally for production use cases, customers use engines such as Spark to add data to the table.

The S3 Tables table customer is now created, populated with data, and integrated with SageMaker Lakehouse.

Set up Redshift tables and publish to the Data Catalog

Alice completes the following steps to publish the data in Redshift to the Data Catalog. We also demonstrate how the Redshift table is created and populated, although in Alice’s case the Redshift table already exists with all the historic data on sales revenue.

  1. Sign in to the Redshift endpoint churnwg as an admin user.
  2. Run the following script to create a table under the dev database under the public schema:
    CREATE TABLE customer_churn (
    customer_id BIGINT,
    tenure INT,
    monthly_charges DECIMAL(5,1),
    total_charges DECIMAL(5,1),
    contract_type VARCHAR(100),
    payment_method VARCHAR(100),
    internet_service VARCHAR(100),
    has_phone_service BOOLEAN,
    is_churned BOOLEAN
    );
    
    INSERT INTO customer_churn VALUES
    (10251783, 12, 70.5, 850.0, 'Month-to-Month', 'Credit Card', 'Fiber Optic', true, true),
    (13251813, 36, 55.0, 1980.0, 'One Year', 'Bank Transfer', 'DSL', true, false),
    (12755125, 6, 90.0, 540.0, 'Month-to-Month', 'Mailed Check', 'Fiber Optic', false, true),
    (26009249, 12, 70.5, 850.0, 'One Year', 'Credit Card', 'DSL', true, false),
    (3270685, 36, 55.0, 1980.0, 'One Year', 'Bank Transfer', 'DSL', true, false),
    (29033279, 6, 90.0, 540.0, 'Month-to-Month', 'Mailed Check', 'Fiber Optic', false, true),
    (6520539, 24, 60.0, 1440.0, 'Two Year', 'Electronic Check', 'DSL', true, false);

    This is just an example of adding a few rows to the table, but generally for production use cases, customers use several ways to add data to the table as documented in Loading data in Amazon Redshift.

  3. On the Redshift Serverless console, navigate to the namespace.
  4. On the Action dropdown menu, choose Register with AWS Glue Data Catalog to integrate with SageMaker Lakehouse.
  5. Choose Register.
  6. Sign in to the Lake Formation console as the data lake administrator.
  7. Under Data Catalog in the navigation pane, choose Catalogs and Pending catalog invitations.
  8. Select the pending invitation and choose Approve and create catalog.
  9. Provide a name for the catalog (for example, churn_lakehouse).
  10. Under Access from engines, select Access this catalog from Iceberg-compatible engines and choose DataTransferRole for the IAM role.
  11. Choose Next.
  12. Choose Add permissions.
  13. Under Principals, choose the datalakeadmin role for IAM users and roles, Super user for Catalog permissions, and choose Add.
  14. Choose Create catalog.

The Redshift table customer_churn is now created, populated with data, and integrated with SageMaker Lakehouse.

Create a SageMaker Unified Studio domain and project

Alice now sets up a SageMaker Unified Studio domain and project so that she can bring users (Bob, Charlie, and Doug) together in the new project.

Complete the following steps to create a SageMaker domain and project using SageMaker Unified Studio:

  1. On the SageMaker Unified Studio console, create a SageMaker Unified Studio domain and project using the All Capabilities profile template. For more details, refer to Setting up Amazon SageMaker Unified Studio. For this post, we create a project named churn_analysis.
  2. Set up IAM Identity Center with the users Bob, Charlie, and Doug, and add them to the domain and project.
  3. From SageMaker Unified Studio, navigate to the project overview and on the Project details tab, note the project role Amazon Resource Name (ARN).
  4. Sign in to the IAM console as an admin user.
  5. In the navigation pane, choose Roles.
  6. Search for the project role and add AmazonS3TablesReadOnlyAccess by choosing Add permissions.

SageMaker Unified Studio is now set up with a domain, project, and users.
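Steps 4 through 6 can also be scripted. The sketch below is one possible approach, not part of the original walkthrough: it extracts the role name from the project role ARN noted in step 3 and attaches the managed policy through an IAM client that is passed in. The policy ARN assumes that AmazonS3TablesReadOnlyAccess lives at the default AWS managed policy path; verify it in the IAM console before relying on it.

```python
# Assumed location of the AWS managed policy named in step 6.
MANAGED_POLICY_ARN = "arn:aws:iam::aws:policy/AmazonS3TablesReadOnlyAccess"


def role_name_from_arn(role_arn):
    """Return the role name from an ARN such as arn:aws:iam::<acct>:role/<path>/<name>."""
    parts = role_arn.split(":", 5)
    if len(parts) != 6 or not parts[5].startswith("role/"):
        raise ValueError(f"not an IAM role ARN: {role_arn}")
    # The role name is the last path segment of the resource part.
    return parts[5].rsplit("/", 1)[1]


def attach_s3_tables_read_only(iam_client, project_role_arn):
    """Attach the S3 Tables read-only managed policy to the project role."""
    iam_client.attach_role_policy(
        RoleName=role_name_from_arn(project_role_arn),
        PolicyArn=MANAGED_POLICY_ARN,
    )
```

With `boto3`, `iam_client` would be `boto3.client("iam")`, called by a principal that is allowed to modify the project role.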

Onboard S3 Tables and Redshift tables to the SageMaker Unified Studio project

Alice now configures the SageMaker Unified Studio project role for fine-grained access control to determine who on her team can access which datasets.

Grant the project role full table access on the customer dataset by completing the following steps:

  1. Sign in to the Lake Formation console as the data lake administrator.
  2. In the navigation pane, choose Data lake permissions, then choose Grant.
  3. In the Principals section, for IAM users and roles, choose the project role ARN noted earlier.
  4. In the LF-Tags or catalog resources section, select Named Data Catalog resources:
    • Choose <account_id>:s3tablescatalog/blog-customer-bucket for Catalogs.
    • Choose customernamespace for Databases.
    • Choose customer for Tables.
  5. In the Table permissions section, select Select and Describe for permissions.
  6. Choose Grant.

Now grant the project role access to a subset of columns from the customer_churn dataset.

  1. In the navigation pane, choose Data lake permissions, then choose Grant.
  2. In the Principals section, for IAM users and roles, choose the project role ARN noted earlier.
  3. In the LF-Tags or catalog resources section, select Named Data Catalog resources:
    • Choose <account_id>:churn_lakehouse/dev for Catalogs.
    • Choose public for Databases.
    • Choose customer_churn for Tables.
  4. In the Table Permissions section, select Select.
  5. In the Data Permissions section, select Column-based access.
  6. For Choose permission filter, select Include columns and choose customer_id, internet_service, and is_churned.
  7. Choose Grant.

All users in the churn_analysis project in SageMaker Unified Studio are now set up. They have access to all columns in the customer table in S3 Tables and fine-grained access to the Redshift table customer_churn, where they can access only three columns.
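The two console grants above correspond to two resource shapes in the Lake Formation `GrantPermissions` API: a `Table` resource for the full-table grant and a `TableWithColumns` resource for the column-filtered grant. The following Python sketch only builds the request payloads. It assumes that the nested catalog identifiers shown in the console (for example `<account_id>:churn_lakehouse/dev`) are accepted as `CatalogId` values, and the account ID and role ARN used in the usage note are placeholders.

```python
def table_grant(principal_arn, catalog_id, database, table, permissions):
    """Request payload for granting table-level permissions."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "Table": {
                "CatalogId": catalog_id,
                "DatabaseName": database,
                "Name": table,
            }
        },
        "Permissions": list(permissions),
    }


def column_grant(principal_arn, catalog_id, database, table, columns):
    """Request payload for granting SELECT on an explicit list of columns."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "CatalogId": catalog_id,
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),  # the "Include columns" filter
            }
        },
        "Permissions": ["SELECT"],
    }
```

Each payload can be passed to `boto3.client("lakeformation").grant_permissions(**payload)`. For this walkthrough the calls would be `table_grant(role_arn, "<account_id>:s3tablescatalog/blog-customer-bucket", "customernamespace", "customer", ["SELECT", "DESCRIBE"])` and `column_grant(role_arn, "<account_id>:churn_lakehouse/dev", "public", "customer_churn", ["customer_id", "internet_service", "is_churned"])`.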

Verify data access in SageMaker Unified Studio

Alice can now do a final verification that the data is available and that each of her team members is set up to access the datasets.

Now you can verify data access for different users in SageMaker Unified Studio.

  1. Sign in to SageMaker Unified Studio as Bob and choose the churn_analysis project.
  2. Navigate to the Data explorer to view s3tablescatalog and churn_lakehouse under Lakehouse.

Data Analyst uses Athena for analyzing customer churn

Bob, the data analyst, now logs in to SageMaker Unified Studio, chooses the churn_analysis project, navigates to the Build options, and chooses Query Editor under Data Analysis & Integration.

Bob chooses the connection as Athena (Lakehouse), the catalog as s3tablescatalog/blog-customer-bucket, and the database as customernamespace. He then runs the following SQL to analyze the data for customer churn:

select * from "churn_lakehouse/dev"."public"."customer_churn" a, 
"s3tablescatalog/blog-customer-bucket"."customernamespace"."customer" b
where a.customer_id=b.c_customer_sk limit 10;

Bob can now join data across S3 Tables and Redshift in Athena and can proceed to build full SQL analytics to automate the daily customer growth and churn reports for leadership.
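To check what this join should return for the sample rows inserted earlier, the following Python sketch replays it locally with SQLite. This is only a sanity check: SQLite stands in for Athena, only the join keys and a few columns from the sample data are loaded, and an explicit JOIN replaces the comma join (the two forms are equivalent here).

```python
import sqlite3

# (c_customer_sk, c_first_name) taken from the sample INSERT into customer.
CUSTOMERS = [
    (13251813, "Joyce"), (12755125, "Daniel"), (26009249, "Marie"),
    (3270685, "Wesley"), (29033279, "Alexandar"), (6520539, "Jerry"),
]
# (customer_id, internet_service, is_churned) from the sample customer_churn rows.
CHURN = [
    (10251783, "Fiber Optic", 1), (13251813, "DSL", 0),
    (12755125, "Fiber Optic", 1), (26009249, "DSL", 0),
    (3270685, "DSL", 0), (29033279, "Fiber Optic", 1),
    (6520539, "DSL", 0),
]


def join_sample_rows():
    """Join the two sample tables on customer_id = c_customer_sk."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (c_customer_sk INTEGER, c_first_name TEXT)")
    conn.execute(
        "CREATE TABLE customer_churn "
        "(customer_id INTEGER, internet_service TEXT, is_churned INTEGER)"
    )
    conn.executemany("INSERT INTO customer VALUES (?, ?)", CUSTOMERS)
    conn.executemany("INSERT INTO customer_churn VALUES (?, ?, ?)", CHURN)
    rows = conn.execute(
        "SELECT a.customer_id, b.c_first_name, a.internet_service, a.is_churned "
        "FROM customer_churn a JOIN customer b ON a.customer_id = b.c_customer_sk "
        "ORDER BY a.customer_id"
    ).fetchall()
    conn.close()
    return rows
```

Six of the seven churn rows have a matching customer; customer_id 10251783 appears only in customer_churn and drops out of the inner join.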

BI Analyst uses Redshift engine for analyzing customer data

Charlie, the BI analyst, now logs in to SageMaker Unified Studio and chooses the churn_analysis project. He navigates to the Build options and chooses Query Editor under Data Analysis & Integration. He chooses the connection as Redshift (Lakehouse), Databases as dev, and Schemas as public.

He then runs the following SQL to perform his specific analysis.

select * from "dev@churn_lakehouse"."public"."customer_churn" a, 
"blog-customer-bucket@s3tablescatalog"."customernamespace"."customer" b
where a.customer_id=b.c_customer_sk limit 10;

Charlie can now further update the SQL query and use it to power QuickSight dashboards that can be shared with Sales team members.

Data engineer uses AWS Glue Spark engine to process customer data

Finally, Doug logs in to SageMaker Unified Studio and chooses the churn_analysis project to perform his analysis. He navigates to the Build options and chooses JupyterLab under IDE & Applications. He downloads the churn_analysis.ipynb notebook and uploads it into the explorer. He then runs the cells by selecting compute as project.spark.compatibility.

He runs the following SQL to analyze the data for customer churn:

Doug can now use Spark SQL to process data from both S3 Tables and Redshift tables and start building forecasting models for customer growth and churn.

Cleaning up

If you implemented the example and want to remove the resources, complete the following steps:

  1. Clean up S3 Tables resources:
    1. Delete the table.
    2. Delete the namespace in the table bucket.
    3. Delete the table bucket.
  2. Clean up the Redshift data resources:
    1. On the Lake Formation console, choose Catalogs in the navigation pane.
    2. Delete the churn_lakehouse catalog.
  3. Delete the SageMaker project, IAM roles, Glue resources, Athena workgroup, and S3 buckets created for the domain.
  4. Delete the SageMaker domain and the VPC created for the setup.

Conclusion

In this post, we showed how you can use SageMaker Lakehouse to unify data across S3 Tables and Redshift data warehouses, which can help you build powerful analytics and AI/ML applications on a single copy of data. SageMaker Lakehouse gives you the flexibility to access and query your data in-place with Iceberg-compatible tools and engines. You can secure your data in the lakehouse by defining fine-grained permissions that are enforced across analytics and ML tools and engines.

For more information, refer to Tutorial: Getting started with S3 Tables, S3 Tables integration, and Connecting to the Data Catalog using AWS Glue Iceberg REST endpoint. We encourage you to try out the S3 Tables integration with SageMaker Lakehouse and share your feedback with us.


About the authors

Sandeep Adwankar is a Senior Technical Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She works with the product team and customers to build robust features and solutions for their analytical data platform. She enjoys building data mesh solutions and sharing them with the community.

Aditya Kalyanakrishnan is a Senior Product Manager on the Amazon S3 team at AWS. He enjoys learning from customers about how they use Amazon S3 and helping them scale performance. Adi’s based in Seattle, and in his spare time enjoys hiking and occasionally brewing beer.

Announcing up to 85% price reductions for Amazon S3 Express One Zone

Post Syndicated from Channy Yun (윤석찬) original https://aws.amazon.com/blogs/aws/up-to-85-price-reductions-for-amazon-s3-express-one-zone/

At re:Invent 2023, we introduced Amazon S3 Express One Zone, a high-performance, single-Availability Zone (AZ) storage class purpose-built to deliver consistent single-digit millisecond data access for your most frequently accessed data and latency-sensitive applications.

S3 Express One Zone delivers data access speed up to 10 times faster than S3 Standard, and it can support up to 2 million GET transactions per second (TPS) and up to 200,000 PUT TPS per directory bucket. This makes it ideal for performance-intensive workloads such as interactive data analytics, data streaming, media rendering and transcoding, high performance computing (HPC), and AI/ML training. Using S3 Express One Zone, customers like Fundrise, Aura, Lyrebird, Vivian Health, and Fetch improved the performance and reduced the costs of their data-intensive workloads.

Since launch, we’ve introduced a number of features for our customers using S3 Express One Zone. For example, S3 Express One Zone started to support object expiration using S3 Lifecycle to expire objects based on age to help you automatically optimize storage costs. In addition, your log-processing or media-broadcasting applications can directly append new data to the end of existing objects and then immediately read the object, all within S3 Express One Zone.

Today we’re announcing that, effective April 10, 2025, S3 Express One Zone has reduced storage prices by 31 percent, PUT request prices by 55 percent, and GET request prices by 85 percent. In addition, S3 Express One Zone has reduced the per-GB charges for data uploads and retrievals by 60 percent, and these charges now apply to all bytes transferred rather than just portions of requests greater than 512 KB.

Here is a price reduction table in the US East (N. Virginia) Region:

Price | Previous | New | Price reduction
Storage (per GB-Month) | $0.16 | $0.11 | 31%
Writes (PUT requests) | $0.0025 per 1,000 requests up to 512 KB | $0.00113 per 1,000 requests | 55%
Reads (GET requests) | $0.0002 per 1,000 requests up to 512 KB | $0.00003 per 1,000 requests | 85%
Data upload (per GB) | $0.008 | $0.0032 | 60%
Data retrievals (per GB) | $0.0015 | $0.0006 | 60%
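The arithmetic behind the table can be reproduced with a short Python sketch. The prices are the US East (N. Virginia) figures listed above; the workload used in the usage note is a made-up example. One simplification is labeled loudly: the function bills every uploaded and retrieved gigabyte under both price lists, whereas the previous per-GB charges applied only to the portion of each request above 512 KB, so the "previous" total is an upper bound for workloads with small requests.

```python
# US East (N. Virginia) prices in USD, copied from the table above.
PREVIOUS = {"storage_gb_month": 0.16, "put_per_1000": 0.0025,
            "get_per_1000": 0.0002, "upload_gb": 0.008, "retrieval_gb": 0.0015}
NEW = {"storage_gb_month": 0.11, "put_per_1000": 0.00113,
       "get_per_1000": 0.00003, "upload_gb": 0.0032, "retrieval_gb": 0.0006}


def reduction_percent(item):
    """Percentage drop from the previous price to the new price."""
    return (PREVIOUS[item] - NEW[item]) / PREVIOUS[item] * 100


def monthly_cost(prices, storage_gb, puts, gets, upload_gb, retrieval_gb):
    """Monthly bill for a workload, billing all bytes (simplifying assumption)."""
    return (
        storage_gb * prices["storage_gb_month"]
        + puts / 1000 * prices["put_per_1000"]
        + gets / 1000 * prices["get_per_1000"]
        + upload_gb * prices["upload_gb"]
        + retrieval_gb * prices["retrieval_gb"]
    )
```

For a hypothetical workload of 1,000 GB stored, 10 million PUTs, 100 million GETs, 500 GB uploaded, and 2,000 GB retrieved, `monthly_cost(PREVIOUS, 1000, 10_000_000, 100_000_000, 500, 2000)` comes to about $212 and the same call with `NEW` to about $127.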

For S3 Express One Zone pricing examples, go to the S3 billing FAQs or use the AWS Pricing Calculator.

These pricing reductions apply to S3 Express One Zone in all AWS Regions where the storage class is available: US East (N. Virginia), US East (Ohio), US West (Oregon), Asia Pacific (Mumbai), Asia Pacific (Tokyo), Europe (Ireland), and Europe (Stockholm) Regions. To learn more, visit the Amazon S3 pricing page and S3 Express One Zone in the AWS Documentation.

Give S3 Express One Zone a try in the S3 console today and send feedback to AWS re:Post for Amazon S3 or through your usual AWS Support contacts.

Channy

Build unified pipelines spanning multiple AWS accounts and Regions with Amazon MWAA

Post Syndicated from Anubhav Gupta original https://aws.amazon.com/blogs/big-data/build-unified-pipelines-spanning-multiple-aws-accounts-and-regions-with-amazon-mwaa/

As organizations scale their Amazon Web Services (AWS) infrastructure, they frequently encounter challenges in orchestrating data and analytics workloads across multiple AWS accounts and AWS Regions. While a multi-account strategy is essential for organizational separation and governance, it creates complexity in maintaining secure data pipelines and managing fine-grained permissions, particularly when different teams manage resources in separate accounts.

Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a managed orchestration service for Apache Airflow that you can use to set up and operate data pipelines in the AWS Cloud at scale. Apache Airflow is an open source tool used to programmatically author, schedule, and monitor sequences of processes and tasks, referred to as workflows. With Amazon MWAA, you can use Apache Airflow to create workflows without having to manage the underlying infrastructure for scalability, availability, and security.

In this blog post, we demonstrate how to use Amazon MWAA for centralized orchestration, while distributing data processing and machine learning tasks across different AWS accounts and Regions for optimal performance and compliance.

Solution overview

Let’s consider an example of a global enterprise with distributed teams spread across different AWS Regions. Each team generates and processes valuable data that is often required by other teams for comprehensive insights and streamlined operations. In this post, we consider a scenario where the data processing team works in one Region, the machine learning (ML) team works in another Region, and a central team manages the tasks between the two teams.

To address this complex challenge of orchestrating dependent teams across geographic regions, we’ve designed a data pipeline that spans multiple AWS accounts across different AWS Regions and is centrally orchestrated using Amazon MWAA. This design enables seamless data flow between teams, making sure that each team has access to the necessary data from other AWS accounts and Regions while maintaining compliance and operational efficiency.

Here’s a high-level overview of the architecture:

  • Centralized orchestration hub (Account A, us-east-1)
    • Amazon MWAA serves as the central orchestrator, coordinating operations across all regional data pipelines.
  • Regional data pipelines (Account B, two Regions)

This architecture maintains the concept of separate regional operations within Account B, with data processing in AWS Region 1 and ML in AWS Region 2. The central Amazon MWAA instance in Account A orchestrates these operations across AWS Regions, enabling different teams to work with the data they need. It enables scalability, automation, and streamlined data processing and ML workflows across multiple AWS environments.

Architecture Diagram

Prerequisites

 This solution requires two AWS accounts:

  • Account A: Central managed account for the Amazon MWAA environment.
  • Account B: Data processing and ML operations
    • Primary Region: US East (N. Virginia) [us-east-1]: Data processing workloads
    • Secondary Region: US West (Oregon) [us-west-2]: ML workloads

Step 1: Set up Account B (data processing and ML tasks)

Launch the CloudFormation stack in us-east-1 and provide the Account A ID as input. This template creates the following three stacks:

  • Stack in us-east-1: Creates the required roles for stackset execution.
  • Second stack in us-east-1: Creates an S3 bucket, S3 folders, and AWS Glue job.
  • Stack in us-west-2: Creates an S3 bucket, S3 folders, Amazon SageMaker config file, cross-account role, and AWS Lambda function.

Collect stack outputs: After successful deployment, gather the following output values from the created stacks. These outputs will be used in subsequent steps of the setup process.

  • From the us-east-1 stack:
    • The value of SourceBucketName
  • From the us-west-2 stack:
    • The value of DestinationBucketName
    • The value of CrossAccountRoleArn

 Step 2: Set up Account A (central orchestration)

Launch the CloudFormation stack in us-east-1. Provide the value of CrossAccountRoleArn from the Account B setup as input. This template does the following:

  • Deploys an Amazon MWAA environment
  • Sets up an Amazon MWAA Execution role with a cross-account trust policy.

Step 3: Setting up S3 CRR and bucket policies in Account B

Launch the CloudFormation stack in us-east-1 to set up cross-Region replication (CRR) between the S3 data-processing bucket in us-east-1 and the ML pipeline bucket in us-west-2. Provide the values of SourceBucketName, DestinationBucketName, and AccountAId as input parameters.

This stack should be deployed after completing the Amazon MWAA setup. This sequence is necessary because you need to grant the Amazon MWAA execution role appropriate permissions to access both the source and destination buckets.

Step 4: Implement cross-account, cross-Region orchestration

IAM cross-account role in Account B

The stack in Step 1 created an AWS Identity and Access Management (IAM) role in Account B with a trust relationship that allows the Amazon MWAA execution role from Account A (the central orchestration account) to assume it. Additionally, this role is granted the necessary permissions to access AWS resources in both Regions of Account B.

This setup enables the Amazon MWAA environment in Account A to securely perform actions and access resources across different Regions in Account B, maintaining the principle of least privilege while allowing for flexible, cross-account orchestration.
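The trust relationship described above can be pictured with a short Python sketch that builds such a policy document. This is only an illustration, not the template's actual policy: the function name and the placeholder ARN are assumptions, and the stack may add further conditions.

```python
import json


def build_trust_policy(mwaa_execution_role_arn: str) -> dict:
    """Return a trust policy that lets one specific role assume the cross-account role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Only the named execution role in Account A is trusted.
                "Principal": {"AWS": mwaa_execution_role_arn},
                "Action": "sts:AssumeRole",
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder ARN (hypothetical); substitute the execution role from Account A.
    example_arn = "arn:aws:iam::111111111111:role/example-mwaa-execution-role"
    print(json.dumps(build_trust_policy(example_arn), indent=2))
```

Naming a single role as the principal, rather than the whole account, keeps the trust as narrow as the least-privilege goal suggests.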

Airflow connection in Account A

To establish cross-account connections in Amazon MWAA:

Create a connection for us-east-1. Open the Airflow UI and navigate to Admin and then to Connections. Choose the plus (+) icon to add a new connection and enter the following details:

  • Connection ID: Enter aws_crossaccount_role_conn_east1
  • Connection type: Select Amazon Web Services.
  • Extras: Add the cross-account role ARN and Region name using the following code. Replace <CrossAccountRoleArn> with the Amazon Resource Name (ARN) of the cross-account role created in Region 2 (us-west-2) while setting up Account B in Step 1:
{
"role_arn": "<CrossAccountRoleArn>",
"region_name": "us-east-1"
}

Create a second connection for us-west-2.

  • Connection ID: Enter aws_crossaccount_role_conn_west2
  • Connection type: Select Amazon Web Services.
  • Extras: Add a CrossAccountRoleArn and Region name using the following code:
{
"role_arn": "<CrossAccountRoleArn>",
"region_name": "us-west-2"
}

By setting up these Airflow connections, Amazon MWAA can securely access resources in both us-east-1 and us-west-2, helping to ensure seamless workflow execution.
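If you prefer to generate the Extras values rather than type them, a small helper along these lines can produce both JSON strings from one role ARN. It is a sketch: the helper name is an assumption, while the connection IDs and Regions are the ones used above.

```python
import json

# Connection IDs and Regions taken from the two connections described above.
CONNECTION_REGIONS = {
    "aws_crossaccount_role_conn_east1": "us-east-1",
    "aws_crossaccount_role_conn_west2": "us-west-2",
}


def build_extras(role_arn: str) -> dict:
    """Map each connection ID to the JSON string for its Extras field."""
    return {
        conn_id: json.dumps({"role_arn": role_arn, "region_name": region})
        for conn_id, region in CONNECTION_REGIONS.items()
    }


if __name__ == "__main__":
    # Keep the placeholder until you have the real CrossAccountRoleArn output.
    for conn_id, extra in build_extras("<CrossAccountRoleArn>").items():
        print(conn_id, extra)
```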

Implement cross-account workflows in Account A

Now that your environment is set up with the necessary IAM roles and Airflow connections, you can create data processing and ML workflows that span across accounts and Regions.

DAG 1: Cross-account data processing

Airflow DAG1 Workflow for Data Processing

The directed acyclic graph (DAG) depicted in the preceding figure demonstrates a cross-account data processing workflow using Amazon MWAA and AWS services.

To implement this DAG:

Here’s a description of its key operators:

  • S3KeySensor: This sensor monitors a specified S3 bucket for the presence of a raw data file (raw/ml_train_data.csv). It uses a cross-account AWS connection (aws_crossaccount_role_conn_east1) to access the S3 bucket in a different AWS account. The sensor checks every 60 seconds and times out after 1 hour if the file is not detected.
  • GlueJobOperator: This operator triggers an AWS Glue job (mwaa_glue_raw_to_transform) for data preprocessing. It passes the bucket name as a script argument to the AWS Glue job. Like the S3KeySensor, it uses the cross-account AWS connection to execute the AWS Glue job in the target account.
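The two operators described above could be wired together roughly as follows. This is a minimal sketch rather than the post's actual DAG file: it assumes Airflow 2.4 or later with the Amazon provider package installed, and the task IDs, DAG ID, start date, and the `--bucket_name` argument name are hypothetical.

```python
from datetime import datetime

CONN_ID = "aws_crossaccount_role_conn_east1"


def sensor_args(bucket_name: str) -> dict:
    """Keyword arguments for the S3KeySensor: poke every 60 s, give up after 1 hour."""
    return {
        "task_id": "wait_for_raw_data",  # hypothetical task ID
        "bucket_name": bucket_name,
        "bucket_key": "raw/ml_train_data.csv",
        "aws_conn_id": CONN_ID,
        "poke_interval": 60,
        "timeout": 60 * 60,
    }


def glue_args(bucket_name: str) -> dict:
    """Keyword arguments for the GlueJobOperator that runs the preprocessing job."""
    return {
        "task_id": "run_glue_job",  # hypothetical task ID
        "job_name": "mwaa_glue_raw_to_transform",
        "script_args": {"--bucket_name": bucket_name},  # argument name is assumed
        "aws_conn_id": CONN_ID,
    }


def build_dag(bucket_name: str):
    """Build the DAG; Airflow is imported lazily so the helpers work without it."""
    from airflow import DAG
    from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
    from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

    with DAG(
        dag_id="cross_account_data_processing_sketch",  # hypothetical DAG ID
        start_date=datetime(2025, 1, 1),
        schedule=None,  # triggered manually
        catchup=False,
    ) as dag:
        wait = S3KeySensor(**sensor_args(bucket_name))
        transform = GlueJobOperator(**glue_args(bucket_name))
        wait >> transform  # run the Glue job only after the file appears
    return dag
```

Both tasks share the same connection ID, so the role assumption and Region come from the Airflow connection rather than from the DAG code.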

 DAG 2: Cross-account and cross-Region ML

Airflow DAG2 Workflow for Machine Learning

The DAG in the preceding figure demonstrates a cross-account machine learning workflow using Amazon MWAA and AWS services. It shows Airflow’s flexibility in enabling users to write custom operators for specific use cases, particularly for cross-account operations.

To implement this DAG:

Here’s a description of the custom operators and key components:

  • CrossAccountSageMakerHook: This custom hook extends the SageMakerHook to enable cross-account access. It uses AWS Security Token Service (AWS STS) to assume a role in the target account, enabling seamless interaction with SageMaker across account boundaries.
  • CrossAccountSageMakerTrainingOperator: Building on the CrossAccountSageMakerHook, this operator enables SageMaker training jobs to be executed in a different AWS account. It overrides the default SageMakerTrainingOperator to use the cross-account hook.
  • S3KeySensor: This sensor monitors a specified S3 bucket for the presence of training data, verifying that the required data is available before the machine learning workflow proceeds. It uses a cross-account AWS connection (aws_crossaccount_role_conn_west2) to access the S3 bucket in a different AWS account.
  • SageMakerTrainingOperator: Uses the custom CrossAccountSageMakerTrainingOperator to initiate a SageMaker training job in the target account. The configuration for this job is dynamically loaded from an S3 bucket.
  • LambdaInvokeFunctionOperator: Invokes a Lambda function named dagcleanup after the SageMaker training job completes. This can be used for post-processing or cleanup tasks.
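The role assumption at the heart of the custom hook can be sketched with plain boto3. This is not the post's CrossAccountSageMakerHook: the function names and session name below are assumptions, and the sketch only shows the AWS STS step that such a hook would rely on.

```python
def assume_role_credentials(sts_client, role_arn: str,
                            session_name: str = "mwaa-cross-account") -> dict:
    """Assume the role and return keyword arguments for creating a boto3 client."""
    response = sts_client.assume_role(RoleArn=role_arn, RoleSessionName=session_name)
    creds = response["Credentials"]
    return {
        "aws_access_key_id": creds["AccessKeyId"],
        "aws_secret_access_key": creds["SecretAccessKey"],
        "aws_session_token": creds["SessionToken"],
    }


def sagemaker_client_for(role_arn: str, region_name: str):
    """Create a SageMaker client that acts in the target account and Region."""
    import boto3  # imported lazily so the helper above can be tested without boto3

    sts = boto3.client("sts")
    return boto3.client(
        "sagemaker",
        region_name=region_name,
        **assume_role_credentials(sts, role_arn),
    )
```

Passing the STS client in as a parameter keeps the credential-handling logic easy to exercise with a stub, without calling AWS.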

Step 5: Schedule and verify the Airflow DAGs

  1. To schedule the DAGs, copy the Python scripts cross_account_data_processing_dag.py and cross_account_machine_learning_dag.py to the S3 location associated with Amazon MWAA in central Account A. Go to the Airflow environment created in Account A, us-east-1, locate the S3 bucket link, and upload the scripts to the dags folder.
  2. Download the data file and upload it to the source bucket created in Account B, us-east-1, under the raw folder.
  3. Navigate to the Airflow UI.
  4. Locate your DAG in the DAGs tab. The DAG automatically syncs from Amazon S3 to the Airflow UI. Choose the toggle button to enable the DAGs.
  5. Trigger the DAG runs.

DAGs Dashboard

Best practices for cross-account integration

When implementing cross-account, cross-Region workflows with Amazon MWAA, consider the following best practices to help ensure security, efficiency, and maintainability.

  • Secrets management: Use AWS Secrets Manager to securely store and manage sensitive information such as database credentials, API keys, or cross-account role ARNs. Rotate secrets regularly using Secrets Manager automatic rotation. For more information, see Using a secret key in AWS Secrets Manager for an Apache Airflow connection.
  • Networking: Choose the appropriate networking solution (AWS Transit Gateway, VPC Peering, AWS PrivateLink) based on your specific requirements, considering factors such as the number of VPCs, security needs, and scalability requirements. Implement appropriate security groups and network ACLs to control traffic flow between connected networks.
  • IAM role management: Follow the principle of least privilege when creating IAM roles for cross-account access.
  • Error handling and retries: Implement robust error handling in your DAGs to manage cross-account access issues. Use Airflow’s retry mechanisms to handle transient failures in cross-account operations.
  • Managing Python dependencies: Use a requirements.txt file to specify exact versions of required packages. Test your dependencies locally using the Amazon MWAA local runner before deploying to production. For more information, see Amazon MWAA best practices for managing Python dependencies.
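As one way to apply the retry recommendation, Airflow lets you set retry behavior for every task in a DAG through default_args. The values below are illustrative assumptions, not settings from the post.

```python
from datetime import timedelta

# Illustrative retry settings; tune the numbers to your own workloads.
DEFAULT_ARGS = {
    "retries": 3,                               # re-run a failed task up to 3 times
    "retry_delay": timedelta(minutes=2),        # base wait between attempts
    "retry_exponential_backoff": True,          # lengthen the wait after each failure
    "max_retry_delay": timedelta(minutes=15),   # cap on the backoff
}
```

Passing `default_args=DEFAULT_ARGS` when constructing the DAG applies these settings to each task, which helps with transient failures such as a throttled role assumption.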

Clean up

To avoid future charges, remove any resources you created for this solution.

  • Empty the S3 buckets: Manually delete all objects within each bucket, verify they are empty, then delete the buckets themselves.
  • Delete the CloudFormation stacks: Identify and delete the stacks associated with the architecture.
  • Verify resource cleanup: Make sure that Amazon MWAA, AWS Glue, SageMaker, Lambda, and other services are terminated.
  • Remove remaining resources: Delete any manually created IAM roles, policies, or security groups.

Conclusion

By using Airflow connections, custom operators, and features such as Amazon S3 cross-Region replication, you can create a sophisticated workflow that seamlessly operates across multiple AWS accounts and Regions. This approach allows for complex, distributed data processing and machine learning pipelines that can take advantage of resources spread across your entire AWS infrastructure. The combination of cross-account access, cross-Region replication, and custom operators provides a powerful toolkit for building scalable and flexible data workflows. As always, careful planning and adherence to security best practices are crucial when implementing these advanced multi-account, multi-Region architectures.

Ready to tackle your own cross-account orchestration challenges? Test this approach and share your experience in the comments section.


About the authors

Suba Palanisamy is a Senior Technical Account Manager helping customers achieve operational excellence using AWS. Suba is passionate about all things data and analytics. She enjoys traveling with her family and playing board games.

Anubhav Gupta is a Solutions Architect at AWS supporting enterprise greenfield customers, focusing on the financial services industry. He has worked with hundreds of customers worldwide building their cloud foundational environments and platforms, architecting new workloads, and creating governance strategy for their cloud environments. In his free time, he enjoys traveling and spending time outdoors.

Anusha Pininti is a Solutions Architect guiding enterprise greenfield customers through every stage of their cloud transformation, specializing in data analytics. She supports customers across various industries, helping them achieve their business objectives through cloud-based solutions. In her free time, Anusha loves to travel, spend time with family, and experiment with new dishes.

Sriharsh Adari is a Senior Solutions Architect at AWS, where he helps customers work backward from business outcomes to develop innovative solutions on AWS. Over the years, he has helped multiple customers on data platform transformations across industry verticals. His core areas of expertise include technology strategy, data analytics, and data science. In his spare time, he enjoys playing sports, watching TV shows, and playing Tabla.

Geetha Penmatsa is a Solutions Architect supporting enterprise greenfield customers through their cloud journey. She helps customers across various industries transform their business with the AWS Cloud. She has a background in data analytics and specializes in the Amazon Connect cloud contact center to help transform customer experience at scale. Outside work, Geetha loves to travel, ski, hike, and spend time with friends and family.

Serverless ICYMI 2025 Q1

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/serverless-icymi-2025-q1/

Welcome to the 28th edition of the AWS Serverless ICYMI (in case you missed it) quarterly recap. At the end of a quarter, we share the most recent product launches, feature enhancements, blog posts, videos, live streams, and other interesting things that you might have missed!

In case you missed our last ICYMI, check out what happened in Q4 2024 here.

Serverless calendar Q1 2025


AWS Step Functions

The AWS Step Functions team continues to improve developer experience. Workflow Studio is now available within Visual Studio Code (VS Code) through the AWS Toolkit extension.

AWS Step Functions in IDE


You can now design, test, and deploy your Step Functions workflows without leaving your IDE. The extension provides a drag-and-drop interface with all the familiar Workflow Studio capabilities, making it even easier to build state machines locally.

To get started, install the AWS Toolkit for Visual Studio Code and visit the user guide on Workflow Studio integration.

Step Functions private integrations now allow you to integrate applications seamlessly across private networks, on-premises infrastructure, and cloud platforms. Learn more in a blog post and explanation video.

AWS Step Functions private integrations video


Step Functions now integrates with 36 more AWS services that support user messaging capabilities. You can orchestrate notifications through Amazon SNS, Amazon SQS, Amazon EventBridge, Amazon Pinpoint, and more, all using the optimized integrations you’re familiar with.

Step Functions has increased the default quota for state machines and activities from 10,000 to 100,000 per AWS account. This tenfold increase means you can create more workflows to automate your business processes without worrying about hitting quota limits.

Distributed Map is expanding capabilities by adding support for JSON Lines (JSONL) format. JSONL, a highly efficient text-based format, stores structured data as individual JSON objects separated by newlines, making it particularly suitable for processing large datasets.
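To make the format concrete, here is a small Python sketch that parses JSONL text one line at a time. The sample records and the function name are made up for illustration; the sketch shows the file format only, not how Distributed Map reads it.

```python
import json


def parse_jsonl(text: str) -> list:
    """Parse JSON Lines text: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Hypothetical sample data: two records, each on its own line.
SAMPLE = '{"order_id": 1, "total": 9.5}\n{"order_id": 2, "total": 12.0}\n'

records = parse_jsonl(SAMPLE)
```

Because every line is a complete JSON document, a large file can be split on newlines and its records processed independently.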

AWS Step Functions Distributed Map


Distributed Map can also process data from a broader range of delimited file formats stored in Amazon S3 and offers new output transformations for greater control over result formatting.

Developer Tools

Serverless Land patterns are now available directly within VS Code.

You no longer need to switch between your IDE and external resources when building serverless architectures. Browse, search, and implement pre-built serverless patterns directly in VS Code.

Example Serverless Pattern


AWS Lambda

Learn how AWS Lambda handles billions of invocations.

AWS Lambda asynchronous invocations


This blog post provides recommendations and insights for implementing highly distributed applications based on the Lambda service team’s experience building its robust asynchronous event processing system. It dives into challenges you might face, solution techniques, and best practices for handling noisy neighbors.

A new video walks through using the enhanced local IDE experience for Lambda developers.

AWS Lambda new IDE experience


The VS Code extension for Lambda now supports live tailing of CloudWatch Logs directly in your IDE, following on from previous support for Live Tail in the Lambda console. Watch logs in real time as your functions execute, making debugging and troubleshooting more efficient.

You can now enable Application Performance Monitoring (APM) for Java and .NET runtimes using Amazon CloudWatch Application Signals.

Amazon CloudWatch Application Signals for Java and .NET AWS Lambda runtimes


This provides deep visibility into your function’s performance, including method-level tracing, memory profiling, and automated anomaly detection.

Amazon Bedrock features

Multi-agent collaboration is now available in Bedrock as a preview, enabling you to create systems where multiple AI agents work together to solve complex problems. Agents can specialize in different domains, share context, and coordinate their actions to achieve goals that would be difficult for a single agent.

RAG evaluation is now generally available. This provides metrics to assess and improve your retrieval augmented generation pipelines. GraphRAG for Bedrock Knowledge Bases is now generally available, allowing you to enhance retrievals with graph-based context.

Amazon Bedrock Flows now supports multi-turn conversations, allowing you to build dynamic AI applications that maintain context across multiple user interactions. Bedrock data automation is now generally available, streamlining the process of preparing, ingesting, and maintaining data for your GenAI applications. Bedrock now offers LLM-as-a-judge capability for model evaluation, providing automated assessment of model outputs without requiring human reviewers. Compare different models or prompt strategies against your specific criteria at scale.

Bedrock’s capabilities are now integrated into the Amazon SageMaker Unified Studio, creating a seamless experience for machine learning practitioners who want to incorporate foundation models into their workflows. Access Bedrock models, fine-tuning, and evaluation directly from SageMaker.

Amazon Nova is a new generation of state-of-the-art foundation models that deliver frontier intelligence and industry leading price-performance. Nova has expanded its tool use and converse API capabilities, making it easier for developers to build AI assistants that can use external tools to complete tasks.

Amazon Bedrock Guardrails image content filters are now generally available. Define and enforce boundaries for your AI applications with controls for both text and image content, ensuring outputs align with your organization’s policies.

Bedrock Knowledge Bases now supports using your existing OpenSearch clusters as the vector storage backend. This integration allows you to leverage your investments in OpenSearch while benefiting from the managed RAG capabilities of Bedrock.

New Amazon Bedrock models

  • Anthropic’s Claude 3.7 Sonnet hybrid reasoning allows you to toggle between standard and extended thinking modes. In standard mode, it functions as an upgraded version of Claude 3.5 Sonnet. In extended thinking mode, it employs self-reflection to achieve improved results across a wide range of tasks.
  • DeepSeek R1, an advanced model specialized in research and scientific reasoning, excels at complex problem-solving tasks and technical content generation.
  • Cohere Embed 3 models are now available in both multilingual and English-specific versions. These embedding models support text and images, providing more accurate representation for multimodal content and improving retrieval augmented generation (RAG) applications.
  • Ray2, Luma AI’s new visual AI model, is capable of creating realistic visuals with fluid, natural movement. You can use it for image understanding, 3D scene reconstruction, and visual content generation, opening new possibilities for immersive and visual applications.
  • Bedrock now supports fine-tuning of Meta’s latest Llama 3.2 models. These upgraded models deliver improved performance across reasoning, coding, and multilingual tasks while being more efficient with computational resources.

Amazon Q Developer

Amazon Q Developer is now available as a CLI agent, bringing AI-assisted development to the command line. Get contextual recommendations, generate shell commands, and solve coding problems without leaving your terminal.

Amazon Q CLI


Amazon Q Developer transformation now supports upgrading Java applications using Maven to Java 21. It offers enhanced code suggestions, refactoring, and optimization recommendations for applications using the latest Java features, like virtual threads and pattern matching.

AWS AppSync

AWS AppSync Events now supports events publishing for WebSocket APIs, enabling real-time publish-subscribe functionality. This feature makes it easier to build applications requiring instant updates, like chat applications, collaborative tools, and real-time dashboards.

AWS AppSync Events


There are new AWS Cloud Development Kit (AWS CDK) L2 constructs for AppSync WebSocket APIs. These make it simpler to define and deploy real-time APIs using infrastructure as code. These high-level constructs handle the details of WebSocket connections, authorization, and messaging patterns.

Amazon SNS

Amazon SNS now supports high-throughput mode for SNS FIFO topics, with default throughput matching SNS standard topics. When you enable high-throughput mode, SNS FIFO topics maintain order within each message group, while reducing the deduplication scope to the message-group level.

Amazon EventBridge

Amazon EventBridge now supports direct delivery to targets across AWS accounts, simplifying multi-account architectures. This reduces latency and improves reliability when routing events between accounts in your organization.

Amazon EventBridge cross account


The EventBridge console now features event source discovery, making it easier to find and visualize available event sources in your AWS environment. This tool helps you identify potential event producers and understand the event schemas they emit.

AWS Amplify

AWS Amplify now offers a TypeScript data client optimized for server-side Lambda functions, providing type-safe access to your data sources. This client reduces code complexity and improves reliability when working with databases and APIs in server environments.

Serverless compute blog posts

January

February

March

Serverless Office Hours weekly livestream

February

March

Still looking for more?

The Serverless landing page has more information. The Lambda resources page contains case studies, webinars, whitepapers, customer stories, reference architectures, and even more Getting Started tutorials.

You can also follow the Developer Advocacy team members who work on Serverless to see the latest news, follow conversations, and interact with the team.

And finally, visit Serverless Land for all your serverless needs.

From virtual machine to Kubernetes to serverless: How dacadoo saved 78% on cloud costs and automated operations

Post Syndicated from Andreas Gehrig original https://aws.amazon.com/blogs/architecture/from-virtual-machine-to-kubernetes-to-serverless-how-dacadoo-saved-78-on-cloud-costs-and-automated-operations/

dacadoo is a global Swiss-based technology company that develops solutions for digital health engagement and health risk quantification. Their products include a software-as-a-service (SaaS)-based digital health engagement platform that uses behavioral science, AI, and gamification to help end users improve their health outcomes.

The company embarked on a journey to modernize an API to quantify health and lifestyle data plus a risk engine to calculate mortality and morbidity probabilities based on years of scientific research data.

To transform a virtual machine–based API service into a globally redundant, scalable health score and risk calculation solution, dacadoo chose Amazon Web Services (AWS) technology. The service handles highly sensitive health data from a global customer base and must comply with regional regulations.

The result is a cost reduction of 78% and an infrastructure maintenance effort of less than an hour per year, allowing dacadoo to deliver and operate more AWS infrastructure without scaling its site reliability engineering (SRE) team, thanks to a high level of automation and an agile mindset.

In this post, we walk you step-by-step through dacadoo’s journey of embracing managed services, highlighting their architectural decisions as we go.

Background

The solution architecture went through a three-stage journey:

  1. Incubation – Single virtual machine on premises with disaster recovery (DR) in Switzerland
  2. Global and scalable – Multiple global Kubernetes clusters
  3. Operational excellence – Fully serverless and geo-redundant on AWS

Stage 1: Incubation with a virtual machine

After years of scientific research and development, the service was launched, running on a single on-premises virtual machine that used hypervisor technology to provide disaster recovery (DR). However, it had no high availability (HA) capability and it required manual recovery.

The application serving the API requests and the NoSQL database were both running on the same host. Software deployment and operating system maintenance were performed manually using Secure Shell (SSH)—a typical low-automation setup that also included downtime.

The following architecture diagram shows a virtual machine encompassing the monolithic application and its database.

Monolithic architecture

Challenges

A single virtual machine was quick to set up and inexpensive to operate, but it had considerable shortcomings. The health API was only available in Switzerland, and both infrastructure maintenance and software deployment were performed manually. Additionally, database backups relied on virtual machine snapshots, monitoring was limited to uptime checks, and testing was conducted on the developer workstation.

Stage 2: Global and scalable with Kubernetes

At that time, dacadoo made a strategic decision to heavily invest in Kubernetes for managing containerized workloads on a global scale. As part of this technology rollout, the health score and risk service were migrated to Kubernetes.

Due to the geographically distributed customer base and low latency requirements, three Kubernetes clusters were deployed, one on each continent. The NoSQL database was hosted in proximity to the workload to reduce service latency and keep the migration effort low.

To reduce the operational maintenance, the NoSQL database was integrated as a SaaS offering, and monitoring was centralized using Datadog.

All cloud infrastructure was provisioned exclusively with Terraform, covering the Kubernetes cluster, NoSQL database, and integration with GitLab and Datadog.

dacadoo containerized the API service and used GitLab continuous integration and continuous deployment (CI/CD) pipelines to deploy multiple environments and clusters on a global hyperscaler.

In retrospect, this was a typical replatform modernization project from virtual machine to Kubernetes, with a high level of automation and a SaaS-first approach.

The following diagram is the architecture for the container solution with managed NoSQL database.

Containers architecture

Challenges

The service faced several challenges, including increased costs from deploying three regional Kubernetes clusters across three environments, resulting in 27 cluster nodes and additional expenses from managing NoSQL database SaaS instances for each cluster. The complexity of CI/CD pipelines for multi-environment multi-cluster deployments added to the difficulty. Significant operational effort was required to keep infrastructure and Kubernetes components up to date.

Stage 3: Operational excellence with serverless

The Kubernetes-based architecture met the requirements, but some features in the dacadoo API service backlog did not fit well with the application architecture at the time.

This was the right moment to take a holistic view of the infrastructure and software architecture and refactor the solution according to the latest AWS technologies and best practices, the next frontier for dacadoo’s engineering team.

Solution requirements

Requirements for the solution refactoring were as follows:

  • Keep the functionality of the API unmodified
  • Constrain data processing to a region of choice for compliance with local data protection laws
  • Avoid weekly patch cycles by exclusively using managed serverless services
  • Reduce costs by choosing services with a pay-as-you-go billing model
  • Delegate authentication to a dedicated service
  • Use an established web framework with an extensive ecosystem

Refactoring the apps

The API service has two components: a developer portal and the health score and risk calculations API. The database is only required for API keys, algorithm parameters, quotas, and usage statistics. Health data is processed regionally by the compute layer but not persisted, opening the door for a distributed database: Amazon DynamoDB global tables is the perfect fit for the solution. Writes are distributed to all connected Regions, whereas reads are local, providing low latency for complying with dacadoo service level agreements (SLAs).

The developer portal is a web UI with API documentation and API key management features. AWS Lambda is a great fit because it scales automatically and has a pay-per-request billing model.

The health and risk API uses algorithms implemented in the C programming language for short bursting, compute-intense simulations. These calls are wrapped by a REST API using the Python FastAPI framework. These characteristics make AWS Lambda a great fit.

Serverless architecture

HTTP requests are routed to the Lambda functions using Amazon API Gateway with AWS WAF for protection from malicious requests and attacks. Static assets are served from an Amazon Simple Storage Service (Amazon S3) bucket through API Gateway. The additional features of Amazon CloudFront aren’t required, and Amazon S3 reduces the complexity.

Amazon Route 53 provides a powerful feature known as latency-based routing, which allows it to direct DNS queries to the endpoint that offers the lowest latency for the requester.

This feature provides Regional high availability for API users without data processing location requirements. Alternatively, the user can call specific Regional endpoints to make sure requests are processed in the desired Region.

API authorization is HTTP header-based and is performed in the application with data stored in Amazon DynamoDB.

The following diagram is the architecture for a geo-redundant fully serverless solution.

Serverless architecture

Because the dacadoo SRE team is proficient in Python, they opted for Pulumi for its advanced features, such as programming language flow control constructs, powerful configuration capabilities, and multi-cloud support.

For continuous integration, GitLab CI compiles the algorithm library, tests the FastAPI applications, and packages everything. Application deployment is just an update of the AWS Lambda functions, which makes for a simple and reliable workflow.

Summary

The solution evolved from a managed infrastructure setup, where the customer held most of the responsibility, to an AWS managed service architecture.

Infrastructure provisioning evolved from manual, error-prone processes to powerful code-driven workflows in Pulumi. The SRE team needed to enhance their software engineering skills to adopt Pulumi, transitioning from configuration-based approaches to designing and maintaining an infrastructure code base using object-oriented Python. This was part of dacadoo’s investment in the SRE team and broader modernization efforts. The serverless architecture enabled a GitOps engineering culture focused on productivity.

The transformation maximized scalability and availability while reducing costs and operational effort:

Virtual machine

  • Scalability: Low
  • Availability: Best effort
  • Infrastructure costs: Low
  • Maintenance effort: High

Kubernetes

  • Scalability: High
  • Availability: 99.95%
  • Infrastructure costs: High
  • Maintenance effort: Medium

Serverless

  • Scalability: Very high
  • Availability: 99.999% (with failover to another AWS Region)
  • Infrastructure costs: Low
  • Maintenance effort: Very low

The global redundancy elevates availability to an impressive 99.999% while keeping the costs low.
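
To put the availability figures in the comparison above into perspective, the following minimal sketch converts an availability percentage into a yearly downtime budget. It is plain arithmetic on the percentages quoted in this post; the function name is only illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget implied by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


kubernetes = downtime_minutes_per_year(99.95)   # Kubernetes figure from the comparison
serverless = downtime_minutes_per_year(99.999)  # serverless figure with Regional failover

print(f"99.95%  -> about {kubernetes:.1f} minutes of downtime per year")
print(f"99.999% -> about {serverless:.2f} minutes of downtime per year")
```

In other words, moving from 99.95% to 99.999% shrinks the yearly downtime budget from a few hours to a few minutes.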

Conclusion

Migrating from a virtual machine to Kubernetes and ultimately to AWS Lambda demonstrates the progression of cloud engineering toward enhanced efficiency and scalability.

Each step in this journey reduced the complexity of managing resources while increasing flexibility and automation. Transitioning dacadoo’s API service to a fully serverless, geo-redundant architecture not only advanced the platform but also upskilled engineers, maintained a lean SRE team, and kept infrastructure costs low. Get started with your own AWS serverless solution.



Using Amazon S3 Tables with Amazon Redshift to query Apache Iceberg tables

Post Syndicated from Jonathan Katz original https://aws.amazon.com/blogs/big-data/using-amazon-s3-tables-with-amazon-redshift-to-query-apache-iceberg-tables/

Amazon Redshift supports querying data stored using Apache Iceberg tables, an open table format that simplifies management of tabular data residing in data lakes on Amazon Simple Storage Service (Amazon S3). Amazon S3 Tables delivers the first cloud object store with built-in Iceberg support and streamlines storing tabular data at scale, including continual table optimizations that help improve query performance. Amazon SageMaker Lakehouse unifies your data across S3 data lakes, including S3 Tables, and Amazon Redshift data warehouses. It helps you build powerful analytics and artificial intelligence and machine learning (AI/ML) applications on a single copy of data, and lets you query data stored in S3 Tables without complex extract, transform, and load (ETL) or data movement processes. You can take advantage of the scalability of S3 Tables to store and manage large volumes of data, optimize costs by avoiding additional data movement steps, and simplify data management through centralized fine-grained access control from SageMaker Lakehouse.

In this post, we demonstrate how to get started with S3 Tables and Amazon Redshift Serverless for querying data in Iceberg tables. We show how to set up S3 Tables, load data, register them in the unified data lake catalog, set up basic access controls in SageMaker Lakehouse through AWS Lake Formation, and query the data using Amazon Redshift.

Note – Amazon Redshift is just one option for querying data stored in S3 Tables. You can learn more about S3 Tables and additional ways to query and analyze data on the S3 Tables product page.

Solution overview

In this solution, we show how to query Iceberg tables managed in S3 Tables using Amazon Redshift. Specifically, we load a dataset into S3 Tables, link the data in S3 Tables to a Redshift Serverless workgroup with appropriate permissions, and finally run queries to analyze our dataset for trends and insights. The following diagram illustrates this workflow.

In this post, we will walk through the following steps:

  1. Create a table bucket in S3 Tables and integrate with other AWS analytics services.
  2. Set up permissions and create Iceberg tables with SageMaker Lakehouse using Lake Formation.
  3. Load data with Amazon Athena. There are different ways to ingest data into S3 Tables, but for this post, we show how we can quickly get started with Athena.
  4. Use Amazon Redshift to query your Iceberg tables stored in S3 Tables through the auto mounted catalog.

Prerequisites

The examples in this post require you to use the following AWS services and features:

Create a table bucket in S3 Tables

Before you can use Amazon Redshift to query the data in S3 Tables, you must first create a table bucket. Complete the following steps:

  1. In the Amazon S3 console, choose Table buckets on the left navigation pane.
  2. In the Integration with AWS analytics services section, choose Enable integration if you haven’t previously set this up.

This sets up the integration with AWS analytics services, including Amazon Redshift, Amazon EMR, and Athena.

After a few seconds, the status will change to Enabled.

  1. Choose Create table bucket.
  2. Enter a bucket name. For this example, we use the bucket name redshifticeberg.
  3. Choose Create table bucket.

After the S3 table bucket is created, you will be redirected to the table buckets list.

Now that your table bucket is created, the next step is to configure the unified catalog in SageMaker Lakehouse through the Lake Formation console. This will make the table bucket in S3 Tables available to Amazon Redshift for querying Iceberg tables.

Publishing Iceberg tables in S3 Tables to SageMaker Lakehouse

Before you can query Iceberg tables in S3 Tables with Amazon Redshift, you must first make the table bucket available in the unified catalog in SageMaker Lakehouse. You can do this through the Lake Formation console, which lets you publish catalogs and manage tables through the catalogs feature, and assign permissions to users. The following steps show you how to set up Lake Formation so you can use Amazon Redshift to query Iceberg tables in your table bucket:

  1. If you’ve never visited the Lake Formation console before, you must first do so as an AWS user with admin permissions to activate Lake Formation.

You will be redirected to the Catalogs page on the Lake Formation console. You will see that one of the catalogs available is the s3tablescatalog, which maintains a catalog of the table buckets you’ve created. The following steps will configure Lake Formation to make data in the s3tablescatalog catalog available to Amazon Redshift.

Next, you need to create a database in Lake Formation. The Lake Formation database maps to a Redshift schema.

  1. Choose Databases under Data Catalog in the navigation pane.
  2. On the Create menu, choose Database.

  1. Enter a name for this database. This example uses icebergsons3.
  2. For Catalog, choose the table bucket that you created. In this example, the name will have the format <ACCOUNT ID>:s3tablescatalog/redshifticeberg.
  3. Choose Create database.

You will be redirected on the Lake Formation console to a page with more information about your new database. Now you can create an Iceberg table in S3 Tables.

  1. On the database details page, on the View menu, choose Tables.

This will open up a new browser window with the table editor for this database.

  1. After the table view loads, choose Create table to start creating the table.

  1. In the editor, enter the name of the table. We call this table examples.
  2. Choose the catalog (<ACCOUNT ID>:s3tablescatalog/redshifticeberg) and database (icebergsons3).

Next, add columns to your table.

  1. In the Schema section, choose Add column, and add a column that represents an ID.

  1. Repeat this step and add columns for additional data:
    1. category_id (long)
    2. insert_date (date)
    3. data (string)

The final schema looks like the following screenshot.

  1. Choose Submit to create the table.

Next, you need to set up a read-only permission so you can query Iceberg data in S3 Tables using the Amazon Redshift Query Editor v2. For more information, see Prerequisites for managing Amazon Redshift namespaces in the AWS Glue Data Catalog.

  1. Under Administration in the navigation pane, choose Administrative roles and tasks.
  2. In the Data lake administrators section, choose Add.

  1. For Access type, select Read-only administrator.
  2. For IAM users and roles, enter AWSServiceRoleForRedshift.

AWSServiceRoleForRedshift is a service-linked role that’s managed by AWS.

  1. Choose Confirm.

You have now configured SageMaker Lakehouse using Lake Formation to allow Amazon Redshift to query Iceberg tables in S3 Tables. Next, you populate some data into the Iceberg table, and query it with Amazon Redshift.

Use SQL to query Iceberg data with Amazon Redshift

For this example, we use Athena to load data into our Iceberg table. This is one option for ingesting data into an Iceberg table; see Using Amazon S3 Tables with AWS analytics services for other options, including Amazon EMR with Spark, Amazon Data Firehose, and AWS Glue ETL.

  1. On the Athena console, navigate to the query editor.
  2. If this is your first time using Athena, you must first specify a query result location before executing your first query.
  3. In the query editor, under Data, choose your data source (AwsDataCatalog).
  4. For Catalog, choose the table bucket you created (s3tablescatalog/redshifticeberg).
  5. For Database, choose the database you created (icebergsons3).

  1. Let’s execute a query to generate data for the examples table. The following query generates over 1.5 million rows corresponding to 30 days of data. Enter the query and choose Run.
INSERT INTO icebergsons3.examples
SELECT
    b.id * (date_diff('day', CURRENT_DATE, a.insert_date) + 1),
    b.id % 1000, a.insert_date,
    CAST(random() AS varchar)
FROM
    unnest(
        sequence(CURRENT_DATE, CURRENT_DATE + INTERVAL '30' DAY, INTERVAL '1' DAY)
    ) AS a(insert_date),
    unnest(sequence(1, 50000)) AS b(id);

The following screenshot shows our query.

The query takes about 10 seconds to execute.
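
As a sanity check on the row count, the following sketch reproduces the shape of the cross join in plain Python. It assumes that sequence() includes both endpoints (as it does in Trino, which Athena is based on), so the date range spans 31 distinct dates rather than 30.

```python
from datetime import date, timedelta

start = date.today()

# sequence(CURRENT_DATE, CURRENT_DATE + INTERVAL '30' DAY, INTERVAL '1' DAY):
# both endpoints are assumed to be included, giving offsets 0 through 30.
insert_dates = [start + timedelta(days=offset) for offset in range(0, 30 + 1)]

# sequence(1, 50000) produces 50,000 ids.
ids_per_date = 50_000

# The two unnest() relations are cross joined: every date pairs with every id.
total_rows = len(insert_dates) * ids_per_date

print(f"{len(insert_dates)} dates x {ids_per_date} ids = {total_rows} rows")
```

Under that assumption, the insert produces 1,550,000 rows, which matches the "over 1.5 million rows" stated above.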

Now you can use Redshift Serverless to query the data.

  1. On the Redshift Serverless console, provision a Redshift Serverless workgroup if you haven’t already done so. For instructions, see the Get started with Amazon Redshift Serverless data warehouses guide. In this example, we use a Redshift Serverless workgroup called iceberg.
  2. Make sure that your Amazon Redshift patch version is patch 188 or higher.

  1. Choose Query data to open the Amazon Redshift Query Editor v2.

  1. In the query editor, choose the workgroup you want to use.

A pop-up window will appear, prompting you to choose which user to connect as.

  1. Select Federated user, which will use your current account, and choose Create connection.

It will take a few seconds to start the connection. When you’re connected, you will see a list of available databases.

  1. Choose External databases.

You will see the table bucket from S3 Tables in the view (in this example, this is redshifticeberg@s3tablescatalog).

  1. If you continue clicking through the tree, you will see the examples table, which is the Iceberg table you previously created that’s stored in the table bucket.

You can now use Amazon Redshift to query the Iceberg table in S3 Tables.

Before you execute the query, review the Amazon Redshift syntax for querying catalogs registered in SageMaker Lakehouse. Amazon Redshift uses the following syntax to reference a table: database@namespace.schema.table or "database@namespace".schema.table.

In this example, we use the following syntax to query the examples table in the table bucket: redshifticeberg@s3tablescatalog.icebergsons3.examples.

Learn more about this mapping in Using Amazon S3 Tables with AWS analytics services.
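
The naming pattern can be sketched as a small helper that assembles the three-part name from the pieces created earlier in this post. The function name and its parameters are hypothetical, and the sketch assumes that the database part takes the form of the table bucket name followed by @s3tablescatalog, as shown in the External databases view.

```python
def qualified_table_name(table_bucket: str, schema: str, table: str,
                         catalog: str = "s3tablescatalog",
                         quoted: bool = False) -> str:
    """Build a database@namespace.schema.table reference (hypothetical helper)."""
    database = f"{table_bucket}@{catalog}"
    if quoted:
        # Optional double-quoted form of the database part.
        database = f'"{database}"'
    return f"{database}.{schema}.{table}"


# Names used in this post: table bucket, Lake Formation database, and table.
print(qualified_table_name("redshifticeberg", "icebergsons3", "examples"))
print(qualified_table_name("redshifticeberg", "icebergsons3", "examples", quoted=True))
```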

Let’s run some queries. First, let’s see how many rows are in the examples table.

  1. Run the following query in the query editor:
SELECT count(*)
FROM redshifticeberg@s3tablescatalog.icebergsons3.examples;

The query will take a few seconds to execute. You will see the following result.

Let’s try a slightly more complicated query. In this case, we want to find all the days that had example data starting with 0.2 and a category_id between 50 and 75, keeping only the days with more than 130 matching rows. We will order the results from most to least rows.

  1. Run the following query:
SELECT examples.insert_date, count(*)
FROM redshifticeberg@s3tablescatalog.icebergsons3.examples
WHERE
    examples.data LIKE '0.2%' AND
    examples.category_id BETWEEN 50 AND 75
GROUP BY examples.insert_date
HAVING count(*) > 130
ORDER BY count DESC;

You might see different results than the following screenshot due to the randomly generated source data.

Congratulations, you have set up and queried Iceberg data in S3 Tables from Amazon Redshift!

Clean up

If you implemented the example and want to remove the resources, complete the following steps:

  1. If you no longer need your Redshift Serverless workgroup, delete the workgroup.
  2. If you don’t need to access your SageMaker Lakehouse data from the Amazon Redshift Query Editor v2, remove the data lake administrator:
    1. On the Lake Formation console, choose Administrative roles and tasks in the navigation pane.
    2. Remove the read-only data lake administrator that has the AWSServiceRoleForRedshift privilege.
  3. If you want to permanently delete the data from this post, delete the database:
    1. On the Lake Formation console, choose Databases in the navigation pane.
    2. Delete the icebergsons3 database.
  4. If you no longer need the table bucket, delete the table bucket.
  5. If you want to deactivate the integration between S3 Tables and AWS analytics services, see Migrating to the updated integration process.

Conclusion

In this post, we showed how to get started with Amazon Redshift to query Iceberg tables stored in S3 Tables. This is just the beginning for how you can use Amazon Redshift to analyze your Iceberg data that’s stored in S3 Tables. You can combine this with other Amazon Redshift features, including writing queries that join data from Iceberg tables stored in S3 Tables and Redshift Managed Storage (RMS), or implementing data access controls that give you fine-grained access control rules for different users across the S3 Tables. Additionally, you can use features like Redshift Serverless to automatically select the amount of compute for analyzing your Iceberg tables, and use AI to intelligently scale on demand and optimize query performance characteristics for your analytical workload.

We invite you to leave feedback in the comments.


About the Authors

Jonathan Katz is a Principal Product Manager – Technical on the Amazon Redshift team and is based in New York. He is a Core Team member of the open source PostgreSQL project and an active open source contributor, including PostgreSQL and the pgvector project.

Satesh Sonti is a Sr. Analytics Specialist Solutions Architect based out of Atlanta, specialized in building enterprise data platforms, data warehousing, and analytics solutions. He has over 19 years of experience in building data assets and leading complex data platform programs for banking and insurance clients across the globe.

Optimizing network footprint in serverless applications

Post Syndicated from Chris McPeek original https://aws.amazon.com/blogs/compute/optimizing-network-footprint-in-serverless-applications/

This post is authored by Anton Aleksandrov, Principal Solution Architect, AWS Serverless and Daniel Abib, Senior Specialist Solutions Architect, AWS

Serverless application developers may commonly encounter scenarios where they need to transport large payloads, especially when building modern cloud applications that need rich data. Examples include analytics services with detailed reports, e-commerce platforms with extensive product catalogs, healthcare applications transmitting patient records, or financial services aggregating transactional data.

Many serverless services have a well-defined maximum payload size. For example, AWS Lambda maximum request/response payload size is 6 MB, and Amazon Simple Queue Service (Amazon SQS) and Amazon EventBridge maximum message size is 256 KB. In this post, you will learn how to use data compression techniques to reduce your network footprint and transport larger payloads under existing constraints.

Overview

Cloud applications evolve continuously and need to be adjusted frequently for new requirements, such as new business features or new Service Level Objectives (SLO) for higher throughput and lower latency. As new use cases and data patterns are added, it is common to see request and response payload sizes increase. At some point, you might hit the maximum service payload size limits, such as 6 MB for synchronous Lambda function invokes, 10 MB for Amazon API Gateway, and 256 KB for Amazon SQS, EventBridge, and asynchronous Lambda invokes.

There are several techniques you can apply when dealing with large payloads. If your payloads are tens of MBs or more, or you need to transport large binary objects with API Gateway, you can store the payload on Amazon Simple Storage Service (Amazon S3) and use pre-signed URLs for clients to directly upload and download from S3.

Figure 1. A sample architecture for handling large payloads

Lambda function URLs response streaming supports up to 20 MB responses. For handling large messages with services such as SQS or EventBridge, you can store the message in S3 and pass a reference. The downstream consumer will use the reference to download the message directly from S3. One common characteristic of these techniques is that they introduce architectural complexity and may necessitate modifications to your existing solution architecture and data flow patterns.
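
The store-and-reference approach described above can be sketched as follows. To keep the example self-contained, an in-memory dictionary stands in for an S3 bucket; the object_store variable, the envelope field names, and both functions are hypothetical, and a real implementation would call the S3 API instead.

```python
import json
import uuid

MAX_MESSAGE_BYTES = 256 * 1024  # message size limit cited in this post

# Hypothetical stand-in for an S3 bucket: object key -> object bytes.
object_store = {}


def build_message(payload: dict) -> str:
    """Send small payloads inline; offload large ones and pass a reference."""
    inline = json.dumps({"inline": payload})
    if len(inline.encode("utf-8")) <= MAX_MESSAGE_BYTES:
        return inline
    key = f"payloads/{uuid.uuid4()}.json"
    object_store[key] = json.dumps(payload).encode("utf-8")
    return json.dumps({"s3_key": key})


def read_message(message: str) -> dict:
    """Consumer side: resolve the reference if the payload was offloaded."""
    envelope = json.loads(message)
    if "inline" in envelope:
        return envelope["inline"]
    return json.loads(object_store[envelope["s3_key"]])


small = {"order_id": 42}
large = {"report": "x" * 300_000}  # larger than the 256 KB limit

for payload in (small, large):
    message = build_message(payload)
    print(len(message), "bytes on the wire; round trip ok:", read_message(message) == payload)
```

The sketch also shows the extra moving parts this pattern introduces: the producer and the consumer both need access to the shared store and must agree on the envelope format.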

Furthermore, as your payloads grow in size, you will see increased data transfer costs, especially if your solution is transporting data through Amazon Virtual Private Cloud (VPC) NAT Gateways, VPC endpoints, or sending data across AWS Regions. For example, it is common for VPC-based solutions to have Lambda functions in their architecture. A container running on Amazon Elastic Kubernetes Service (Amazon EKS) might need to invoke a Lambda function, or a VPC-attached Lambda function might need to reach out to the public internet.

Figure 2. Examples of using virtual network appliances with serverless applications

Both NAT Gateway and VPC Endpoint are billed per GB of data processed, which makes data compression a valuable optimization technique. Go to NAT Gateway pricing and VPC Endpoint pricing for details.

The following sections explore data compression techniques and demonstrate how to apply them in your serverless applications. You can learn how to send larger payloads within the existing payload size boundaries and reduce your network footprint without significant architectural changes. This post discusses compression techniques in the context of Lambda and API Gateway, but the same principles can be applied to other services, such as SQS, EventBridge, and AWS AppSync. Understanding compression concepts better equips you to optimize your application’s data-handling capabilities.

What is data compression?

Compression is a widely used approach to reduce data size in order to improve cost-effectiveness and performance for data storage and transmission. Many tools and frameworks incorporate data compression techniques, such as gzip or zstd. For HTTP, compression is thoroughly documented in the official IANA specification and IETF RFC 9110. Browsers such as Chrome and Firefox, HTTP toolkits such as curl and Postman, and runtimes such as Node.js and Python natively handle compression, often without user involvement.

Consider the HTTP protocol. When a client wants to send a compressed payload, it specifies the compression method in the Content-Encoding request header. To receive a compressed response, the client specifies supported compression methods in the Accept-Encoding request header.

Figure 3. Accept-Encoding request header specifying supported compression methods

The server compresses the response payload using one of the supported methods and uses the Content-Encoding response header to indicate the method to the client.

Figure 4. Content-Encoding response header specifying compression method

This mechanism can accelerate client-server communications by reducing the number of bytes transmitted over the network. Compression efficiency depends on the data type. Text-based formats like JSON, XML, HTML, and YAML compress well, while binary data such as PDF and JPEG generally compress less effectively.
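
The header exchange described above can be sketched on the server side as follows. This is a simplified illustration rather than a complete implementation of RFC 9110: it only supports gzip, it ignores quality-value ordering apart from skipping codings refused with q=0, and the function names are hypothetical.

```python
import gzip
import json
from typing import Dict, Optional, Tuple


def choose_encoding(accept_encoding: str,
                    supported: Tuple[str, ...] = ("gzip",)) -> Optional[str]:
    """Return the first supported coding listed in Accept-Encoding, if any."""
    for item in accept_encoding.split(","):
        coding, _, params = item.strip().partition(";")
        coding = coding.strip().lower()
        # A coding sent with q=0 is explicitly refused by the client.
        refused = params.replace(" ", "").lower() in ("q=0", "q=0.0")
        if coding in supported and not refused:
            return coding
    return None


def build_response(payload: dict, accept_encoding: str) -> Tuple[Dict[str, str], bytes]:
    """Compress the JSON body only if the client advertised gzip support."""
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if choose_encoding(accept_encoding) == "gzip":
        body = gzip.compress(body)
        headers["Content-Encoding"] = "gzip"  # tells the client how to decode
    return headers, body


payload = {"items": ["example"] * 1000}
for accept in ("gzip, deflate, br", "br"):
    headers, body = build_response(payload, accept)
    print(accept, "->", headers.get("Content-Encoding", "identity"), len(body), "bytes")
```

When the client does not advertise a supported coding, the response is returned uncompressed and without a Content-Encoding header.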

Data compression with API Gateway

API Gateway provides built-in compression support. Use the minimumCompressionSize configuration to set the smallest payload size to compress automatically. The value can be between 0 bytes and 10 MB. Compressing very small payloads might actually increase the final payload size, and you should always test with your real payload patterns to determine the optimal threshold.

Figure 5. Handling data compression in API Gateway
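
The effect of compressing very small payloads can be observed with the gzip module from the Python standard library. The following sketch compares a tiny JSON document with a larger, repetitive one; the should_compress helper and the 1 KB threshold are hypothetical and only mimic the idea of a minimum size setting.

```python
import gzip
import json


def should_compress(raw: bytes, minimum_size: int) -> bool:
    """Mimic a size threshold: only compress payloads of at least minimum_size bytes."""
    return len(raw) >= minimum_size


small = json.dumps({"ok": True}).encode("utf-8")
large = json.dumps(
    [{"id": i, "status": "active"} for i in range(5000)]
).encode("utf-8")

for name, raw in (("small", small), ("large", large)):
    compressed = gzip.compress(raw)
    print(f"{name}: {len(raw)} bytes raw, {len(compressed)} bytes gzipped, "
          f"compress at 1 KB threshold: {should_compress(raw, 1024)}")
```

The gzip format adds a fixed header and trailer, so the tiny payload grows after compression while the large, repetitive payload shrinks substantially.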

API Gateway enables clients to interact with your API using compressed payloads through supported content encodings. The compression mechanism works bi-directionally. For JSON payloads, API Gateway seamlessly handles compression and decompression, maintaining compatibility with mapping templates. It decompresses incoming payloads before applying request mapping templates and compresses outgoing responses after applying response mapping templates. This automated compression optimizes data transfer:

  • When sending compressed data, clients supply the appropriate Content-Encoding header. API Gateway handles the decompression and applies configured mapping templates before forwarding the request to the integration.
  • When API Gateway receives an integration response and compression is enabled, it compresses the response payload and returns it to the client, provided that the client has included a matching Accept-Encoding header.

A sample test using the compression technique with API Gateway and JSON payload yielded the following results.

  • Compression disabled. Response size = 1 MB, response latency = 660 ms
  • Compression enabled. Response size = 220 KB, response latency = 550 ms

Compressing data resulted in 78% network footprint reduction and improved latency by 110 ms.

This configuration-based technique uses the API Gateway native compression. However, payloads are decompressed before being delivered to downstream integrations, thus they still remain subject to Lambda’s 6 MB max payload size. To address this, you can configure binaryMediaTypes in the API Gateway to pass compressed payloads to Lambda directly, enabling the function to handle decompression.

Figure 6. CDK code to configure API Gateway for data compression and binary data passthrough

Handling compressed data in Lambda functions

The Lambda Invoke API supports payloads in plain-text formats, such as JSON. The maximum payload size is 6 MB for synchronous invocations and 256 KB for asynchronous. Although the Invoke API supports uncompressed text-based payloads, you can introduce data compression in your function code and use API Gateway or Function URLs to facilitate content conversion, as illustrated in the following figure.

Figure 7. Transporting compressed payloads in a serverless application

Handling data compression in your Lambda function code can be done through libraries commonly embedded in the runtime. The following code snippet shows how to compress the response payload using Node.js. Similar techniques can be applied to other runtimes.

Figure 8. Sample code implementing response payload compression in a Lambda function

  • Line 1: Import gzip functionality from the zlib module.
  • Line 11: Compress and Base64-encode data. Gzip compression, similar to many other compression methods, produces a binary stream. Base64 encoding converts it to the text-based format expected by the Lambda service.
  • Lines 13-21: Response object is created with isBase64Encoded=true and response headers telling the client that the response is a gzip-encoded JSON object.
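
As an illustration of how the same technique might look in another runtime, here is a minimal Python sketch of a handler that follows the steps listed above. It is not the code from the figure: the generated payload is a hypothetical placeholder, and only standard library modules are used.

```python
import base64
import gzip
import json


def handler(event, context):
    # Hypothetical payload; a real function would build this from its own data.
    data = {"items": [{"id": i, "name": f"item-{i}"} for i in range(1000)]}

    # Gzip produces a binary stream, so Base64-encode it into text
    # before returning it from the function.
    compressed = gzip.compress(json.dumps(data).encode("utf-8"))
    body = base64.b64encode(compressed).decode("ascii")

    return {
        "statusCode": 200,
        "isBase64Encoded": True,  # the body must be decoded before delivery
        "headers": {
            "Content-Type": "application/json",
            "Content-Encoding": "gzip",  # the client sees gzip-encoded JSON
        },
        "body": body,
    }


if __name__ == "__main__":
    response = handler({}, None)
    original = json.loads(gzip.decompress(base64.b64decode(response["body"])))
    print(len(response["body"]), "Base64 characters for", len(original["items"]), "items")
```

Keep in mind that Base64 encoding increases the size of the compressed data by roughly one third, so the savings should be measured on the encoded body.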

The following screenshot shows the result: 20 MB uncompressed JSON returned from a Lambda function as a 2.5 MB compressed response body. Network footprint reduced by over 80%.

Figure 9. A screenshot from Postman showing the original and compressed payload size

Using this technique, you can reduce your network footprint and transport payload sizes several times higher than the Lambda maximum payload size.

Using Function URLs with compressed payloads

Transporting compressed payloads through Lambda Function URLs doesn’t necessitate any extra configuration. For handler responses, your code needs to compress and Base64-encode the data as shown in the preceding figure. For invocation requests, the Function URL endpoint recognizes the incoming compressed payload as binary and passes it to your handler as a Base64 encoded string in the event body.

Figure 10. Sample code implementing request payload decompression in a Lambda function
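
The request side can be sketched in Python as well. The following minimal handler is an illustration rather than the code from the figure. It assumes the event carries the body, isBase64Encoded, and headers fields described above, and the receivedItems response field is hypothetical.

```python
import base64
import gzip
import json


def handler(event, context):
    body = event.get("body") or ""

    # Binary request bodies arrive as a Base64-encoded string.
    raw = base64.b64decode(body) if event.get("isBase64Encoded") else body.encode("utf-8")

    # Header names are compared case-insensitively.
    headers = {k.lower(): v for k, v in (event.get("headers") or {}).items()}
    if headers.get("content-encoding") == "gzip":
        raw = gzip.decompress(raw)

    payload = json.loads(raw)
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"receivedItems": len(payload)}),
    }


if __name__ == "__main__":
    # Simulate a compressed request as a client would send it.
    items = [{"id": i} for i in range(100)]
    event = {
        "isBase64Encoded": True,
        "headers": {"Content-Encoding": "gzip"},
        "body": base64.b64encode(gzip.compress(json.dumps(items).encode("utf-8"))).decode("ascii"),
    }
    print(handler(event, None)["body"])
```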

Trade-offs and testing results

Compressing data in function code is a CPU-intensive activity, potentially increasing invocation duration and, as a result, function cost. This, however, can be balanced by the benefits of data compression. As you’ve seen in previous sections, while compressing data adds compute latency, transporting smaller payloads over the network reduces network latency. The following section summarizes a series of tests performed to estimate the impact of data compression on Lambda function invocation duration, Lambda function invocation cost, and data transfer savings with both NAT Gateway and VPC Endpoint. The tests were performed with several assumptions and randomly generated JSON data. You can see full testing results in the sample GitHub.com repo.

Test results demonstrated that the impact on function latency and cost primarily depends on two key factors: payload size and allocated memory (which determines vCPU capacity). Using a Node.js runtime with ARM architecture as an example, compressing a 1 MB JSON object in a function with 1 GB of allocated memory resulted in 124 ms of added processing time on average. For 10 million invocations, this extra processing time adds approximately $16. At the same time, the compression yielded a 70% reduction in payload size. With the same number of invocations, this translates to approximately $300 in savings when using NAT Gateway and $70 in savings when using VPC Endpoints (depending on the number of Availability Zones (AZs)).
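
The estimates above can be approximated with simple arithmetic. The following sketch uses the figures from this post (10 million invocations, 124 ms of added processing time, 1 GB of memory, a 1 MB payload, and a 70% size reduction). The unit prices are assumptions based on commonly listed US Region prices and are not taken from this post; replace them with current values from the pricing pages.

```python
INVOCATIONS = 10_000_000
ADDED_SECONDS = 0.124    # extra processing time per invocation
MEMORY_GB = 1.0          # allocated function memory
PAYLOAD_GB = 1 / 1000    # 1 MB payload expressed in decimal GB
REDUCTION = 0.70         # payload size reduction from compression

# Assumed unit prices in USD (check the pricing pages for current values).
LAMBDA_ARM_PER_GB_SECOND = 0.0000133334
NAT_GATEWAY_PER_GB = 0.045
VPC_ENDPOINT_PER_GB = 0.01

# Added compute cost: GB-seconds of extra duration times the unit price.
added_compute_cost = INVOCATIONS * ADDED_SECONDS * MEMORY_GB * LAMBDA_ARM_PER_GB_SECOND

# Data that no longer crosses the network appliance.
saved_gb = INVOCATIONS * PAYLOAD_GB * REDUCTION
nat_savings = saved_gb * NAT_GATEWAY_PER_GB
endpoint_savings = saved_gb * VPC_ENDPOINT_PER_GB

print(f"Added compute cost:   ${added_compute_cost:,.2f}")
print(f"Data not transferred: {saved_gb:,.0f} GB")
print(f"NAT Gateway savings:  ${nat_savings:,.2f}")
print(f"VPC Endpoint savings: ${endpoint_savings:,.2f}")
```

With these assumed prices, the results are of the same order as the estimates quoted above: the added compute cost is small compared to the data processing savings.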

AWS service pricing is updated regularly, so you should always consult the respective pricing pages for the latest information. Moreover, you should conduct your own performance and cost estimates using payloads that represent your workloads. Compression effectiveness varies significantly depending on the data type: payloads with low compression rates might not benefit from this technique.

Sample application

Follow the instructions in this GitHub repository to provision the sample in your AWS account. The project creates two Lambda functions to demonstrate receiving and returning compressed JSON using Function URLs and API Gateway.

The sample shows how to GET and POST JSON payloads using gzip compression to reduce the network footprint by over 80%.

Figure 11. A screenshot from Postman showing the original and compressed payload size

Conclusion

Data compression enables larger payload transfers and reduces network footprint. It can help to lower network latencies and optimize data transfer costs. When implementing compression within Lambda functions, it is important to consider its CPU-bound nature, which may increase function duration and costs. You should always evaluate the added compute cost against potential data transfer savings to make sure the technique benefits your use case.

Compression is most effective for large text-based payloads, and when a slight increase in compute latency, offset by reduced network latency, is acceptable.

To learn more about Serverless architectures and asynchronous Lambda invocation patterns, see Serverless Land.

AWS Weekly Roundup: AWS Pi Day, Amazon Bedrock multi-agent collaboration, Amazon SageMaker Unified Studio, Amazon S3 Tables, and more

Post Syndicated from Prasad Rao original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-aws-pi-day-amazon-bedrock-multi-agent-collaboration-amazon-sagemaker-unified-studio-amazon-s3-tables-and-more/

Thanks to everyone who joined us for the fifth annual AWS Pi Day on March 14. Since its inception in 2021, commemorating the Amazon Simple Storage Service (Amazon S3) 15th anniversary, AWS Pi Day has grown into a flagship event highlighting the transformative power of cloud technologies in data management, analytics, and AI.

This year’s virtual event featured in-depth discussions with Amazon Web Services (AWS) product teams showcasing our continued innovation in helping customers build robust data foundations for analytics and AI workloads.

Missed the live event? You can still access all content on-demand at the event page. Whether you’re developing data lakehouses, training AI models, creating generative AI applications, or optimizing analytics workloads, the shared insights will help you maximize the value of your data.

Last week’s launches
Here are some launches that got my attention during the previous week.

Amazon Bedrock now supports multi-agent collaboration – With the availability of multi-agent collaboration in Amazon Bedrock, you can create networks of specialized agents that communicate and coordinate under the guidance of a supervisor agent. You can build, deploy, and manage networks of AI agents that work together to execute complex, multi-step workflows efficiently.

Availability of fully managed DeepSeek-R1 model in Amazon Bedrock – AWS is the first cloud service provider (CSP) to deliver DeepSeek-R1 as a fully managed, generally available model. Use the capabilities of DeepSeek-R1 for your generative AI applications with a single API through this fully managed service in Amazon Bedrock.

Amazon SageMaker Unified Studio is now generally available – You can now use Amazon SageMaker Unified Studio as your single data and AI development environment, where you can find and access all of your organization’s data and work using the best tools for your specific needs. With the new simplified permissions management, you can easily bring your existing AWS resources into the unified studio. You’ll be able to find, access, and query your organization’s data and AI assets while collaborating with your team to securely build and share your analytics and AI artifacts—from data and models to generative AI applications.

Amazon Bedrock’s capabilities now generally available within Amazon SageMaker Unified Studio – SageMaker Unified Studio brings selected capabilities from Amazon Bedrock into SageMaker. You can now rapidly prototype, customize, and share generative AI applications using foundation models (FMs) and advanced features such as Amazon Bedrock Knowledge Bases, Amazon Bedrock Guardrails, Amazon Bedrock Agents, and Amazon Bedrock Flows to create tailored solutions aligned with your requirements and responsible AI guidelines, all within SageMaker.

Amazon S3 Tables integration with Amazon SageMaker Lakehouse is now generally available – Amazon S3 Tables now seamlessly integrate with Amazon SageMaker Lakehouse, making it easy for you to query and join S3 Tables with data in S3 data lakes, Amazon Redshift data warehouses, and third-party data sources. S3 Tables deliver the first cloud object store with built-in Apache Iceberg support.

Amazon S3 Tables now support create and query table operations directly from the S3 console using Amazon Athena – Amazon S3 Tables adds create and query table support in the S3 console. With this new feature, you can now create a table, populate it with data, and query it directly from the S3 console using Amazon Athena, making it easier to get started and analyze data in S3 table buckets.

Amazon S3 reduces pricing for S3 object tagging by 35% – Amazon S3 reduces pricing for S3 object tagging by 35% in all AWS Regions to $0.0065 per 10,000 tags per month. Object tags are key-value pairs applied to S3 objects that can be created, updated, or deleted at any time during the lifetime of the object.

Serverless Land Patterns available in Visual Studio Code – Serverless Land’s extensive application pattern library is now available directly in the Visual Studio Code (VS Code) IDE, making it easier for developers to build serverless applications. This integration eliminates the need to switch between your development environment and external resources when building serverless architectures by enabling you to browse, search, and implement pre-built serverless patterns directly in the VS Code IDE.

Amplify Hosting Announces Skew Protection Support – AWS Amplify Hosting now offers Skew Protection, a feature that guarantees version consistency across your deployments. This feature ensures frontend requests are always routed to the correct server backend version, eliminating version skew and making deployments more reliable.

Amazon Route 53 Traffic Flow introduces a new visual editor to improve DNS policy editing – Amazon Route 53 Traffic Flow now offers an enhanced user interface for improved DNS traffic policy editing. With this release, you can more easily understand and change the way traffic is routed between users and endpoints using the new features of the visual editor.

From community.aws
Here are some of my favorite posts from community.aws. Create your AWS Builder ID to start sharing your tips and connect with fellow builders. Your Builder ID is a universal login credential that gives you access, beyond the AWS Management Console, to AWS tools and resources, including over 600 free training courses, community features, and developer tools such as Amazon Q Developer.

Seamless SQL Server Recovery on EC2 with AWS Systems Manager (Greg Vinton) – This guide explains how to use the AWSEC2-RestoreSqlServerDatabaseWithVss automation runbook to restore a Microsoft SQL Server database on an Amazon Elastic Compute Cloud (Amazon EC2) instance.

Secure Deployment Strategies in Amazon EKS with Azure DevOps (Abhishek Nanda) – Build and deploy containerized applications on Amazon Elastic Kubernetes Service (Amazon EKS) using Azure DevOps.

Connect Your Favorite LLM Client to Bedrock (Qinjie Zhang) – It’s common to use desktop applications such as MSTY, Chatbox AI, and LM Studio to simplify working with large language models (LLMs). This blog provides a step-by-step guide on how you can connect your favorite local LLM clients to Amazon Bedrock.

From PHP to Python with the help of Amazon Q Developer (Ricardo Sueiras) – In this blog post, Ricardo showcases how to use Amazon Q Developer CLI to refactor code from one programming language to another.

Upcoming AWS events
Check your calendars and sign up for these upcoming AWS events:

AWS Community Days – Join community-led conferences that feature technical discussions, workshops, and hands-on labs led by expert AWS users and industry leaders from around the world: Milan, Italy (April 2), Bay Area – Security Edition (April 4), Timișoara, Romania (April 10), and Prague, Czech Republic (April 29).

AWS Innovate: Generative AI + Data – Join a free online conference focusing on generative AI and data innovations in Latin America on April 8.

AWS Summits – The AWS Summit season is coming along! Join free online and in-person events that bring the cloud computing community together to connect, collaborate, and learn about AWS. Register in your nearest city: Paris (April 9), Amsterdam (April 16), London (April 30), and Poland (May 5).

AWS re:Inforce (June 16–18) – Our annual learning event devoted to all things AWS Cloud security in Philadelphia, PA. Registration opens in March, so be ready to join more than 5,000 security builders and leaders.

AWS DevDays are free, technical events where developers can learn about some of the hottest topics in cloud computing. DevDays offer hands-on workshops, technical sessions, live demos, and networking with AWS technical experts and your peers. Register to access AWS DevDays sessions on demand.

That’s all for this week. Check back next Monday for another Weekly Roundup!

Prasad

This post is part of our Weekly Roundup series. Check back each week for a quick roundup of interesting news and announcements from AWS!


How is the News Blog doing? Take this 1 minute survey!

(This survey is hosted by an external company. AWS handles your information as described in the AWS Privacy Notice. AWS will own the data gathered via this survey and will not share the information collected with survey respondents.)

AWS Pi Day 2025: Data foundation for analytics and AI

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/aws-pi-day-data-foundation-for-analytics-and-ai/

Every year on March 14 (3.14), AWS Pi Day highlights AWS innovations that help you manage and work with your data. What started in 2021 as a way to commemorate the fifteenth launch anniversary of Amazon Simple Storage Service (Amazon S3) has now grown into an event that highlights how cloud technologies are transforming data management, analytics, and AI.

This year, AWS Pi Day returns with a focus on accelerating analytics and AI innovation with a unified data foundation on AWS. The data landscape is undergoing a profound transformation as AI emerges in most enterprise strategies, with analytics and AI workloads increasingly converging around a lot of the same data and workflows. You need an easy way to access all your data and use all your preferred analytics and AI tools in a single integrated experience. This AWS Pi Day, we’re introducing a slate of new capabilities that help you build unified and integrated data experiences.

The next generation of Amazon SageMaker: The center of all your data, analytics, and AI
At re:Invent 2024, we introduced the next generation of Amazon SageMaker, the center of all your data, analytics, and AI. SageMaker includes virtually all the components you need for data exploration, preparation and integration, big data processing, fast SQL analytics, machine learning (ML) model development and training, and generative AI application development. With this new generation of Amazon SageMaker, SageMaker Lakehouse provides you with unified access to your data and SageMaker Catalog helps you to meet your governance and security requirements. You can read the launch blog post written by my colleague Antje to learn more details.

Core to the next generation of Amazon SageMaker is SageMaker Unified Studio, a single data and AI development environment where you can use all your data and tools for analytics and AI. SageMaker Unified Studio is now generally available.

SageMaker Unified Studio facilitates collaboration among data scientists, analysts, engineers, and developers as they work on data, analytics, AI workflows, and applications. It provides familiar tools from AWS analytics and artificial intelligence and machine learning (AI/ML) services, including data processing, SQL analytics, ML model development, and generative AI application development, into a single user experience.

SageMaker Unified Studio

SageMaker Unified Studio also brings selected capabilities from Amazon Bedrock into SageMaker. You can now rapidly prototype, customize, and share generative AI applications using foundation models (FMs) and advanced features such as Amazon Bedrock Knowledge Bases, Amazon Bedrock Guardrails, Amazon Bedrock Agents, and Amazon Bedrock Flows to create tailored solutions aligned with your requirements and responsible AI guidelines, all within SageMaker.

Last but not least, Amazon Q Developer is now generally available in SageMaker Unified Studio. Amazon Q Developer provides generative AI powered assistance for data and AI development. It helps you with tasks like writing SQL queries, building extract, transform, and load (ETL) jobs, and troubleshooting, and is available in the Free tier and Pro tier for existing subscribers.

You can learn more about SageMaker Unified Studio in this recent blog post written by my colleague Donnie.

During re:Invent 2024, we also launched Amazon SageMaker Lakehouse as part of the next generation of SageMaker. SageMaker Lakehouse unifies all your data across Amazon S3 data lakes, Amazon Redshift data warehouses, and third-party and federated data sources. It helps you build powerful analytics and AI/ML applications on a single copy of your data. SageMaker Lakehouse gives you the flexibility to access and query your data in-place with Apache Iceberg–compatible tools and engines. In addition, zero-ETL integrations automate the process of bringing data into SageMaker Lakehouse from AWS data sources such as Amazon Aurora or Amazon DynamoDB and from applications such as Salesforce, Facebook Ads, Instagram Ads, ServiceNow, SAP, Zendesk, and Zoho CRM. The full list of integrations is available in the SageMaker Lakehouse FAQ.

Building a data foundation with Amazon S3
Building a data foundation is the cornerstone of accelerating analytics and AI workloads, enabling organizations to seamlessly manage, discover, and utilize their data assets at any scale. Amazon S3 is the world’s best place to build a data lake, with virtually unlimited scale, and it provides the essential foundation for this transformation.

I’m always astonished to learn about the scale at which we operate Amazon S3: It currently holds over 400 trillion objects, exabytes of data, and processes a mind-blowing 150 million requests per second. Just a decade ago, not even 100 customers were storing more than a petabyte (PB) of data on S3. Today, thousands of customers have surpassed the 1 PB milestone.

Amazon S3 stores exabytes of tabular data, and it averages over 15 million requests to tabular data per second. To help you reduce the undifferentiated heavy lifting when managing your tabular data in S3 buckets, we announced Amazon S3 Tables at AWS re:Invent 2024. S3 Tables are the first cloud object store with built-in support for Apache Iceberg. S3 Tables are specifically optimized for analytics workloads, resulting in up to threefold faster query throughput and up to tenfold higher transactions per second compared to self-managed tables.

Today, we’re announcing the general availability of Amazon S3 Tables integration with Amazon SageMaker Lakehouse. Amazon S3 Tables now integrate with Amazon SageMaker Lakehouse, making it easy for you to access S3 Tables from AWS analytics services such as Amazon Redshift, Amazon Athena, Amazon EMR, AWS Glue, and Apache Iceberg–compatible engines such as Apache Spark or PyIceberg. SageMaker Lakehouse enables centralized management of fine-grained data access permissions for S3 Tables and other sources and consistently applies them across all engines.

For those of you who use a third-party catalog, have a custom catalog implementation, or only need basic read and write access to tabular data in a single table bucket, we’ve added new APIs that are compatible with the Iceberg REST Catalog standard. This enables any Iceberg-compatible application to seamlessly create, update, list, and delete tables in an S3 table bucket. For unified data management across all of your tabular data, data governance, and fine-grained access controls, you can also use S3 Tables with SageMaker Lakehouse.
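To make the Iceberg REST Catalog path a little more concrete, here is a minimal sketch in Python using PyIceberg, one of the Iceberg-compatible engines named in this post. The endpoint pattern, the `s3tables` signing name, and the use of the table bucket ARN as the warehouse are assumptions on my part rather than details stated here, and the Region, ARN, and namespace values are hypothetical placeholders. Check the S3 Tables and PyIceberg documentation before relying on any of them.

```python
def s3tables_rest_catalog_properties(region: str, table_bucket_arn: str) -> dict:
    """Build PyIceberg REST catalog properties for one S3 table bucket.

    Every value below is an assumption to verify against the documentation.
    """
    return {
        "type": "rest",
        # Assumed endpoint pattern for the S3 Tables Iceberg REST endpoint.
        "uri": f"https://s3tables.{region}.amazonaws.com/iceberg",
        # Assumed: the table bucket ARN identifies the warehouse.
        "warehouse": table_bucket_arn,
        # Ask PyIceberg to sign requests with AWS Signature Version 4.
        "rest.sigv4-enabled": "true",
        "rest.signing-name": "s3tables",
        "rest.signing-region": region,
    }


def list_tables(region: str, table_bucket_arn: str, namespace: str):
    """List tables in a namespace (needs PyIceberg and AWS credentials)."""
    # Imported lazily so the property builder works without PyIceberg installed.
    from pyiceberg.catalog import load_catalog

    properties = s3tables_rest_catalog_properties(region, table_bucket_arn)
    catalog = load_catalog("s3tables", **properties)
    return catalog.list_tables(namespace)
```

Keeping the properties in a plain function makes it easy to review the connection settings separately from the call that actually reaches the service.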

To help you access S3 Tables, we’ve launched updates in the AWS Management Console. You can now create a table, populate it with data, and query it directly from the S3 console using Amazon Athena, making it easier to get started and analyze data in S3 table buckets.

The following screenshot shows how to access Athena directly from the S3 console.

S3 console: create table with Athena

When I select Query tables with Athena or Create table with Athena, it opens the Athena console on the correct data source, catalog, and database.

S3 Tables in Athena

Since re:Invent 2024, we’ve continued to add new capabilities to S3 Tables at a rapid pace. For example, we added schema definition support to the CreateTable API and you can now create up to 10,000 tables in an S3 table bucket. We also launched S3 Tables into eight additional AWS Regions, with the most recent being Asia Pacific (Seoul, Singapore, Sydney) on March 4, with more to come. You can refer to the S3 Tables AWS Regions page of the documentation to get the list of the eleven Regions where S3 Tables are available today.

Amazon S3 Metadata, announced during re:Invent 2024, has been generally available since January 27. It’s the fastest and easiest way to help you discover and understand your S3 data with automated, effortlessly queried metadata that updates in near real time. S3 Metadata works with S3 object tags. Tags help you logically group data for a variety of reasons, such as to apply IAM policies to provide fine-grained access, specify tag-based filters to manage object lifecycle rules, and selectively replicate data to another Region. In Regions where S3 Metadata is available, you can capture and query custom metadata that is stored as object tags. To reduce the cost associated with object tags when using S3 Metadata, Amazon S3 reduced pricing for S3 object tagging by 35 percent in all Regions, making it cheaper to use custom metadata.
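To get a feel for what the tagging price reduction means in practice, here is a small back-of-the-envelope sketch in Python. It uses the announced price of $0.0065 per 10,000 tags per month and the 35 percent reduction. The tag count in the usage note is a made-up example, and the "previous price" is only what the arithmetic implies, not a quoted figure.

```python
PRICE_PER_10K_TAGS_USD = 0.0065  # announced monthly price after the reduction
REDUCTION = 0.35                 # announced reduction (35 percent)


def monthly_tag_cost(num_tags: int) -> float:
    """Monthly cost in USD for storing num_tags object tags."""
    return num_tags / 10_000 * PRICE_PER_10K_TAGS_USD


def implied_previous_price() -> float:
    """Price per 10,000 tags implied by undoing the 35 percent reduction."""
    return PRICE_PER_10K_TAGS_USD / (1 - REDUCTION)
```

For example, `monthly_tag_cost(100_000_000)` comes to roughly $65 per month for a hypothetical 100 million tags, and `implied_previous_price()` works out to about $0.01 per 10,000 tags.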

AWS Pi Day 2025
Over the years, AWS Pi Day has showcased major milestones in cloud storage and data analytics. This year, the AWS Pi Day virtual event will feature a range of topics designed for developers and technical decision-makers, data engineers, AI/ML practitioners, and IT leaders. Key highlights include deep dives, live demos, and expert sessions on all the services and capabilities I discussed in this post.

By attending this event, you’ll learn how you can accelerate your analytics and AI innovation. You’ll learn how you can use S3 Tables with native Apache Iceberg support and S3 Metadata to build scalable data lakes that serve both traditional analytics and emerging AI/ML workloads. You’ll also discover the next generation of Amazon SageMaker, the center for all your data, analytics, and AI, to help your teams collaborate and build faster from a unified studio, using familiar AWS tools with access to all your data whether it’s stored in data lakes, data warehouses, or third-party or federated data sources.

For those looking to stay ahead of the latest cloud trends, AWS Pi Day 2025 is an event you can’t miss. Whether you’re building data lakehouses, training AI models, building generative AI applications, or optimizing analytics workloads, the insights shared will help you maximize the value of your data.

Tune in today and explore the latest in cloud data innovation. Don’t miss the opportunity to engage with AWS experts, partners, and customers shaping the future of data, analytics, and AI.

If you missed the virtual event on March 14, you can visit the event page at any time—we will keep all the content available on-demand there!

— seb


Amazon S3 Tables integration with Amazon SageMaker Lakehouse is now generally available

Post Syndicated from Channy Yun (윤석찬) original https://aws.amazon.com/blogs/aws/amazon-s3-tables-integration-with-amazon-sagemaker-lakehouse-is-now-generally-available/

At re:Invent 2024, we launched Amazon S3 Tables, the first cloud object store with built-in Apache Iceberg support to streamline storing tabular data at scale, and Amazon SageMaker Lakehouse to simplify analytics and AI with a unified, open, and secure data lakehouse. We also previewed S3 Tables integration with Amazon Web Services (AWS) analytics services for you to stream, query, and visualize S3 Tables data using Amazon Athena, Amazon Data Firehose, Amazon EMR, AWS Glue, Amazon Redshift, and Amazon QuickSight.

Our customers wanted to simplify the management and optimization of their Apache Iceberg storage, which led to the development of S3 Tables. They were simultaneously working to break down data silos that impede analytics collaboration and insight generation using SageMaker Lakehouse. By pairing S3 Tables with SageMaker Lakehouse, along with built-in integration with AWS analytics services, they gain a comprehensive platform that unifies access to multiple data sources and supports both analytics and machine learning (ML) workflows.

Today, we’re announcing the general availability of Amazon S3 Tables integration with Amazon SageMaker Lakehouse to provide unified S3 Tables data access across various analytics engines and tools. You can access SageMaker Lakehouse from Amazon SageMaker Unified Studio, a single data and AI development environment that brings together functionality and tools from AWS analytics and AI/ML services. All S3 Tables data integrated with SageMaker Lakehouse can be queried from SageMaker Unified Studio and engines such as Amazon Athena, Amazon EMR, Amazon Redshift, and Apache Iceberg-compatible engines like Apache Spark or PyIceberg.

With this integration, you can simplify building secure analytic workflows where you can read and write to S3 Tables and join with data in Amazon Redshift data warehouses and third-party and federated data sources, such as Amazon DynamoDB or PostgreSQL.

You can also centrally set up and manage fine-grained access permissions on the data in S3 Tables along with other data in the SageMaker Lakehouse and consistently apply them across all analytics and query engines.

S3 Tables integration with SageMaker Lakehouse in action
To get started, go to the Amazon S3 console, choose Table buckets from the navigation pane, and select Enable integration to access table buckets from AWS analytics services.

Now you can create your table bucket to integrate with SageMaker Lakehouse. To learn more, visit Getting started with S3 Tables in the AWS documentation.

1. Create a table with Amazon Athena in the Amazon S3 console
You can create a table, populate it with data, and query it directly from the Amazon S3 console using Amazon Athena with just a few steps. Select a table bucket and select Create table with Athena, or you can select an existing table and select Query table with Athena.

2. Create tables with Athena

When you want to create a table with Athena, you should first specify a namespace for your table. The namespace in an S3 table bucket is equivalent to a database in AWS Glue, and you use the table namespace as the database in your Athena queries.

Choose a namespace and select Create table with Athena. This opens the Query editor in the Athena console, where you can create a table in your S3 table bucket or query data in the table.

2. Query with Athena

2. Query with SageMaker Lakehouse in the SageMaker Unified Studio
Now you can access unified data across S3 data lakes, Redshift data warehouses, third-party and federated data sources in SageMaker Lakehouse directly from SageMaker Unified Studio.

To get started, go to the SageMaker console and create a SageMaker Unified Studio domain and project using a sample project profile: Data Analytics and AI-ML model development. To learn more, visit Create an Amazon SageMaker Unified Studio domain in the AWS documentation.

After the project is created, navigate to the project overview and scroll down to project details to note down the project role Amazon Resource Name (ARN).

3. Project details in SageMaker Unified Studio

Go to the AWS Lake Formation console and grant permissions for AWS Identity and Access Management (IAM) users and roles. In the Principals section, select the <project role ARN> noted in the previous step. Choose Named Data Catalog resources in the LF-Tags or catalog resources section and select the table bucket name you created for Catalogs. To learn more, visit Overview of Lake Formation permissions in the AWS documentation.

4. Grant permissions in Lake Formation console

When you return to SageMaker Unified Studio, you can see your table bucket project under Lakehouse in the Data menu in the left navigation pane of the project page. When you choose Actions, you can select how to query your table bucket data: in Amazon Athena, Amazon Redshift, or a JupyterLab notebook.

5. S3 Tables in Unified Studio

When you choose Query with Athena, the Query Editor opens automatically so you can run data query language (DQL) and data manipulation language (DML) queries on S3 tables using Athena.

Here is a sample query using Athena:

select * from "s3tablescatalog/s3tables-integblog-bucket"."proddb"."customer" limit 10;
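The three-part name in the sample query (catalog, namespace, table) is easy to mistype, especially when curly quotes sneak in from a copy and paste. Here is a small helper sketch in Python that assembles the reference with plain ASCII double quotes. It assumes the catalog part always takes the form `s3tablescatalog/<table bucket name>`, as in the join example later in this post. The function names are my own and not part of any AWS SDK.

```python
def quote_ident(name: str) -> str:
    """Wrap an identifier in ASCII double quotes, doubling embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def s3_table_reference(table_bucket: str, namespace: str, table: str) -> str:
    """Build "s3tablescatalog/<bucket>"."<namespace>"."<table>" for Athena."""
    catalog = f"s3tablescatalog/{table_bucket}"  # assumed catalog naming
    return ".".join(quote_ident(part) for part in (catalog, namespace, table))


def sample_query(table_bucket: str, namespace: str, table: str, limit: int = 10) -> str:
    """Return a simple select statement against one S3 table."""
    reference = s3_table_reference(table_bucket, namespace, table)
    return f"select * from {reference} limit {int(limit)};"
```

Calling `sample_query("s3tables-integblog-bucket", "proddb", "customer")` produces the same statement as the sample, with the namespace used as the database.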

6. Athena query in Unified Studio

To query with Amazon Redshift, first set up Amazon Redshift Serverless compute resources for data query analysis. Then choose Query with Redshift and run SQL in the Query Editor. If you want to use a JupyterLab notebook, create a new JupyterLab space in Amazon EMR Serverless.

3. Join data from other sources with S3 Tables data
With S3 Tables data now available in SageMaker Lakehouse, you can join it with data from data warehouses, online transaction processing (OLTP) sources such as relational or non-relational databases, Iceberg tables, and other third-party sources to gain more comprehensive and deeper insights.

For example, you can add connections to data sources such as Amazon DocumentDB, Amazon DynamoDB, Amazon Redshift, PostgreSQL, MySQL, Google BigQuery, or Snowflake and combine data using SQL without extract, transform, and load (ETL) scripts.

Now you can run a SQL query in the Query editor to join the data in S3 Tables with the data in DynamoDB.

Here is a sample Athena query that joins S3 Tables data with DynamoDB data:

select * from "s3tablescatalog/s3tables-integblog-bucket"."blogdb"."customer", 
              "dynamodb1"."default"."customer_ddb" where cust_id=pid limit 10;

To learn more about this integration, visit Amazon S3 Tables integration with Amazon SageMaker Lakehouse in the AWS documentation.
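If you prefer to run the join query shown above from a script rather than the Query editor, the following Python sketch shows one way to do it with the Athena API. It takes an Athena client as a parameter (for example `boto3.client("athena")`) so the logic can be exercised without AWS access. The `primary` workgroup, the polling interval, and the local `TIMED_OUT` marker are assumptions for illustration, and the workgroup needs a query result location configured.

```python
import time

# The join query from this post, reproduced as a Python string.
JOIN_QUERY = (
    'select * from "s3tablescatalog/s3tables-integblog-bucket"."blogdb"."customer", '
    '"dynamodb1"."default"."customer_ddb" where cust_id=pid limit 10;'
)

TERMINAL_STATES = ("SUCCEEDED", "FAILED", "CANCELLED")


def run_query(athena, sql, workgroup="primary", poll_seconds=1.0, max_polls=60):
    """Start an Athena query and poll until it reaches a terminal state.

    Returns (query_execution_id, state). "TIMED_OUT" is a local marker used
    by this sketch, not an Athena state.
    """
    started = athena.start_query_execution(QueryString=sql, WorkGroup=workgroup)
    query_id = started["QueryExecutionId"]
    for _ in range(max_polls):
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in TERMINAL_STATES:
            return query_id, state
        time.sleep(poll_seconds)
    return query_id, "TIMED_OUT"
```

A call such as `run_query(boto3.client("athena"), JOIN_QUERY)` returns the execution ID, which you can then pass to `get_query_results` to fetch rows.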

Now available
S3 Tables integration with SageMaker Lakehouse is now generally available in all AWS Regions where S3 Tables are available. To learn more, visit the S3 Tables product page and the SageMaker Lakehouse page.

Give S3 Tables a try in the SageMaker Unified Studio today and send feedback to AWS re:Post for Amazon S3 and AWS re:Post for Amazon SageMaker or through your usual AWS Support contacts.

In the annual celebration of the launch of Amazon S3, we will introduce more awesome launches for Amazon S3 and Amazon SageMaker. To learn more, join the AWS Pi Day event on March 14.

Channy

Best practices for rapidly deploying Landing Zone Accelerator on AWS

Post Syndicated from Lei Shi original https://aws.amazon.com/blogs/devops/best-practices-for-rapidly-deploying-landing-zone-accelerator-on-aws/

Landing Zone Accelerator on AWS (LZA) enables customers to deploy a flexible, configuration-driven solution to establish a landing zone while also leveraging AWS Control Tower. At AWS Professional Services, we’ve helped customers deploy and configure LZA hundreds of times. A common request we encounter is integrating LZA configuration into customers’ existing GitOps workflows. GitOps has emerged as a leading model for Infrastructure as Code (IaC), helping organizations automate and manage their cloud infrastructure. The model uses Git repositories as the single source of truth for infrastructure configuration, enabling teams to maintain consistent, version-controlled environments.

In this blog, we will focus on common LZA implementation steps based on our experience, helping customers jump-start their LZA environment and implement GitOps for their AWS infrastructure management. First, we will demonstrate how to leverage LZA while complying with your organization’s policies such as private package repositories. Next, we will guide you through a new installation of LZA that takes advantage of an auto-generated starter set of configuration files. Finally, we will direct you to another blog post that will enable you to leverage GitOps for ongoing management of your LZA configuration.

Architecture overview

The LZA solution leverages two distinct repositories: one for the LZA source code, and another for your organization’s specific configuration files. LZA creates two separate AWS CodePipeline pipelines, which are used to install the LZA solution and apply your organization’s specific configuration. Figure 1 illustrates the association between repositories and pipelines. By default, when installing LZA, the solution uses GitHub as the source and pulls the installation files published by AWS from the official LZA GitHub repository.

Figure 1. Landing Zone Accelerator solution components

Deploy LZA as a new install

Step 1: Preparing your enterprise private GitHub to host LZA source code.
Customers may choose to deploy LZA from the official AWS GitHub repository for LZA, but we often find customers have policies in place that require these types of packages to be deployed from a private repository managed by the organization. For customers using GitHub privately in their enterprise, this can be as easy as cloning the LZA source code repository into your own private GitHub repository, enabling you to take advantage of policies and controls within your organization. Before moving to the next step, take a moment and clone the repository into your own private repository. A GitHub personal access token stored in AWS Secrets Manager is required to enable the stack to access your private repository. Before deploying LZA, follow these instructions to enable stack access to your repository.
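As a rough illustration of the token step, here is a minimal Python sketch that stores a GitHub personal access token in AWS Secrets Manager. It accepts a Secrets Manager client (for example `boto3.client("secretsmanager")`) as a parameter. The default secret name `accelerator/github-token` is an assumption on our part; the LZA implementation guide is the authority on the name the installer expects.

```python
def store_github_token(secrets_client, token, secret_name="accelerator/github-token"):
    """Create a secret holding the GitHub personal access token.

    secret_name is an assumed default; confirm the expected name in the
    LZA documentation before deploying.
    Returns the ARN of the created secret.
    """
    if not token:
        raise ValueError("token must be a non-empty string")
    response = secrets_client.create_secret(Name=secret_name, SecretString=token)
    return response["ARN"]
```

Avoid hard-coding the token in scripts or shell history; reading it from an interactive prompt such as `getpass.getpass()` is one option.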

Step 2: In the organization management account, install LZA as a CloudFormation Stack.

To get started, we will be going through a new installation of the LZA solution. The following steps provide specific parameter options to the CloudFormation template to support a new installation of LZA.

Specify the following parameters for Source Code Repository Configuration (see Figure 2).

  1. For Stack name, specify a name you like.
  2. For Source Location, choose github.
  3. For Repository Owner, specify your GitHub account owner ID.
  4. For Repository Name, specify your cloned LZA source code repository.
  5. For Branch Name, specify the branch name of your LZA source code repository.

Figure 2. LZA installer stack parameters – source code repository

Specify the following parameters for Configuration Repository Configuration (see Figure 3).

  1. For Configuration Repository Location, choose s3.
  2. For Use Existing Config Repository, choose No.
  3. For Existing Config Repository Name, Existing Config Repository Branch Name, Existing Config Repository Owner, and Existing Config Repository CodeConnection ARN, leave blank.

We intentionally use S3 for the configuration repository because, as the LZA solution is installed, it auto-generates a set of starter configuration YAML files and deploys them for us in S3. This makes it very easy to get started with an initial set of customized YAML files for your environment. We choose “No” in the Use Existing Config Repository field to have LZA perform a new installation.

Figure 3. LZA installer stack parameters – configuration repository

  1. Choose Next, and complete the remainder of the stack settings.
  2. Finally, choose Create stack to launch the CloudFormation stack.

The installer stack typically completes within minutes (see Figure 4).

Figure 4. LZA installer stack completion
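For teams that script stack creation instead of using the console, the parameter choices in Step 2 could be captured roughly as follows. This is only a sketch: the parameter keys below are hypothetical names that mirror the console labels, not the keys defined in the LZA installer template, and the owner, repository, and branch values are placeholders. The `CAPABILITY_NAMED_IAM` acknowledgement is likewise an assumption. Look up the real keys in the template you deploy.

```python
# Hypothetical parameter keys mirroring the console labels in Step 2.
INSTALLER_VALUES = {
    "RepositorySource": "github",                        # Source Location
    "RepositoryOwner": "my-org",                         # placeholder owner
    "RepositoryName": "landing-zone-accelerator-on-aws", # placeholder clone name
    "RepositoryBranchName": "main",                      # placeholder branch
    "ConfigurationRepositoryLocation": "s3",             # Configuration Repository Location
    "UseExistingConfigRepo": "No",                       # new installation
}


def to_cfn_parameters(values: dict) -> list:
    """Convert a plain dict into the Parameters list CloudFormation expects."""
    return [{"ParameterKey": key, "ParameterValue": str(value)} for key, value in values.items()]


def create_installer_stack(cfn, stack_name: str, template_url: str, values: dict) -> str:
    """Create the stack with the given client and return its stack ID."""
    response = cfn.create_stack(
        StackName=stack_name,
        TemplateURL=template_url,
        Parameters=to_cfn_parameters(values),
        Capabilities=["CAPABILITY_NAMED_IAM"],  # assumed to be required
    )
    return response["StackId"]
```

Here `cfn` would be `boto3.client("cloudformation")` in the organization management account, and `template_url` points at the installer template you have chosen to deploy.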

Step 3: Validate two LZA pipelines are created and successfully completed in AWS CodePipeline console.

After the CloudFormation stack completes, open the AWS CodePipeline console. You’ll see a new pipeline named “AWSAccelerator-Installer” running (see Figure 5). This is the LZA Installer pipeline, and it’s connected to the GitHub source repository you specified in Step 2 using parameters 2 through 5. The Installer pipeline automatically generates a set of LZA configuration files stored as a compressed ZIP archive in Amazon S3. This archive is designated as the configuration repository of the LZA solution.

Figure 5. AWSAccelerator-Installer pipeline creation

When the AWSAccelerator-Installer pipeline completes, the solution automatically creates and runs a second pipeline named “AWSAccelerator-Pipeline” as shown in Figure 6. This pipeline connects to both the GitHub source repository and the newly created configuration repository in Amazon S3. The AWSAccelerator-Pipeline is the pipeline that manages your landing zone deployment and customization.

Figure 6. AWSAccelerator-Pipeline created from the AWSAccelerator-Installer pipeline

After the AWSAccelerator-Pipeline completes, your LZA solution is ready for customization.
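The validation in Step 3 can also be done programmatically. The sketch below checks whether every stage of the two pipelines reports a `Succeeded` status, using the CodePipeline `get_pipeline_state` call. The client is passed in (for example `boto3.client("codepipeline")`), and treating a pipeline with no reported stages as not yet successful is our own design choice.

```python
PIPELINES = ("AWSAccelerator-Installer", "AWSAccelerator-Pipeline")


def pipeline_succeeded(codepipeline, name: str) -> bool:
    """Return True when every stage's latest execution has status Succeeded."""
    state = codepipeline.get_pipeline_state(name=name)
    statuses = [
        stage.get("latestExecution", {}).get("status")
        for stage in state.get("stageStates", [])
    ]
    # A pipeline with no stage information is treated as not (yet) successful.
    return bool(statuses) and all(status == "Succeeded" for status in statuses)


def check_lza_pipelines(codepipeline, names=PIPELINES) -> dict:
    """Map each pipeline name to whether all of its stages succeeded."""
    return {name: pipeline_succeeded(codepipeline, name) for name in names}
```

A result where both names map to `True` corresponds to the state shown in Figures 5 and 6 once the runs have finished.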

Step 4: Migrate the LZA configuration repository from S3 to GitHub

With the AWSAccelerator-Pipeline completed, your initial landing zone is now deployed, leveraging the configuration stored in your S3 bucket. Some customers may need to ensure that changes to the landing zone configuration are controlled through their existing GitOps processes and tooling. See Figure 7 for an example where the S3 configuration files have been copied to a customer-owned GitHub repository. This transition step can be performed in a future LZA upgrade window when there is a new release of LZA source code, or right after the initial LZA installation completes in Step 3. For more information on migrating from S3 to GitHub, follow this guide to configure your AWSAccelerator-Pipeline with AWS CodeConnections.

Figure 7. CodeConnection based LZA Configuration Repository

Conclusion

In this post, we explored key steps to streamline your LZA implementation journey. By demonstrating how to work with your private package repositories, providing guidance on leveraging auto-generated configuration files, and introducing GitOps-based management, we’ve outlined a practical path to establish and maintain a robust AWS infrastructure foundation. These approaches can significantly reduce the time and complexity typically associated with LZA deployments while ensuring compliance with organizational policies. We encourage you to try these implementation steps and explore the referenced resources to enhance your AWS cloud operations. For more information about Landing Zone Accelerator, visit the AWS Landing Zone Accelerator on GitHub.

About the authors

Lei Shi

Lei Shi is a Delivery Consultant at AWS Professional Services, where he transforms complex cloud challenges into elegant solutions. He focuses on enabling organizations to build secure and scalable cloud environments throughout their AWS journey.

Adam Spicer

Adam Spicer is a Principal Cloud Architect for AWS Professional Services. He works with enterprise customers to design and build their cloud infrastructure and automation to accelerate their migration to AWS. He is an avid FSU Seminole fan who loves to be on outdoor adventures with his family.

How to restrict Amazon S3 bucket access to a specific IAM role

Post Syndicated from Chris Craig original https://aws.amazon.com/blogs/security/how-to-restrict-amazon-s3-bucket-access-to-a-specific-iam-role/

February 14, 2025: This post was updated with the recommendation to restrict S3 bucket access to an IAM role by using the aws:PrincipalArn condition key instead of the aws:userid condition key.

April 2, 2021: In the section “Granting cross-account bucket access to a specific IAM role,” we updated the second policy to fix an error.

July 11, 2016: This post was first published.


Customers often ask how to limit access to an Amazon Simple Storage Service (Amazon S3) bucket to only a specific AWS Identity and Access Management (IAM) user or role. A popular approach has been to use the Principal element to list the users or roles who need access to the bucket. However, the Principal element needs the exact values of the user ARN, role ARN, or assumed-role ARN. It does not support using a wildcard (*) to include all role sessions, nor does it allow you to use policy variables.

In this blog post, we show how to restrict S3 bucket access to a specific IAM role or user within an account by using the Condition element. Even if another user in the same account has an Admin policy or a policy with s3:*, they will be denied access if they are not explicitly listed in the Condition element. You can use this approach, for example, to limit access to a bucket with sensitive content or additional security requirements.

Solution overview

The solution in this post uses a bucket policy to restrict access to an S3 bucket, even if an entity has access to the full API of S3 through an attached identity-based policy. The following diagram illustrates how this works for accessing an S3 bucket within the same account as your IAM user or IAM role. We recommend that you use IAM roles, and only use IAM users for use cases that aren’t supported by federated users.

Figure 1: Diagram illustrating how to access an S3 bucket within the same account as your IAM user or IAM role

The workflow in Figure 1 is as follows:

  1. The IAM user’s policy and the IAM role’s identity-based policy grant access to “s3:*”.
  2. The S3 bucket policy associated with Bucket B restricts access to only the IAM role. This means that only the IAM role is able to access its content.
  3. Both the IAM user and the IAM role can access other S3 buckets (for example, Bucket A) in the account. The IAM role is able to access both buckets, but the user can access only the S3 buckets without the bucket policy attached to them. Even though both the role and the user have full “s3:*” permissions, the bucket policy negates access to the bucket for anyone that has not assumed the role.

The main difference in the cross-account approach is that every bucket must have a bucket policy attached to allow access to the IAM role from the other account. The following diagram illustrates how this works in a cross-account deployment scenario.

Figure 2: Diagram illustrating how to access an S3 bucket in a different account than your IAM role

The workflow in Figure 2 is as follows:

  1. The IAM role’s identity-based policy and the IAM user’s policy in the bucket account both grant access to “s3:*”.
  2. Bucket policy B denies access to all IAM users and roles except the role specified, and the policy defines what the role is allowed to do with the bucket.
  3. Bucket policy A allows access to the IAM role from the other account.
  4. The IAM user and IAM role can both access Bucket A because the IAM user is in the same account and there is an explicit Allow in bucket policy A for the role. The role can access both buckets because the Deny in bucket policy B is only for principals other than the IAM role.
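The two workflows above can be approximated with a small Python sketch. This is a deliberately simplified model, not how AWS implements policy evaluation: it ignores service control policies, permissions boundaries, session policies, and ACLs. The Request class and its field names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    principal_account: str  # account that owns the calling user or role
    bucket_account: str     # account that owns the bucket
    identity_allows: bool   # an identity-based policy has a matching Allow
    bucket_allows: bool     # the bucket policy has a matching Allow
    bucket_denies: bool     # the bucket policy has a matching explicit Deny


def is_allowed(req: Request) -> bool:
    """Simplified access decision for a single S3 request."""
    # An explicit Deny always wins, regardless of any Allow.
    if req.bucket_denies:
        return False
    # Same account: an Allow in either policy type is enough.
    if req.principal_account == req.bucket_account:
        return req.identity_allows or req.bucket_allows
    # Cross-account: both the identity-based policy and the bucket policy must allow.
    return req.identity_allows and req.bucket_allows
```

In this model, a same-account user with “s3:*” loses access as soon as the bucket policy's Deny matches, and a cross-account role needs an Allow on both sides, which mirrors the behavior described for Bucket A and Bucket B.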

Using the aws:PrincipalArn condition

You can use different types of condition keys to compare details about the principal making the request with the principal properties that you specify in the policy. We recommend that you use the aws:PrincipalArn key. The aws:PrincipalArn key compares the Amazon Resource Name (ARN) of the principal that made the request with the ARN that you specify in the policy.

You could also use the aws:userid policy variable to uniquely identify a user or role in your explicit Deny statements. However, finding the aws:userid value adds complexity because you have to make an API call using valid credentials. With IAM roles there is additional complexity because you need the AssumedRoleUser information, which includes not only the unique role ID but also the role-session-name that was provided when the role was assumed. For example, the aws:userid for an AssumedRoleUser will be as follows:

aws:userid – AROADBQP57FF2AEXAMPLE:role-session-name

It becomes inconvenient to manage and track these IDs when you have a large list of users and roles to be included in the policy.
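To make the aws:userid format concrete, the following Python sketch splits a value into its unique ID and session name. The UserId type and the split_userid function are hypothetical helpers for illustration. The sketch assumes the ID:session-name layout shown above for assumed roles and treats a value without a colon as having no session name.

```python
from typing import NamedTuple, Optional


class UserId(NamedTuple):
    unique_id: str               # for example, the unique role ID
    session_name: Optional[str]  # None when the value has no session part


def split_userid(userid: str) -> UserId:
    """Split an aws:userid value at the first colon."""
    unique_id, separator, session_name = userid.partition(":")
    return UserId(unique_id, session_name if separator else None)


parsed = split_userid("AROADBQP57FF2AEXAMPLE:role-session-name")
print(parsed.unique_id, parsed.session_name)
```

Because the session name changes with every role session, a policy that matches on aws:userid has to account for that suffix, which is part of the management overhead described above.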

To mitigate these challenges, we recommend that you use the aws:PrincipalArn condition key. For IAM roles, the request context returns the ARN of the role, not the ARN of the user that assumed the role. AWS recommends that you specify the ARN for resources in policies instead of unique IDs and that you perform IAM policy audits on a periodic basis. Let’s look at how to use the condition key in an IAM policy.

Granting same-account bucket access to a specific role

When accessing a bucket from within the same account, in most cases it is not necessary to use a bucket policy because the policy defines access that is already granted by the user’s direct IAM policy. S3 bucket policies are usually used for cross-account access, but you can also use them to restrict access through an explicit Deny. The Deny would be applied to all principals whether they were in the same account as the bucket or within a different account.

In this case, you use the IAM user or role ARN with the aws:PrincipalArn condition key in a StringNotEquals condition, or in a StringNotLike condition if the ARN contains a wildcard string. The condition compares the ARN of the principal that made the request with the ARN that you specify in the policy. Because aws:PrincipalArn resolves to the ARN of the role rather than the ARN of an individual role session, every session of the role matches regardless of the role session name.
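As a rough illustration of how a negated string condition behaves, the following Python sketch mimics StringNotEquals and StringNotLike against a list of ARNs. It is an approximation for reasoning about policies, not the IAM evaluation engine: the function names are hypothetical, the role name AppRole is made up, and the sketch does not model the case where the key is missing from the request context.

```python
import re
from typing import Iterable


def _like(pattern: str, value: str) -> bool:
    # IAM-style wildcards: * matches any run of characters, ? matches exactly one.
    regex = re.escape(pattern).replace(r"\*", ".*").replace(r"\?", ".")
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


def string_not_equals(principal_arn: str, listed_arns: Iterable[str]) -> bool:
    """True (the Deny applies) only if the ARN equals none of the listed values."""
    return all(principal_arn != arn for arn in listed_arns)


def string_not_like(principal_arn: str, patterns: Iterable[str]) -> bool:
    """True (the Deny applies) only if the ARN matches none of the patterns."""
    return all(not _like(pattern, principal_arn) for pattern in patterns)


role = "arn:aws:iam::111122223333:role/AppRole"    # hypothetical role
user = "arn:aws:iam::111122223333:user/alice"      # hypothetical user
print(string_not_equals(role, [role]))   # the listed role is exempt from the Deny
print(string_not_equals(user, [role]))   # any other principal is denied
```

Note that with several values in the list, the negated operators only evaluate to true when the principal matches none of them, which is why adding an ARN to the list exempts that principal from the Deny.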

Once you have the ARN of the role to which you want to allow access, you need to block the access of other users from within the same account as the bucket. An example policy to block access to the bucket and its objects for users that are not using the IAM role credentials would look like the following.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::amzn-s3-demo-bucket",
        "arn:aws:s3:::amzn-s3-demo-bucket/*"
      ],
      "Condition": {
        "StringNotEquals": {
          "aws:PrincipalArn": [
            "arn:aws:iam::111122223333:role/<ROLE-NAME>"
          ]
        }
      }
    }
  ]
}

You can use the same approach for IAM users by adding the user ARN to the condition, as shown below.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::amzn-s3-demo-bucket",
        "arn:aws:s3:::amzn-s3-demo-bucket/*"
      ],
      "Condition": {
        "StringNotEquals": {
          "aws:PrincipalARN": [
            "arn:aws:iam::111122223333:role/<ROLE-NAME>",
            "arn:aws:iam::111122223333:user/<USER-NAME>"
          ]
        }
      }
    }
  ]
}
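If you maintain this kind of policy for several buckets, you might prefer to generate it rather than edit the JSON by hand. The following Python sketch builds the same Deny statement shown above from a bucket name and a list of exempt ARNs. The build_deny_policy function and the ExampleRole name are hypothetical. The sketch only produces the policy document and does not attach it to a bucket.

```python
import json
from typing import Sequence


def build_deny_policy(bucket: str, exempt_arns: Sequence[str]) -> dict:
    """Return a bucket policy that denies s3:* to every principal not listed."""
    if not exempt_arns:
        # An empty list would deny every principal, which is easy to do by accident.
        raise ValueError("at least one exempt principal ARN is required")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                "Condition": {
                    "StringNotEquals": {"aws:PrincipalArn": list(exempt_arns)}
                },
            }
        ],
    }


policy = build_deny_policy(
    "amzn-s3-demo-bucket",
    ["arn:aws:iam::111122223333:role/ExampleRole"],  # hypothetical role name
)
print(json.dumps(policy, indent=2))
```

Serializing with json.dumps also guarantees that the output is well-formed JSON, which avoids quoting mistakes that are easy to introduce when copying ARNs between documents.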

Granting cross-account bucket access to a specific IAM role

When granting cross-account bucket access to an IAM user or role, you must define what the IAM user or role is allowed to do with the granted access. Learn more about the permissions needed to allow an IAM entity to access a bucket via the CLI/API and the console in Writing IAM Policies: How to Grant Access to an Amazon S3 Bucket. Using the information found in this blog post, an example bucket policy would look like the following.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::111122223333:role/<ROLE-NAME>"
            },
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::amzn-s3-demo-bucket"
        },
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::111122223333:role/<ROLE-NAME>"
            },
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": "arn:aws:s3:::amzn-s3-demo-bucket/*"
        },
        {
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::amzn-s3-demo-bucket",
                "arn:aws:s3:::amzn-s3-demo-bucket/*"
            ],
            "Condition": {
                "StringNotEquals": {
                    "aws:PrincipalARN": [
                        "arn:aws:iam::111122223333:role/<ROLE-NAME>"
                    ]
                }
            }
        }
    ]
}

To grant access to an IAM user in another account, you need to add the ARN for the IAM user to the aws:PrincipalArn condition as outlined in the previous section of this blog post. In addition to the aws:PrincipalArn condition, you would also need to add the IAM user’s full ARN to the Principal element of these policies. An example policy is shown below.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::444455556666:role/<ROLE-NAME>",
                    "arn:aws:iam::444455556666:user/<USER-NAME>"
                ]
            },
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::amzn-s3-demo-bucket"
        },
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::444455556666:role/<ROLE-NAME>",
                    "arn:aws:iam::444455556666:user/<USER-NAME>"
                ]
            },
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": "arn:aws:s3:::amzn-s3-demo-bucket/*"
        },
        {
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::amzn-s3-demo-bucket",
                "arn:aws:s3:::amzn-s3-demo-bucket/*"
            ],
            "Condition": {
                "StringNotEquals": {
                    "aws:PrincipalARN": [
                        "arn:aws:iam::444455556666:role/<ROLE-NAME>",
                        "arn:aws:iam::444455556666:user/<USER-NAME>"
                    ]
                }
            }
        }
    ]
}
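Because the same ARNs have to appear in both the Allow statements and the Deny condition, it is easy for the two lists to drift apart. The following Python sketch is one possible consistency check: it reports principals that an Allow statement names but that the Deny statement does not exempt. The function name is hypothetical, and the sketch assumes a policy shaped like the example above, with a single Deny statement that uses StringNotEquals on aws:PrincipalArn.

```python
from typing import Iterable, Set, Union


def _as_list(value: Union[str, Iterable[str]]) -> list:
    # Policy elements can hold either one string or a list of strings.
    return [value] if isinstance(value, str) else list(value)


def principals_blocked_by_deny(policy: dict) -> Set[str]:
    """Return ARNs that are allowed by name but not exempted from the Deny."""
    allowed: Set[str] = set()
    exempt: Set[str] = set()
    for statement in policy["Statement"]:
        if statement["Effect"] == "Allow":
            principal = statement.get("Principal", {})
            if isinstance(principal, dict):
                allowed.update(_as_list(principal.get("AWS", [])))
        elif statement["Effect"] == "Deny":
            condition = statement.get("Condition", {}).get("StringNotEquals", {})
            for key, value in condition.items():
                # Condition key names are not case-sensitive.
                if key.lower() == "aws:principalarn":
                    exempt.update(_as_list(value))
    return allowed - exempt
```

An empty result means every principal named in an Allow statement is also exempt from the Deny. A non-empty result lists principals whose Allow would be overridden by the explicit Deny.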

In addition to including role permissions in the bucket policy, you need to define these permissions in the identity-based policy of the IAM user or role. Add the permissions to a customer managed policy and attach it to the role or user in the IAM console, using the following example policy document.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:ListAllMyBuckets",
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::amzn-s3-demo-bucket"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": "arn:aws:s3:::amzn-s3-demo-bucket/*"
    }
  ]
}

By following the guidance in this post, you can restrict S3 bucket access to a specific IAM role or user in same-account and cross-account scenarios, even if the user has an Admin policy or a policy with “s3:*”. This logic has many applications, and requirements vary across use cases. We recommend that you employ the principle of least privilege wherever possible and grant only the minimum permissions that are required to perform necessary tasks.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Identity and Access Management re:Post or contact AWS Support.
 

Chris Craig

The original author of this blog post is no longer at AWS. In 2016, when this post was first published, we did not include author bios.

Laura Verghote

Laura is a Senior Solutions Architect for public sector customers in the Europe, Middle East, and Africa (EMEA) region. She works with customers to design and build solutions in the AWS Cloud, bridging the gap between complex business requirements and technical solutions. She joined AWS as a technical trainer and has wide experience delivering training content to developers, administrators, architects, and partners across EMEA.

Ashwin Phadke

Ashwin is a Senior Solutions Architect working with large enterprises and independent software vendor (ISV) customers to build highly available, scalable, and secure applications, and to help them successfully navigate their cloud journeys. He is passionate about information security and enjoys working on creative solutions for customers’ security challenges.