Tag Archives: Technical How-to

Resolve private DNS hostnames for Amazon MSK Connect

Post Syndicated from Amar Surjit original https://aws.amazon.com/blogs/big-data/resolve-private-dns-hostnames-for-amazon-msk-connect/

Amazon MSK Connect is a feature of Amazon Managed Streaming for Apache Kafka (Amazon MSK) that offers a fully managed Apache Kafka Connect environment on AWS. With MSK Connect, you can deploy fully managed connectors built for Kafka Connect that move data into or pull data from popular data stores like Amazon S3 and Amazon OpenSearch Service. With the introduction of private DNS support in MSK Connect, connectors can resolve private customer domain names using the DNS servers configured in the customer VPC DHCP option set. This post demonstrates a solution for resolving private DNS hostnames defined in a customer VPC for MSK Connect.

You may want to use private DNS hostname support for MSK Connect for multiple reasons. Before the private DNS resolution capability was added, MSK Connect used the service VPC DNS resolver for DNS resolution and didn't use the private DNS servers defined in the customer VPC DHCP option sets. Connectors could only reference hostnames in the connector configuration or plugin that were publicly resolvable; they couldn't resolve private hostnames defined in a private hosted zone or use DNS servers in another customer network.

Many customers ensure that their internal application DNS names are not publicly resolvable. For example, you might have a MySQL or PostgreSQL database and not want its DNS name to be publicly resolvable or accessible. Amazon Relational Database Service (Amazon RDS) and Amazon Aurora servers have DNS names that are publicly resolvable but not accessible. You can have multiple internal applications, such as databases, data warehouses, or other systems, whose DNS names are not publicly resolvable.

With the recent launch of MSK Connect private DNS support, you can configure connectors to reference public or private domain names. Connectors use the DNS servers configured in your VPC’s DHCP option set to resolve domain names. You can now use MSK Connect to privately connect with databases, data warehouses, and other resources in your VPC to comply with your security needs.

If you have a MySQL or PostgreSQL database with a private DNS name, you can host that name on a custom DNS server and configure the VPC-specific DHCP option set so that DNS resolution uses the custom DNS server local to the VPC instead of the service DNS resolution.

Solution overview

You can choose from several architecture options when setting up MSK Connect. For example, the source system, Amazon MSK, and MSK Connect can all be in the same VPC; the source system can be in VPC1 while Amazon MSK and MSK Connect are in VPC2; or the source system, Amazon MSK, and MSK Connect can each be in different VPCs.

The following setup uses two different VPCs, where the MySQL VPC hosts the MySQL database and the MSK VPC hosts Amazon MSK, MSK Connect, the DNS server, and various other components. You can extend this architecture to support other deployment topologies using appropriate AWS Identity and Access Management (IAM) permissions and connectivity options.

This post provides step-by-step instructions to set up MSK Connect so that it receives data from a source MySQL database with a private DNS hostname in the MySQL VPC and sends the data to Amazon MSK in another VPC. The following diagram illustrates the high-level architecture.

The setup instructions include the following key steps:

  1. Set up the VPCs, subnets, and other core infrastructure components.
  2. Install and configure the DNS server.
  3. Upload the data to the MySQL database.
  4. Deploy Amazon MSK and MSK Connect and consume the change data capture (CDC) records.

Prerequisites

To follow the tutorial in this post, you need the following:

Create the required infrastructure using AWS CloudFormation

Before configuring MSK Connect, we need to set up the VPCs, subnets, and other core infrastructure components. To set up resources in your AWS account, complete the following steps:

  1. Choose Launch Stack to launch the stack in a Region that supports Amazon MSK and MSK Connect.
  2. Specify the private key that you use to connect to the EC2 instances.
  3. Update the SSH location with your local IP address and keep the other values as default.
  4. Choose Next.
  5. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  6. Choose Create stack and wait for the required resources to get created.

The CloudFormation template creates the following key resources in your account:

  • VPCs:
    • MSK VPC
    • MySQL VPC
  • Subnets in the MSK VPC:
    • Three private subnets for Amazon MSK
    • Private subnet for DNS server
    • Private subnet for MSKClient
    • Public subnet for bastion host
  • Subnets in the MySQL VPC:
    • Private subnet for MySQL database
    • Public subnet for bastion host
  • Internet gateway attached to the MySQL VPC and MSK VPC
  • NAT gateways attached to MySQL public subnet and MSK public subnet
  • Route tables to support the traffic flow between different subnets in a VPC and across VPCs
  • Peering connection between the MySQL VPC and MSK VPC
  • MySQL database and configurations
  • DNS server
  • MSK client with respective libraries

Please note that if you're using VPC peering or AWS Transit Gateway with MSK Connect, don't configure your connector to reach the peered VPC resources using IP addresses in the restricted CIDR ranges. For more information, refer to Connecting from connectors.

Configure the DNS server

Complete the following steps to configure the DNS server:

  1. Connect to the DNS server. There are three configuration files available on the DNS server under the /home/ec2-user folder:
    • named.conf
    • mysql.internal.zone
    • kafka.us-east-1.amazonaws.com.zone
  2. Run the following commands to install and configure your DNS server:
    sudo yum install bind bind-utils -y
    cp /home/ec2-user/named.conf /etc/named.conf
    chmod 644 /etc/named.conf
    cp mysql.internal.zone /var/named/mysql.internal.zone
    cp kafka.region.amazonaws.com.zone /var/named/kafka.region.amazonaws.com.zone
    

  3. Update /etc/named.conf. For the allow-transfer attribute, add the DNS server internal IP address so that the line reads allow-transfer { localhost; <DNS Server internal IP address>; };.

You can find the DNS server IP address on the CloudFormation template Outputs tab.

Note that the MSK cluster is not set up yet at this stage. We need to update the Kafka broker DNS names and their respective internal IP addresses in the /var/named/kafka.region.amazonaws.com configuration file after setting up the MSK cluster later in this post. For instructions, refer to the Update the /var/named/kafka.region.amazonaws.com zone file section later in this post.

Also note that these settings configure the DNS server for this post. In your own environment, you can configure the DNS server as per your needs.
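For reference, the allow-transfer change typically sits inside the options block of named.conf. The following is only a minimal sketch of standard BIND syntax; the actual file provided by the CloudFormation template may differ, and the IP address shown is a placeholder:

options {
    directory "/var/named";
    allow-query { any; };
    allow-transfer { localhost; <DNS Server internal IP address>; };
};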

  4. Restart the DNS service:
    sudo su
    service named restart
    

You should see the following message:

Redirecting to /bin/systemctl restart named.service

Your custom DNS server is up and running now.
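To confirm that the zone is being served before moving on, you can query the DNS server directly. This is a quick sanity check, assuming the zone file defines the local.mysql.internal record used later in this post:

nslookup local.mysql.internal localhost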

Upload the data to the MySQL database

Typically, you would use an Amazon RDS for MySQL database, but for this post, we use a custom MySQL database server. The Amazon RDS DNS name is publicly resolvable and MSK Connect has always supported it, but in the past MSK Connect couldn't support databases or applications with private DNS names. With the private DNS hostnames feature, it can now support applications' private DNS as well, so we use a MySQL database on an EC2 instance.

This section describes setting up the MySQL database on a single-node EC2 instance. This setup should not be used for production. You should follow appropriate guidance for setting up and configuring MySQL in your account.

The MySQL database is already set up using the CloudFormation template and is ready to use now. To upload the data, complete the following steps:

  1. SSH to the MySQL EC2 instance. For instructions, refer to Connect to your Linux instance. The data file salesdb.sql is already downloaded and available under the /home/ec2-user directory.
  2. Log in to mysqldb with the user name master.
  3. To access the password, navigate to AWS Systems Manager and choose Parameter Store in the navigation pane. Select /Database/Credentials/master, choose View Details, and copy the value for the key (alternatively, you can retrieve it with the AWS CLI; see the sketch at the end of this section).
  4. Log in to MySQL using the following command:
    mysql -umaster -p<MySQLMasterUserPassword>
    

  5. Run the following commands to create the salesdb database and load the data to the table:
    use salesdb;
    source /home/ec2-user/salesdb.sql;

This will insert the records into various tables in the salesdb database.

  6. Run show tables to see the following tables in the salesdb database:
    mysql> show tables;
    +-----------------------+
    | Tables_in_salesdb |
    +-----------------------+
    | CUSTOMER |
    | CUSTOMER_SITE |
    | PRODUCT |
    | PRODUCT_CATEGORY |
    | SALES_ORDER |
    | SALES_ORDER_ALL |
    | SALES_ORDER_DETAIL |
    | SALES_ORDER_DETAIL_DS |
    | SALES_ORDER_V |
    | SUPPLIER |
    +-----------------------+
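If you prefer the AWS CLI over the console for retrieving the database password, a command like the following works as a sketch (it assumes the stack stores the value at the parameter path shown earlier; --with-decryption covers the case where it is a SecureString):

aws ssm get-parameter \
    --name /Database/Credentials/master \
    --with-decryption \
    --query Parameter.Value \
    --output text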

Create a DHCP option set

DHCP option sets give you control over the following aspects of routing in your virtual network:

  • You can control the DNS servers, domain names, or Network Time Protocol (NTP) servers used by the devices in your VPC.
  • You can disable DNS resolution completely in your VPC.

To support private DNS, you can use an Amazon Route 53 private zone or your own custom DNS server. If you use a Route 53 private zone, the setup will work automatically and there is no need to make any changes to the default DHCP option set for the MSK VPC. For a custom DNS server, complete the following steps to set up a custom DHCP configuration using Amazon Virtual Private Cloud (Amazon VPC) and attach it to the MSK VPC.

There will be a default DHCP option set in your VPC attached to the Amazon provided DNS server. At this stage, the requests will go to Amazon’s provided DNS server for resolution. However, we create a new DHCP option set because we’re using a custom DNS server.

  1. On the Amazon VPC console, choose DHCP option set in the navigation pane.
  2. Choose Create DHCP option set.
  3. For DHCP option set name, enter MSKConnect_Private_DHCP_OptionSet.
  4. For Domain name, enter mysql.internal.
  5. For Domain name server, enter the DNS server IP address.
  6. Choose Create DHCP option set.
  7. Navigate to the MSK VPC and on the Actions menu, choose Edit VPC settings.
  8. Select the newly created DHCP option set and save it.
    The following screenshot shows the example configurations.
  9. On the Amazon EC2 console, navigate to privateDNS_bastion_host.
  10. Choose Instance state and Reboot instance.
  11. Wait a few minutes, then run nslookup from the bastion host; it should resolve the name using your local DNS server instead of Route 53:
nslookup local.mysql.internal
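If you prefer to script the DHCP option set change, the console steps above map roughly to the following AWS CLI calls. This is a sketch; the DNS server IP address, VPC ID, and bastion instance ID are placeholders you must replace:

aws ec2 create-dhcp-options \
    --dhcp-configurations \
        "Key=domain-name,Values=mysql.internal" \
        "Key=domain-name-servers,Values=<DNS server IP address>" \
    --tag-specifications 'ResourceType=dhcp-options,Tags=[{Key=Name,Value=MSKConnect_Private_DHCP_OptionSet}]'

aws ec2 associate-dhcp-options \
    --dhcp-options-id <DHCP options ID returned by the previous command> \
    --vpc-id <MSK VPC ID>

aws ec2 reboot-instances --instance-ids <privateDNS_bastion_host instance ID>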

Now our base infrastructure setup is ready to move to the next stage. As part of our base infrastructure, we have set up the following key components successfully:

  • MSK and MySQL VPCs
  • Subnets
  • EC2 instances
  • VPC peering
  • Route tables
  • NAT gateways and internet gateways
  • DNS server and configuration
  • Appropriate security groups and NACLs
  • MySQL database with the required data

At this stage, the MySQL DB DNS name is resolvable using a custom DNS server instead of Route 53.

Set up the MSK cluster and MSK Connect

The next step is to deploy the MSK cluster and MSK Connect, which will fetch records from the salesdb and send it to an Amazon Simple Storage Service (Amazon S3) bucket. In this section, we provide a walkthrough of replicating the MySQL database (salesdb) to Amazon MSK using Debezium, an open-source connector. The connector will monitor for any changes to the database and capture any changes to the tables.

With MSK Connect, you can run fully managed Apache Kafka Connect workloads on AWS. MSK Connect provisions the required resources and sets up the cluster. It continuously monitors the health and delivery state of connectors, patches and manages the underlying hardware, and auto scales connectors to match changes in throughput. As a result, you can focus your resources on building applications rather than managing infrastructure.

MSK Connect will make use of the custom DNS server in the VPC and it won’t be dependent on Route 53.

Create an MSK cluster configuration

Complete the following steps to create an MSK cluster:

  1. On the Amazon MSK console, choose Cluster configurations under MSK clusters in the navigation pane.
  2. Choose Create configuration.
  3. Name the configuration mskc-tutorial-cluster-configuration.
  4. Under Configuration properties, remove everything and add the line auto.create.topics.enable=true.
  5. Choose Create.
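As an alternative to the console, the same cluster configuration can be created from the AWS CLI. The following is a sketch, assuming the single property is written to a local file first:

echo "auto.create.topics.enable=true" > cluster-config.txt

aws kafka create-configuration \
    --name mskc-tutorial-cluster-configuration \
    --server-properties fileb://cluster-config.txt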

Create an MSK cluster and attach the configuration

In the next step, we attach this configuration to a cluster. Complete the following steps:

  1. On the Amazon MSK console, choose Clusters under MSK clusters in the navigation pane.
  2. Choose Create cluster, then choose Custom create.
  3. For the cluster name, enter mkc-tutorial-cluster.
  4. Under General cluster properties, choose Provisioned for the cluster type and use the Apache Kafka default version 2.8.1.
  5. Use all the default options for the Brokers and Storage sections.
  6. Under Configurations, choose Custom configuration.
  7. Select mskc-tutorial-cluster-configuration with the appropriate revision and choose Next.
  8. Under Networking, choose the MSK VPC.
  9. Select the Availability Zones depending on your Region, such as us-east-1a, us-east-1b, and us-east-1c, and the respective private subnets MSK-Private-1, MSK-Private-2, and MSK-Private-3 if you are in the us-east-1 Region. Public access to these brokers should be off.
  10. Copy the security group ID from Chosen security groups.
  11. Choose Next.
  12. Under Access control methods, select IAM role-based authentication.
  13. In the Encryption section, under Between clients and brokers, TLS encryption will be selected by default.
  14. For Encrypt data at rest, select Use AWS managed key.
  15. Use the default options for Monitoring and select Basic monitoring.
  16. Select Deliver to Amazon CloudWatch Logs.
  17. Under Log group, choose visit Amazon CloudWatch Logs console.
  18. Choose Create log group.
  19. Enter a log group name and choose Create.
  20. Return to the Monitoring and tags page and under Log groups, choose Choose log group.
  21. Choose Next.
  22. Review the configurations and choose Create cluster. You’re redirected to the details page of the cluster.
  23. Under Security groups applied, note the security group ID to use in a later step.

Cluster creation can typically take 25–30 minutes. Its status changes to Active when it’s created successfully.
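You can check the cluster state and, once it is Active, retrieve the IAM bootstrap broker string from the AWS CLI as well. This sketch assumes you substitute your cluster ARN:

aws kafka describe-cluster --cluster-arn <MSK cluster ARN> --query 'ClusterInfo.State'

aws kafka get-bootstrap-brokers --cluster-arn <MSK cluster ARN>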

Update the /var/named/kafka.region.amazonaws.com zone file

Before you create the MSK connector, update the DNS server configurations with the MSK cluster details.

  1. To get the list of bootstrap server DNS and respective IP addresses, navigate to the cluster and choose View client information.
  2. Copy the bootstrap server information with IAM authentication type.
  3. You can identify the broker IP addresses by running nslookup from your local machine; it returns each broker's internal (local) IP address. At this stage, your VPC points to the new DHCP option set, and your custom DNS server will not yet be able to resolve these broker DNS names from within your VPC.
    nslookup <broker 1 DNS name>

Now you can log in to the DNS server and update the records for different brokers and respective IP addresses in the /var/named/kafka.region.amazonaws.com file.

  4. Upload the msk-access.pem file to BastionHostInstance from your local machine:
    scp -i "<your pem file>" msk-access.pem ec2-user@<BastionHostInstance IP address>:/home/ec2-user/

  5. Log in to the DNS server, open the /var/named/kafka.region.amazonaws.com file, and update the following lines with the correct MSK broker DNS names and their respective IP addresses:
    <b-1.<clustername>.******.c6> IN A <Internal IP Address - broker 1>
    <b-2.<clustername>.******.c6> IN A <Internal IP Address - broker 2>
    <b-3.<clustername>.******.c6> IN A <Internal IP Address - broker 3>
    

Note that you need to provide the broker DNS as mentioned earlier. Remove .kafka.<region id>.amazonaws.com from the broker DNS name.

  6. Restart the DNS service:
    sudo su
    service named restart

You should see the following message:

Redirecting to /bin/systemctl restart named.service

Your custom DNS server is up and running, and you should now be able to resolve the broker DNS names using the internal DNS server.

Update the security group for connectivity between the MySQL database and MSK Connect

It’s important to have the appropriate connectivity in place between MSK Connect and the MySQL database. Complete the following steps:

  1. On the Amazon MSK console, navigate to the MSK cluster and under Network settings, copy the security group.
  2. On the Amazon EC2 console, choose Security groups in the navigation pane.
  3. Edit the security group MySQL_SG and choose Add rule.
  4. Add a rule with MySQL/Aurora as the type and the MSK security group as the inbound resource for its source.
  5. Choose Save rules.
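If you prefer the AWS CLI, the same inbound rule can be added with a single call. This sketch assumes you substitute the two security group IDs captured earlier:

aws ec2 authorize-security-group-ingress \
    --group-id <MySQL_SG security group ID> \
    --protocol tcp \
    --port 3306 \
    --source-group <MSK cluster security group ID>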

Create the MSK connector

To create your MSK connector, complete the following steps:

  1. On the Amazon MSK console, choose Connectors under MSK Connect in the navigation pane.
  2. Choose Create connector.
  3. Select Create custom plugin.
  4. Download the MySQL connector plugin for the latest stable release from the Debezium site or download Debezium.zip.
  5. Upload the MySQL connector zip file to the S3 bucket.
  6. Copy the URL for the file, such as s3://<bucket name>/Debezium.zip.
  7. Return to the Choose custom plugin page and enter the S3 file path for S3 URI.
  8. For Custom plugin name, enter mysql-plugin.
  9. Choose Next.
  10. For Name, enter mysql-connector.
  11. For Description, enter a description of the connector.
  12. For Cluster type, choose MSK Cluster.
  13. Select the existing cluster from the list (for this post, mkc-tutorial-cluster).
  14. Specify the authentication type as IAM.
  15. Use the following values for Connector configuration:
    connector.class=io.debezium.connector.mysql.MySqlConnector
    database.history.producer.sasl.mechanism=AWS_MSK_IAM
    database.history.producer.sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required;
    database.allowPublicKeyRetrieval=true
    database.user=master
    database.server.id=123456
    tasks.max=1
    database.history.consumer.sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required;
    database.history.producer.security.protocol=SASL_SSL
    database.history.kafka.topic=dbhistory.salesdb
    database.history.kafka.bootstrap.servers=b-3.xxxxx.yyyy.zz.kafka.us-east-2.amazonaws.com:9098,b-1.xxxxx.yyyy.zz.kafka.us-east-2.amazonaws.com:9098,b-2.xxxxx.yyyy.zz.kafka.us-east-2.amazonaws.com:9098
    database.server.name=salesdb-server
    database.history.producer.sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler
    database.history.consumer.sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler
    database.history.consumer.security.protocol=SASL_SSL
    database.port=3306
    include.schema.changes=true
    database.hostname=local.mysql.internal
    database.password=xxxxxx
    table.include.list=salesdb.SALES_ORDER,salesdb.SALES_ORDER_ALL,salesdb.CUSTOMER
    database.history.consumer.sasl.mechanism=AWS_MSK_IAM
    database.include.list=salesdb
    

  16. Update the following connector configuration:
    database.user=master
    database.hostname=local.mysql.internal
    database.password=<MySQLMasterUserPassword>
    

  17. For Capacity type, choose Provisioned.
  18. For MCU count per worker, enter 1.
  19. For Number of workers, enter 1.
  20. Select Use the MSK default configuration.
  21. In the Access Permissions section, on the Choose service role menu, choose MSK-Connect-PrivateDNS-MySQLConnector*, then choose Next.
  22. In the Security section, keep the default settings.
  23. In the Logs section, select Deliver to Amazon CloudWatch logs.
  24. Choose visit Amazon CloudWatch Logs console.
  25. Under Logs in the navigation pane, choose Log group.
  26. Choose Create log group.
  27. Enter the log group name, retention settings, and tags, then choose Create.
  28. Return to the connector creation page and choose Browse log group.
  29. Choose the AmazonMSKConnect log group, then choose Next.
  30. Review the configurations and choose Create connector.

Wait for the connector creation process to complete (about 10–15 minutes).

The MSK Connect connector is now up and running. You can log in to the MySQL database using your user ID and make a couple of changes to records in the CUSTOMER table. MSK Connect will receive the CDC records, and the updates will be available in the MSK CUSTOMER topic (salesdb-server.salesdb.CUSTOMER).
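For example, an update like the following (a hypothetical change consistent with the CDC output shown later in this post) generates an update event on that topic:

mysql -umaster -p<MySQLMasterUserPassword> salesdb \
    -e "UPDATE CUSTOMER SET MKTSEGMENT = 'Market Segment 10' WHERE CUST_ID = 2000;"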

Consume messages from the MSK topic

To consume messages from the MSK topic, run the Kafka consumer on the MSK_Client EC2 instance available in the MSK VPC.

  1. SSH to the MSK_Client EC2 instance. The MSK_Client instance has the required Kafka client libraries, Amazon MSK IAM JAR file, client.properties file, and an instance profile attached to it, along with the appropriate IAM role using the CloudFormation template.
  2. Add the MSKClientSG security group as the source for the MSK security group with the following properties:
    • For Type, choose All Traffic.
    • For Source, choose Custom and MSK Security Group.

    Now you’re ready to consume data.

  3. To list the topics, run the following command:
    ./kafka-topics.sh --bootstrap-server <BootstrapServerString> --command-config client.properties --list

  4. To consume data from the salesdb-server.salesdb.CUSTOMER topic, use the following command:
    ./kafka-console-consumer.sh --bootstrap-server <BootstrapServerString> --consumer.config client.properties --topic salesdb-server.salesdb.CUSTOMER --from-beginning

Run the Kafka consumer on your EC2 machine and you will see messages similar to the following:

Struct{after=Struct{CUST_ID=1998.0,NAME=Customer Name 1998,MKTSEGMENT=Market Segment 3},source=Struct{version=1.9.5.Final,connector=mysql,name=salesdb-server,ts_ms=1678099992174,snapshot=true,db=salesdb,table=CUSTOMER,server_id=0,file=binlog.000001,pos=43298383,row=0},op=r,ts_ms=1678099992174}
Struct{after=Struct{CUST_ID=1999.0,NAME=Customer Name 1999,MKTSEGMENT=Market Segment 7},source=Struct{version=1.9.5.Final,connector=mysql,name=salesdb-server,ts_ms=1678099992174,snapshot=true,db=salesdb,table=CUSTOMER,server_id=0,file=binlog.000001,pos=43298383,row=0},op=r,ts_ms=1678099992174}
Struct{after=Struct{CUST_ID=2000.0,NAME=Customer Name 2000,MKTSEGMENT=Market Segment 9},source=Struct{version=1.9.5.Final,connector=mysql,name=salesdb-server,ts_ms=1678099992174,snapshot=last,db=salesdb,table=CUSTOMER,server_id=0,file=binlog.000001,pos=43298383,row=0},op=r,ts_ms=1678099992174}
Struct{before=Struct{CUST_ID=2000.0,NAME=Customer Name 2000,MKTSEGMENT=Market Segment 9},after=Struct{CUST_ID=2000.0,NAME=Customer Name 2000,MKTSEGMENT=Market Segment10},source=Struct{version=1.9.5.Final,connector=mysql,name=salesdb-server,ts_ms=1678100372000,db=salesdb,table=CUSTOMER,server_id=1,file=binlog.000001,pos=43298616,row=0,thread=67},op=u,ts_ms=1678100372612}

While testing the application, records with CUST_ID 1998, 1999, and 2000 were updated, and these records are available in the logs.

Clean up

It's always a good practice to clean up all the resources created as part of this post to avoid additional cost. To clean up your resources, delete the MSK cluster, MSK Connect connector, EC2 instances, DNS server, bastion host, S3 bucket, VPCs, subnets, and CloudWatch logs.

Additionally, clean up all other AWS resources that you created using AWS CloudFormation. You can delete these resources on the AWS CloudFormation console by deleting the stack.

Conclusion

In this post, we discussed the process of setting up MSK Connect using a private DNS. This feature allows you to configure connectors to reference public or private domain names.

We were able to receive the initial load and CDC records from a MySQL database hosted in a separate VPC whose DNS name is not accessible or resolvable externally. MSK Connect was able to connect to the MySQL database and consume the records using the MSK Connect private DNS feature. The custom DHCP option set was attached to the VPC, which ensured DNS resolution was performed using the local DNS server instead of Route 53.

With the MSK Connect private DNS support feature, you can keep your databases, data warehouses, and systems such as secret managers that work within your VPC inaccessible to the internet, while still connecting to them from MSK Connect and complying with your corporate security posture.

To learn more and get started, refer to Private DNS for MSK Connect.


About the author

Amar is a Senior Solutions Architect at AWS based in the UK. He works with power, utilities, manufacturing, and automotive customers on strategic implementations, specializing in AWS streaming and advanced data analytics solutions to drive optimal business outcomes.

SmugMug’s durable search pipelines for Amazon OpenSearch Service

Post Syndicated from Lee Shepherd original https://aws.amazon.com/blogs/big-data/smugmugs-durable-search-pipelines-for-amazon-opensearch-service/

SmugMug operates two very large online photo platforms, SmugMug and Flickr, enabling more than 100 million customers to safely store, search, share, and sell tens of billions of photos. Customers uploading and searching through decades of photos helped turn search into critical infrastructure, growing steadily since SmugMug first used Amazon CloudSearch in 2012, followed by Amazon OpenSearch Service since 2018, after reaching billions of documents and terabytes of search storage.

Here, Lee Shepherd, SmugMug Staff Engineer, shares the search architecture SmugMug uses to publish, backfill, and mirror live traffic to multiple clusters. SmugMug uses these pipelines to benchmark, validate, and migrate to new configurations, including Graviton-based r6gd.2xlarge instances (from i3.2xlarge), along with testing Amazon OpenSearch Serverless. We cover three pipelines used for publishing, backfilling, and querying without introducing spiky, unrealistic traffic patterns, and without any impact on production services.

There are two main architectural pieces critical to the process:

  • A durable source of truth for index data. It’s best practice and part of our backup strategy to have a durable store beyond the OpenSearch index, and Amazon DynamoDB provides scalability and integration with AWS Lambda that simplifies a lot of the process. We use DynamoDB for other non-search services, so this was a natural fit.
  • A Lambda function for publishing data from the source of truth into OpenSearch. Using function aliases helps run multiple configurations of the same Lambda function at the same time and is key to keeping data in sync.

Publishing

The publishing pipeline is driven from events like a user entering keywords or captions, new uploads, or label detection through Amazon Rekognition. These events are processed, combining data from a few other asset stores like Amazon Aurora MySQL Compatible Edition and Amazon Simple Storage Service (Amazon S3), before writing a single item into DynamoDB.

Writing to DynamoDB invokes a Lambda publishing function, through the DynamoDB Streams Kinesis Adapter, that takes a batch of updated items from DynamoDB and indexes them into OpenSearch. There are other benefits to using the DynamoDB Streams Kinesis Adapter such as reducing the number of concurrent Lambdas required.

The publishing Lambda function uses environment variables to determine which OpenSearch domain and index to publish to. A production alias is configured to write to the production OpenSearch domain, triggered off of the DynamoDB table or Kinesis stream.

When testing new configurations or migrating, a migration alias is configured to write to the new OpenSearch domain but use the same trigger as the production alias. This enables dual indexing of data to both OpenSearch Service domains simultaneously.
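The post doesn't include the exact tooling SmugMug uses to wire this up, but as a sketch of the idea, a new alias can point at a published version whose environment variables reference the new domain, and that alias can then be subscribed to the same stream. The function name, environment variable names, and the use of a plain DynamoDB Streams event source mapping below are assumptions for illustration:

# Point $LATEST at the new (migration) domain, snapshot it as a version, and alias it.
aws lambda update-function-configuration \
    --function-name publish-to-search \
    --environment "Variables={OPENSEARCH_ENDPOINT=<new domain endpoint>,OPENSEARCH_INDEX=<new index>}"

aws lambda publish-version --function-name publish-to-search

aws lambda create-alias \
    --function-name publish-to-search \
    --name migration \
    --function-version <version number returned above>

# Subscribe the migration alias to the same DynamoDB stream that drives production.
aws lambda create-event-source-mapping \
    --function-name publish-to-search:migration \
    --event-source-arn <DynamoDB stream ARN> \
    --starting-position LATEST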

Here’s an example of the DynamoDB table schema:

 "Id": 123456,  // partition key
 "Fields": {
  "format": "JPG",
  "height": 1024,
  "width": 1536,
  ...
 },
 "LastUpdated": 1600107934,

The ‘LastUpdated’ value is used as the document version when indexing, allowing OpenSearch to reject any out-of-order updates.
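In OpenSearch terms, this maps to external versioning on the index request. A minimal sketch, assuming a hypothetical index name of photos and the document shown above (authentication omitted for brevity):

curl -XPUT "https://<domain endpoint>/photos/_doc/123456?version=1600107934&version_type=external" \
    -H 'Content-Type: application/json' \
    -d '{"format":"JPG","height":1024,"width":1536}'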

Backfilling

Now that changes are being published to both domains, the new domain (index) needs to be backfilled with historical data. To backfill a newly created index, a combination of Amazon Simple Queue Service (Amazon SQS) and DynamoDB is used. A script populates an SQS queue with messages that contain instructions for parallel scanning a segment of the DynamoDB table.

The SQS queue launches a Lambda function that reads the message instructions, fetches a batch of items from the corresponding segment of the DynamoDB table, and writes them into an OpenSearch index. New messages are written to the SQS queue to keep track of progress through the segment. After the segment completes, no more messages are written to the SQS queue and the process stops itself.

Concurrency is determined by the number of segments, with additional controls provided by Lambda concurrency scaling. SmugMug is able to index more than 1 billion documents per hour on their OpenSearch configuration while incurring zero impact to the production domain.
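Each worker effectively performs a segmented DynamoDB Scan. The equivalent call, expressed with the AWS CLI as a sketch (the segment numbers and page size are illustrative), looks like this:

# Each SQS message maps to a call like this; the Lambda continues from the
# lastEvaluatedKey of the previous page by passing it as the ExclusiveStartKey.
aws dynamodb scan \
    --table-name <table name> \
    --segment 3 \
    --total-segments 16 \
    --max-items 500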

A Node.js AWS SDK based script is used to seed the SQS queue. Here's a snippet of the SQS configuration script's options:

Usage: queue_segments [options]

Options:
--search-endpoint <url>  OpenSearch endpoint url
--sqs-url <url>          SQS queue url
--index <string>         OpenSearch index name
--table <string>         DynamoDB table name
--key-name <string>      DynamoDB table partition key name
--segments <int>         Number of parallel segments

Along with the format of the resulting SQS message:

{
  searchEndpoint: opts.searchEndpoint,
  sqsUrl: opts.sqsUrl,
  table: opts.table,
  keyName: opts.keyName,
  index: opts.index,
  segment: i,
  totalSegments: opts.segments,
  exclusiveStartKey: <lastEvaluatedKey from previous iteration>
}

As each segment is processed, the ‘lastEvaluatedKey’ from the previous iteration is added to the message as the ‘exclusiveStartKey’ for the next iteration.

Mirroring

Last, the mirroring pipeline works by sending each OpenSearch query to an SQS queue in addition to our production domain. The SQS queue launches a Lambda function that replays the query to the replica domain. The search results from these requests are not sent to any user, but they allow us to replicate production load on the OpenSearch domain under test without impact to production systems or customers.

Conclusion

When evaluating a new OpenSearch domain or configuration, the main metrics we are interested in are query latency performance, namely the took latencies (the query execution time reported by OpenSearch) and, most importantly, the latencies for searching. In our move to Graviton-based R6gd instances, we saw about 40 percent lower P50-P99 latencies, along with similar gains in CPU usage compared to i3 instances (ignoring Graviton's lower costs). Another welcome benefit was the more predictable and monitorable JVM memory pressure with the garbage collection changes from the addition of G1GC on R6gd and other new instances.

Using this pipeline, we’re also testing OpenSearch Serverless and finding its best use-cases. We’re excited about that service and fully intend to have an entirely serverless architecture in time. Stay tuned for results.


About the Authors

Lee Shepherd is a SmugMug Staff Software Engineer

Aydn Bekirov is an Amazon Web Services Principal Technical Account Manager

Maintaining a local copy of your data in AWS Local Zones

Post Syndicated from Macey Neff original https://aws.amazon.com/blogs/compute/maintaining-a-local-copy-of-your-data-in-aws-local-zones/

This post is written by Leonardo Solano, Senior Hybrid Cloud SA and Obed Gutierrez, Solutions Architect, Enterprise.

This post covers data replication strategies to back up your data into AWS Local Zones. These strategies include database replication, file-based and object storage replication, and partner solutions for Amazon Elastic Compute Cloud (Amazon EC2).

Customers running workloads in AWS Regions are likely to require a copy of their data in their operational location for either their backup strategy or data residency requirements. To help with these requirements, you can use Local Zones.

Local Zones is an AWS infrastructure deployment that places compute, storage, database, and other select AWS services close to large population and industry centers. With Local Zones, customers can build and deploy workloads to comply with state and local data residency requirements in sectors such as healthcare, financial services, gaming, and government.

Solution overview

This post assumes the database source is Amazon Relational Database Service (Amazon RDS). To back up an Amazon RDS database to Local Zones, there are three options:

  1. AWS Database Migration Service (AWS DMS)
  2. AWS DataSync
  3. Backup to Amazon Simple Storage Service (Amazon S3)


Figure 1. Amazon RDS replication to Local Zones with AWS DMS

To replicate data, AWS DMS needs a source and a target database. The source database should be your existing Amazon RDS database. The target database is placed in an EC2 instance in the Local Zone. A replication job is created in AWS DMS, which maintains the source and target databases in sync. The replicated database in the Local Zone can be accessed through a VPN. Your database administrator can directly connect to the database engine with your preferred tool.

With this architecture, you can maintain a locally accessible copy of your databases, allowing you to comply with regulatory requirements.

Prerequisites

The following prerequisites are required before continuing:

  • An AWS Account with Administrator permissions;
  • Installation of the latest version of AWS Command Line Interface (AWS CLI v2);
  • An Amazon RDS database.

Walkthrough

1. Enabling Local Zones

First, you must enable Local Zones. Make sure that the intended Local Zone is parented to the AWS Region where the environment is running. Edit the commands to match your parameters: group-name refers to your Local Zone group, and region is the Region identifier to use.

aws ec2 modify-availability-zone-group \
  --region us-east-1 \
  --group-name us-east-1-qro-1 \
  --opt-in-status opted-in

If you have an error when calling the ModifyAvailabilityZoneGroup operation, you must sign up for the Local Zone.

After enabling the Local Zone, you must extend the VPC to the Local Zone by creating a subnet in the Local Zone:

aws ec2 create-subnet \
  --region us-east-1 \
  --availability-zone us-east-1-qro-1a \
  --vpc-id vpc-02a3eb6585example \
  --cidr-block my-subnet-cidr

If you need a step-by-step guide, refer to Getting started with AWS Local Zones. Enabling Local Zones is free of charge. Only deployed services in the Local Zone incur billing.

2. Set up your target database

Now that you have the Local Zone enabled with a subnet, set up your target database instance in the Local Zone subnet that you just created.

You can use AWS CLI to launch it as an EC2 instance:

aws ec2 run-instances \
  --region us-east-1 \
  --subnet-id subnet-08fc749671example \
  --instance-type t3.medium \
  --image-id ami-0abcdef123example \
  --security-group-ids sg-0b0384b66dexample \
  --key-name my-key-pair

You can verify that your EC2 instance is running with the following command:

aws ec2 describe-instances --filters "Name=availability-zone,Values=us-east-1-qro-1a" --query "Reservations[].Instances[].InstanceId"

Output:

 $ ["i-0cda255374example"]

Note that not all instance types are available in Local Zones. You can verify it with the following AWS CLI command:

aws ec2 describe-instance-type-offerings --location-type "availability-zone" \
--filters Name=location,Values=us-east-1-qro-1a --region us-east-1

Once you have your instance running in the Local Zone, you can install the database engine matching your source database. Here is an example of how to install MariaDB:

  1. Update all packages to the latest OS version: sudo yum update -y
  2. Install the MariaDB server on your instance (this also creates a systemd service): sudo yum install -y mariadb-server
  3. Enable the service created in the previous step: sudo systemctl enable mariadb
  4. Start the MariaDB service on your Amazon Linux instance: sudo systemctl start mariadb
  5. Set the root user password and improve your DB security: sudo mysql_secure_installation

You can confirm successful installation with these commands:

mysql -h localhost -u root -p
SHOW DATABASES;

3. Configure databases for replication

In order for AWS DMS to replicate ongoing changes, you must use change data capture (CDC), as well as set up your source and target database accordingly before replication:

Source database:

  • Make sure that the binary logs are available to AWS DMS:

 call mysql.rds_set_configuration('binlog retention hours', 24);

  • Set the binlog_format parameter to ROW.
  • Set the binlog_row_image parameter to Full.
  • If you are using a read replica as the source, set the log_slave_updates parameter to TRUE.

For detailed information, refer to Using a MySQL-compatible database as a source for AWS DMS, or sources for your migration if your database engine is different.
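If your source is Amazon RDS for MySQL, the binlog parameters above are set in a custom DB parameter group. The following AWS CLI sketch assumes your instance already uses a custom parameter group whose name you substitute:

aws rds modify-db-parameter-group \
    --db-parameter-group-name <your custom parameter group> \
    --parameters "ParameterName=binlog_format,ParameterValue=ROW,ApplyMethod=immediate" \
                 "ParameterName=binlog_row_image,ParameterValue=full,ApplyMethod=immediate"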

Target database:

  • Create a user for AWS DMS that has read/write privileges to the MySQL-compatible database. To create the necessary privileges, run the following commands.
CREATE USER '<user>'@'%' IDENTIFIED BY '<password>';
GRANT ALTER, CREATE, DROP, INDEX, INSERT, UPDATE, DELETE, SELECT ON <database>.* TO '<user>'@'%';
GRANT ALL PRIVILEGES ON awsdms_control.* TO '<user>'@'%';
  • Disable foreign keys on target tables, by adding the next command in the Extra connection attributes section of the AWS DMS console for your target endpoint.

Initstmt=SET FOREIGN_KEY_CHECKS=0;

  • Set the database parameter local_infile = 1 to enable AWS DMS to load data into the target database.
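On the self-managed MariaDB target, the last setting above can be applied at runtime; a sketch, assuming you connect as root over the VPN to the instance's private IP:

mysql -h <target database private IP> -u root -p -e "SET GLOBAL local_infile = 1;"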

4. Set up AWS DMS

Now that you have the Local Zone enabled with the target database ready and the source database configured, you can set up the AWS DMS replication instance.

Go to AWS DMS in the AWS Management Console, and under Migrate data, select Replication instances, then choose the Create replication instance button:

This opens the Create replication instance page, where you fill in the required parameters:

Note that High Availability is set to Single-AZ because this is a test workload; Multi-AZ is recommended for production workloads.

Refer to the AWS DMS replication instance documentation for details about how to size your replication instance.

Important note

To allow replication, make sure that you set up the replication instance in the VPC that your environment is running, and configure security groups from and to the source and target database.
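If you prefer to create the replication instance from the AWS CLI instead of the console, a sketch looks like the following; the instance class, storage size, subnet group, and security group are placeholders or illustrative values that you should size for your workload:

aws dms create-replication-instance \
    --replication-instance-identifier localzone-replication-instance \
    --replication-instance-class dms.t3.medium \
    --allocated-storage 50 \
    --vpc-security-group-ids <security group ID> \
    --replication-subnet-group-identifier <replication subnet group> \
    --no-multi-az \
    --region us-east-1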

Now you can create the DMS Source and Target endpoints:

5. Set up endpoints

Source endpoint:

In the AWS DMS console, select Endpoints, select the Create endpoint button, and select Source endpoint option. Then, fill the details required:

Make sure you select your RDS instance as the source by selecting the check box, as shown in the preceding figure. Also, provide access to the endpoint database details, such as the user and password.

You can test your endpoint connectivity before creating it, as shown in the following figure:

If your test is successful, then you can select the Create endpoint button.

Target endpoint:

In the same way as the Source in the console, select Endpoints, select the Create endpoint button, and select Target endpoint option, then enter the details required, as shown in the following figure:

In the Access to endpoint database section, select the Provide access information manually option, then add your Local Zone target database connection details as shown below. Note that the Server name value should be the IP address of your target database.

Make sure you go to the bottom of the page and configure Extra connection attributes in the Endpoint settings, as described in the Configure databases for replication section of this post:

Like the source endpoint, you can test your endpoint connection before creating it.

6. Create the replication task

Once the endpoints are ready, you can create the migration task to start the replication. Under the Migrate data section, select Database migration tasks, choose the Create task button, and configure your task:

Select Migrate existing data and replicate ongoing changes in the Migration type parameter.

Enable Task logs under Task Settings. This is recommended because it can help with troubleshooting.

In Table mappings, include the schema you want to replicate to the Local Zone database:
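A minimal selection rule for the table mapping might look like the following JSON, assuming the sample schema is named example, as in the validation step later in this post:

{
  "rules": [
    {
      "rule-type": "selection",
      "rule-id": "1",
      "rule-name": "include-example-schema",
      "object-locator": {
        "schema-name": "example",
        "table-name": "%"
      },
      "rule-action": "include"
    }
  ]
}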

Once you have defined Task Configuration, Task Settings, and Table Mappings, you can proceed to create your database migration task.

This will trigger your migration task. Now wait until the migration task completes successfully.

7. Validate replicated database

After the replication job completes the Full Load, proceed to validate at your target database. Connect to your target database and run the following commands:

USE example;
SHOW TABLES;

As a result, you should see the same tables as in the source database.

MySQL [example]> SHOW TABLES;
+----------------------------+
| Tables_in_example          |
+----------------------------+
| actor                      |
| address                    |
| category                   |
| city                       |
| country                    |
| customer                   |
| customer_list              |
| film                       |
| film_actor                 |
| film_category              |
| film_list                  |
| film_text                  |
| inventory                  |
| language                   |
| nicer_but_slower_film_list |
| payment                    |
| rental                     |
| sales_by_film_category     |
| sales_by_store             |
| staff                      |
| staff_list                 |
| store                      |
+----------------------------+
22 rows in set (0.06 sec)

If you see the same tables as in your source database, then congratulations, you're set! Now you can maintain and navigate a live copy of the database in the Local Zone for data residency purposes.

Clean up

When you have finished this tutorial, you can delete all the resources that have been deployed. You can do this in the Console or by running the following commands in the AWS CLI:

  1. Delete target DB:
    aws ec2 terminate-instances --instance-ids i-abcd1234
  2. Decommission AWS DMS:
    • Replication Task:
      aws dms delete-replication-task --replication-task-arn arn:aws:dms:us-east-1:111111111111:task:K55IUCGBASJS5VHZJIIEXAMPLE
    • Endpoints:
      aws dms delete-endpoint --endpoint-arn arn:aws:dms:us-east-1:111111111111:endpoint:OUJJVXO4XZ4CYTSEG5XEXAMPLE
    • Replication instance:
      aws dms delete-replication-instance --replication-instance-arn arn:aws:dms:us-east-1:111111111111:rep:T3OM7OUB5NM2LCVZF7JEXAMPLE
  3. Delete Local Zone subnet
    aws ec2 delete-subnet --subnet-id subnet-9example

Conclusion

Local Zones is a useful tool for running applications with low latency requirements or data residency regulations. In this post, you learned how to use AWS DMS to seamlessly replicate your data to Local Zones. With this architecture, you can efficiently maintain a local copy of your data in Local Zones and access it securely.

If you are interested in how to automate your workload deployments in Local Zones, make sure you check out this workshop.

Enabling highly available connectivity from on premises to AWS Local Zones

Post Syndicated from Macey Neff original https://aws.amazon.com/blogs/compute/enabling-highly-available-connectivity-from-on-premises-to-aws-local-zones/

This post is written by Leonardo Solano, Senior Hybrid Cloud SA, and Robert Belson, Developer Advocate.

Planning your network topology is a foundational requirement of the reliability pillar of the AWS Well-Architected Framework. REL02-BP02 defines how to provide redundant connectivity between private networks in the cloud and on-premises environments, using AWS Direct Connect for resilient, redundant connections, AWS Site-to-Site VPN, or AWS Direct Connect failing over to AWS Site-to-Site VPN. As more customers use a combination of on-premises environments, Local Zones, and AWS Regions, they have asked for guidance on how to extend this pillar of the AWS Well-Architected Framework to include Local Zones. For example, if you are on an application modernization journey, you may have existing Amazon EKS clusters that have dependencies on persistent on-premises data.

AWS Local Zones enables single-digit millisecond latency to power applications such as real-time gaming, live streaming, augmented and virtual reality (AR/VR), virtual workstations, and more. Local Zones can also help you meet data sovereignty requirements in regulated industries  such as healthcare, financial services, and the public sector. Additionally, enterprises can leverage a hybrid architecture and seamlessly extend their on-premises environment to the cloud using Local Zones. In the example above, you could extend Amazon EKS clusters to include node groups in a Local Zone (or multiple Local Zones) or on premises using AWS Outpost rack.

To provide connectivity between private networks in Local Zones and on-premises environments, customers typically consider Direct Connect or software VPNs available in the AWS Marketplace. This post provides a reference implementation to eliminate single points of failure in connectivity while offering automatic network impairment detection and intelligent failover using both Direct Connect and software VPNs from the AWS Marketplace. Moreover, this solution minimizes latency by ensuring traffic does not hairpin through the parent AWS Region to the Local Zone.

Solution overview

In Local Zones, all architectural patterns based on AWS Direct Connect follow the same architecture as in AWS Regions and can be deployed using the AWS Direct Connect Resiliency Toolkit. As of the date of publication, Local Zones do not support AWS managed Site-to-Site VPN (view latest Local Zones features). Thus, for customers that have access to only a single Direct Connect location or require resiliency beyond a single connection, this post will demonstrate a solution using an AWS Direct Connect failover strategy with a software VPN appliance. You can find a range of third-party software VPN appliances as well as the throughput per VPN tunnel that each offering provides in the AWS Marketplace.

Prerequisites:

To get started, make sure that your account is opted in for Local Zones and configure the following:

  1. Extend a Virtual Private Cloud (VPC) from the Region to the Local Zone, with at least 3 subnets. Use Getting Started with AWS Local Zones as a reference.
    1. Public subnet in Local Zone (public-subnet-1)
    2. Private subnets in Local Zone (private-subnet-1 and private-subnet-2)
    3. Private subnet in the Region (private-subnet-3)
    4. Modify DNS attributes in your VPC, including both “enableDnsSupport” and “enableDnsHostnames”;
  2. Attach an Internet Gateway (IGW) to the VPC;
  3. Attach a Virtual Private Gateway (VGW) to the VPC;
  4. Create an interface VPC endpoint for the EC2 API, attached to private-subnet-3;
  5. Define the following routing tables (RTB):
    1. Private-subnet-1 RTB: enabling propagation for VGW;
    2. Private-subnet-2 RTB: enabling propagation for VGW;
    3. Public-subnet-1 RTB: with a default route with IGW-ID as the next hop;
  6. Configure a Direct Connect Private Virtual Interface (VIF) from your on-premises environment to Local Zones Virtual Gateway’s VPC. For more details see this post: AWS Direct Connect and AWS Local Zones interoperability patterns;
  7. Launch any software VPN appliance from AWS Marketplace on Public-subnet-1. In this blog post on simulating Site-to-Site VPN customer gateways using strongSwan, you can find an example that provides the steps to deploy a third-party software VPN in AWS Region;
  8. Capture the following parameters from your environment:
    1. Software VPN Elastic Network Interface (ENI) ID
    2. Private-subnet-1 RTB ID
    3. Probe IP, which must be an on-premises resource that can respond to Internet Control Message Protocol (ICMP) requests.

High level architecture

This architecture requires a utility Amazon Elastic Compute Cloud (Amazon EC2) instance in a private subnet (private-subnet-2) that sends ICMP probes over the Direct Connect connection. Once the utility instance detects lost packets to the on-premises network from the Local Zone, it initiates a failover by adding a static route with the on-premises CIDR range as the destination and the VPN appliance ENI ID as the next hop in the production private subnet (private-subnet-1), taking priority over the Direct Connect propagated route. Once the Direct Connect path is healthy again, this utility reverts to the default route over the original Direct Connect connection.

On-premises considerations

To add redundancy in the on-premises environment, you can use two routers running a First Hop Redundancy Protocol (FHRP), such as Hot Standby Router Protocol (HSRP) or Virtual Router Redundancy Protocol (VRRP). The router connected to the Direct Connect link has the highest priority, taking the primary role in the FHRP process, while the VPN router remains the secondary router. The FHRP failover relies on interface or protocol state, such as BGP, to trigger the switchover.


Figure 1. High level HA architecture for Software VPN


Figure 2. Failover by modifying the production subnet RTB

Step-by-step deployment

Create an IAM role with permissions to create and delete routes in your private-subnet-1 route table:

  1. Create ec2-role-trust-policy.json file on your local machine:
cat > ec2-role-trust-policy.json <<EOF
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "ec2.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
EOF
  2. Create your EC2 IAM role, such as my_ec2_role:
aws iam create-role --role-name my_ec2_role --assume-role-policy-document file://ec2-role-trust-policy.json
  3. Create a file with the necessary permissions to attach to the EC2 IAM role, and name it ec2-role-iam-policy.json (an example of what this policy file might contain appears after this list).
  4. Create the IAM policy and attach the policy to the IAM role my_ec2_role that you previously created:
aws iam create-policy --policy-name my-ec2-policy --policy-document file://ec2-role-iam-policy.json

aws iam attach-role-policy --policy-arn arn:aws:iam::<account_id>:policy/my-ec2-policy --role-name my_ec2_role
  5. Create an instance profile and attach the IAM role to it:
aws iam create-instance-profile --instance-profile-name my_ec2_instance_profile
aws iam add-role-to-instance-profile --instance-profile-name my_ec2_instance_profile --role-name my_ec2_role
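The contents of ec2-role-iam-policy.json are not shown above. A minimal sketch, assuming the utility instance only needs to create and delete routes as the monitoring script does, might look like the following; for least privilege, scope Resource down to the specific route table ARN:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:CreateRoute",
                "ec2:DeleteRoute",
                "ec2:DescribeRouteTables"
            ],
            "Resource": "*"
        }
    ]
}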

Launch and configure your utility instance

  1. Capture the Amazon Linux 2 AMI ID through CLI:
aws ec2 describe-images --filters "Name=name,Values=amzn2-ami-kernel-5.10-hvm-2.0.20230404.1-x86_64-gp2" | grep ImageId 

Sample output:

            "ImageId": "ami-069aabeee6f53e7bf",

  2. Create an EC2 key pair for the utility instance:
aws ec2 create-key-pair --key-name MyKeyPair --query 'KeyMaterial' --output text > MyKeyPair.pem
  3. Launch the utility instance in the Local Zone (replace the variables with your account and environment parameters):
aws ec2 run-instances --image-id ami-069aabeee6f53e7bf --key-name MyKeyPair --count 1 --instance-type t3.medium --subnet-id <private-subnet-2-id> --iam-instance-profile Name=my_ec2_instance_profile

Deploy failover automation shell script on the utility instance

  1. Create the following shell script in your utility instance (replace the health check variables with your environment values):
cat > vpn_monitoring.sh <<'EOF'
#!/bin/bash
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
# Health Check variables
Wait_Between_Pings=2
RTB_ID=<private-subnet-1-rtb-id>
PROBE_IP=<probe-ip>
remote_cidr=<remote-cidr>
GW_ENI_ID=<software-vpn-eni_id>
Active_path=DX

echo `date` "-- Starting VPN monitor"

while true; do
  # Check health of main path to the remote probe IP
  pingresult=`ping -c 3 -W 1 $PROBE_IP | grep time= | wc -l`
  # Check to see if any of the health checks succeeded
  if [ "$pingresult" == "0" ]; then
    if [ "$Active_path" == "DX" ]; then
      echo `date` "-- Direct Connect failed. Failing over vpn"
      aws ec2 create-route --route-table-id $RTB_ID --destination-cidr-block $remote_cidr --network-interface-id $GW_ENI_ID --region us-east-1
      Active_path=VPN
      DX_tries=10
      echo "probe_ip: unreachable - active_path: vpn"
    else
      echo "probe_ip: unreachable - active_path: vpn"
    fi
  else
    if [ "$Active_path" == "VPN" ]; then
      let DX_tries=DX_tries-1
      if [ "$DX_tries" == "0" ]; then
        echo `date` "-- failing back to Direct Connect"
        aws ec2 delete-route --route-table-id $RTB_ID --destination-cidr-block $remote_cidr --region us-east-1
        Active_path=DX
        echo "probe_ip: reachable - active_path: Direct Connect"
      else
        echo "probe_ip: reachable - active_path: vpn"
      fi
    else
      echo "probe_ip: reachable - active_path: Direct Connect"
    fi
  fi
  # Wait before the next probe cycle
  sleep $Wait_Between_Pings
done
EOF
  2. Modify permissions to your shell script file:
chmod +x vpn_monitoring.sh
  3. Start the shell script:
./vpn_monitoring.sh

Test the environment


Figure 3. Failover process between Direct Connect and software VPN

Simulate failure of the Direct Connect link, breaking the available path from the Local Zone to the on-premises environment. You can simulate the failure using the failure test feature in the Direct Connect console.
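The same BGP failover test can be started from the AWS CLI; a sketch, assuming you substitute your private virtual interface ID:

aws directconnect start-bgp-failover-test \
    --virtual-interface-id <private VIF ID> \
    --test-duration-in-minutes 10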


Figure 4. Bringing BGP session down


Figure 5. Setting the failure time

In the utility instance you will see the following logs:

Thu Sep 21 14:39:34 UTC 2023 -- Direct Connect failed. Failing over vpn

The shell script detects packet loss through ICMP probes against a probe IP destination on premises, triggering the failover process. As a result, it makes an API call (aws ec2 create-route) to AWS through the EC2 interface endpoint.

The script creates a static route in the private-subnet-1-RTB toward the on-premises CIDR, with the VPN appliance's elastic network interface (ENI) ID as the next hop.


Figure 6. private-subnet-1-RTB during the test

The FHRP mechanism detects the failure in the Direct Connect link and reduces the FHRP priority on this path, which triggers the failover to the secondary link through the VPN path.

Once you cancel the test or the test finishes, the failback procedure will revert the private-subnet-1 route table to its initial state, resulting in the following logs to be emitted by the utility instance:

Thu Sep 21 14:42:34 UTC 2023 -- failing back to Direct Connect


Figure 7. private-subnet-1 route table initial state

Cleaning up

To clean up your AWS based resources, run following AWS CLI commands:

aws ec2 terminate-instances --instance-ids <your-utility-instance-id>
aws iam remove-role-from-instance-profile --instance-profile-name my_ec2_instance_profile --role-name my_ec2_role
aws iam delete-instance-profile --instance-profile-name my_ec2_instance_profile
aws iam detach-role-policy --role-name my_ec2_role --policy-arn arn:aws:iam::<account_id>:policy/my-ec2-policy
aws iam delete-role --role-name my_ec2_role

Conclusion

This post demonstrates how to create a failover strategy for Local Zones using the same resilience mechanisms already established in the AWS Regions. By leveraging Direct Connect and software VPNs, you can achieve high availability in scenarios where you are constrained to a single Direct Connect location due to geographical limitations. In the architectural pattern illustrated in this post, the failover strategy relies on a utility instance with least-privileged permissions. The utility instance identifies network impairment and dynamically modifies your production route tables to keep connectivity established from a Local Zone to your on-premises location. The same mechanism automatically fails back from the software VPN to Direct Connect once the utility instance validates that the Direct Connect path is sufficiently reliable to avoid network flapping. To learn more about Local Zones, visit the AWS Local Zones user guide.

Migrate Microsoft Azure Synapse Analytics to Amazon Redshift using AWS SCT

Post Syndicated from Ahmed Shehata original https://aws.amazon.com/blogs/big-data/migrate-microsoft-azure-synapse-analytics-to-amazon-redshift-using-aws-sct/

Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse that provides the flexibility to use provisioned or serverless compute for your analytical workloads. With Amazon Redshift Serverless and Query Editor v2, you can load and query large datasets in just a few clicks and pay only for what you use. The decoupled compute and storage architecture of Amazon Redshift enables you to build highly scalable, resilient, and cost-effective workloads. Many customers migrate their data warehousing workloads to Amazon Redshift and benefit from the rich capabilities it offers, such as the following:

  • Amazon Redshift seamlessly integrates with broader data, analytics, and AI or machine learning (ML) services on AWS, enabling you to choose the right tool for the right job. Modern analytics is much wider than SQL-based data warehousing. With Amazon Redshift, you can build lake house architectures and perform any kind of analytics, such as interactive analytics, operational analytics, big data processing, visual data preparation, predictive analytics, machine learning, and more.
  • You don’t need to worry about workloads such as ETL (extract, transform, and load), dashboards, ad-hoc queries, and so on interfering with each other. You can isolate workloads using data sharing, while using the same underlying datasets.
  • When users run many queries at peak times, compute seamlessly scales within seconds to provide consistent performance at high concurrency. You get 1 hour of free concurrency scaling capacity for 24 hours of usage. This free credit meets the concurrency demand of 97% of the Amazon Redshift customer base.
  • Amazon Redshift is straightforward to use with self-tuning and self-optimizing capabilities. You can get faster insights without spending valuable time managing your data warehouse.
  • Fault tolerance is built in. All data written to Amazon Redshift is automatically and continuously replicated to Amazon Simple Storage Service (Amazon S3). Any hardware failures are automatically replaced.
  • Amazon Redshift is simple to interact with. You can access data with traditional, cloud-native, containerized, serverless web services or event-driven applications. You can also use your favorite business intelligence (BI) and SQL tools to access, analyze, and visualize data in Amazon Redshift.
  • Amazon Redshift ML makes it straightforward for data scientists to create, train, and deploy ML models using familiar SQL. You can also run predictions using SQL.
  • Amazon Redshift provides comprehensive data security at no extra cost. You can set up end-to-end data encryption, configure firewall rules, define granular row-level and column-level security controls on sensitive data, and more.

In this post, we show how to migrate a data warehouse from Microsoft Azure Synapse to Redshift Serverless using AWS Schema Conversion Tool (AWS SCT) and AWS SCT data extraction agents. AWS SCT makes heterogeneous database migrations predictable by automatically converting the source database code and storage objects to a format compatible with the target database. Any objects that can’t be automatically converted are clearly marked so that they can be manually converted to complete the migration. AWS SCT can also scan your application code for embedded SQL statements and convert them.

Solution overview

AWS SCT uses a service account to connect to your Azure Synapse Analytics. First, we create a Redshift database into which Azure Synapse data will be migrated. Next, we create an S3 bucket. Then, we use AWS SCT to convert Azure Synapse schemas and apply them to Amazon Redshift. Finally, to migrate data, we use AWS SCT data extraction agents, which extract data from Azure Synapse, upload it into an S3 bucket, and copy it to Amazon Redshift.

The following diagram illustrates our solution architecture.

This walkthrough covers the following steps:

  1. Create a Redshift Serverless data warehouse.
  2. Create the S3 bucket and folder.
  3. Convert and apply the Azure Synapse schema to Amazon Redshift using AWS SCT:
    1. Connect to the Azure Synapse source.
    2. Connect to the Amazon Redshift target.
    3. Convert the Azure Synapse schema to a Redshift database.
    4. Analyze the assessment report and address the action items.
    5. Apply the converted schema to the target Redshift database.
  4. Migrate data from Azure Synapse to Amazon Redshift using AWS SCT data extraction agents:
    1. Generate trust and key stores (this step is optional).
    2. Install and configure the data extraction agent.
    3. Start the data extraction agent.
    4. Register the data extraction agent.
    5. Add virtual partitions for large tables (this step is optional).
    6. Create a local data migration task.
    7. Start the local data migration task.
  5. View data in Amazon Redshift.

Prerequisites

Before starting this walkthrough, you must have the following prerequisites:

Create a Redshift Serverless data warehouse

In this step, we create a Redshift Serverless data warehouse with a workgroup and namespace. A workgroup is a collection of compute resources and a namespace is a collection of database objects and users. To isolate workloads and manage different resources in Redshift Serverless, you can create namespaces and workgroups and manage storage and compute resources separately.

Follow these steps to create a Redshift Serverless data warehouse with a workgroup and namespace:

  1. On the Amazon Redshift console, choose the AWS Region that you want to use.
  2. In the navigation pane, choose Redshift Serverless.
  3. Choose Create workgroup.

  1. For Workgroup name, enter a name that describes the compute resources.

  1. Verify that the VPC is the same VPC as the EC2 instance that runs AWS SCT.
  2. Choose Next.
  3. For Namespace, enter a name that describes your dataset.
  4. In the Database name and password section, select Customize admin user credentials.
  5. For Admin user name, enter a user name of your choice (for example, awsuser).
  6. For Admin user password, enter a password of your choice (for example, MyRedShiftPW2022).

  1. Choose Next.

Note that data in the Redshift Serverless namespace is encrypted by default.

  1. In the Review and Create section, choose Create.

Now you create an AWS Identity and Access Management (IAM) role and set it as the default on your namespace. Note that there can only be one default IAM role.

  1. On the Redshift Serverless Dashboard, in the Namespaces / Workgroups section, choose the namespace you just created.
  2. On the Security and encryption tab, in the Permissions section, choose Manage IAM roles.
  3. Choose Manage IAM roles and choose Create IAM role.
  4. In the Specify an Amazon S3 bucket for the IAM role to access section, choose one of the following methods:
    1. Choose No additional Amazon S3 bucket to allow the created IAM role to access only the S3 buckets with names containing the word redshift.
    2. Choose Any Amazon S3 bucket to allow the created IAM role to access all S3 buckets.
    3. Choose Specific Amazon S3 buckets to specify one or more S3 buckets for the created IAM role to access. Then choose one or more S3 buckets from the table.
  5. Choose Create IAM role as default.
  6. Capture the endpoint for the Redshift Serverless workgroup you just created.
  7. On the Redshift Serverless Dashboard, in the Namespaces / Workgroups section, choose the workgroup you just created.
  8. In the General information section, copy the endpoint.
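
You can also retrieve the endpoint with the AWS CLI; the workgroup name below is a placeholder for the one you created:

aws redshift-serverless get-workgroup \
  --workgroup-name my-workgroup \
  --query 'workgroup.endpoint.address' \
  --output text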

Create the S3 bucket and folder

During the data migration process, AWS SCT uses Amazon S3 as a staging area for the extracted data. Follow these steps to create an S3 bucket:

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Choose Create bucket.
  3. For Bucket name, enter a unique DNS-compliant name for your bucket (for example, uniquename-as-rs).

For more information about bucket names, refer to Bucket naming rules.

  1. For AWS Region, choose the Region in which you created the Redshift Serverless workgroup.
  2. Choose Create bucket.

  1. Choose Buckets in the navigation pane and navigate to the S3 bucket you just created (uniquename-as-rs).
  2. Choose Create folder.
  3. For Folder name, enter incoming.
  4. Choose Create folder.
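
The same bucket and folder can also be created from the AWS CLI; the bucket name and Region below are placeholders matching the earlier example:

# Create the staging bucket in the same Region as the Redshift Serverless workgroup
aws s3 mb s3://uniquename-as-rs --region us-east-1
# Create the incoming/ folder (in Amazon S3, a folder is a zero-byte object whose key ends with /)
aws s3api put-object --bucket uniquename-as-rs --key incoming/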

Convert and apply the Azure Synapse schema to Amazon Redshift using AWS SCT

To convert the Azure Synapse schema to Amazon Redshift format, we use AWS SCT. Start by logging in to the EC2 instance that you created previously and launch AWS SCT.

Connect to the Azure Synapse source

Complete the following steps to connect to the Azure Synapse source:

  1. On the File menu, choose Create New Project.
  2. Choose a location to store your project files and data.
  3. Provide a meaningful but memorable name for your project (for example, Azure Synapse to Amazon Redshift).
  4. To connect to the Azure Synapse source data warehouse, choose Add source.
  5. Choose Azure Synapse and choose Next.
  6. For Connection name, enter a name (for example, olap-azure-synapse).

AWS SCT displays this name in the object tree in the left pane.

  1. For Server name, enter your Azure Synapse server name.
  2. For SQL pool, enter your Azure Synapse pool name.
  3. Enter a user name and password.
  4. Choose Test connection to verify that AWS SCT can connect to your source Azure Synapse project.
  5. When the connection is successfully validated, choose OK and Connect.

Connect to the Amazon Redshift target

Follow these steps to connect to Amazon Redshift:

  1. In AWS SCT, choose Add target.
  2. Choose Amazon Redshift, then choose Next.
  3. For Connection name, enter a name to describe the Amazon Redshift connection.

AWS SCT displays this name in the object tree in the right pane.

  1. For Server name, enter the Redshift Serverless workgroup endpoint you captured earlier.
  2. For Server port, enter 5439.
  3. For Database, enter dev.
  4. For User name, enter the user name you chose when creating the Redshift Serverless workgroup.
  5. For Password, enter the password you chose when creating the Redshift Serverless workgroup.
  6. Deselect Use AWS Glue.
  7. Choose Test connection to verify that AWS SCT can connect to your target Redshift workgroup.
  8. When the test is successful, choose OK.
  9. Choose Connect to connect to the Amazon Redshift target.

Alternatively, you can use connection values that are stored in AWS Secrets Manager.

Convert the Azure Synapse schema to a Redshift data warehouse

After you create the source and target connections, you will see the source Azure Synapse object tree in the left pane and the target Amazon Redshift object tree in the right pane. We then create mapping rules to describe the source-target pair for the Azure Synapse to Amazon Redshift migration.

Follow these steps to convert the Azure Synapse dataset to Amazon Redshift format:

  1. In the left pane, choose (right-click) the schema you want to convert.
  2. Choose Convert schema.
  3. In the dialog box, choose Yes.

When the conversion is complete, you will see a new schema created in the Amazon Redshift pane (right pane) with the same name as your Azure Synapse schema.

The sample schema we used has three tables; you can see these objects in Amazon Redshift format in the right pane. AWS SCT converts all the Azure Synapse code and data objects to Amazon Redshift format. You can also use AWS SCT to convert external SQL scripts, application code, or additional files with embedded SQL.

Analyze the assessment report and address the action items

AWS SCT creates an assessment report to assess the migration complexity. AWS SCT can convert the majority of code and database objects, but some objects may require manual conversion. AWS SCT highlights these objects in blue in the conversion statistics diagram and creates action items with a complexity attached to them.

To view the assessment report, switch from Main view to Assessment Report view as shown in the following screenshot.

The Summary tab shows objects that were converted automatically and objects that were not converted automatically. Green represents automatically converted objects or objects with simple action items. Blue represents medium and complex action items that require manual intervention.

The Action items tab shows the recommended actions for each conversion issue. If you choose an action item from the list, AWS SCT highlights the object that the action item applies to.

The report also contains recommendations for how to manually convert the schema item. For example, after the assessment runs, detailed reports for the database and schema show you the effort required to design and implement the recommendations for converting action items. For more information about deciding how to handle manual conversions, see Handling manual conversions in AWS SCT. AWS SCT completes some actions automatically while converting the schema to Amazon Redshift; objects with such actions are marked with a red warning sign.

You can evaluate and inspect the individual object DDL by selecting it in the right pane, and you can also edit it as needed. In the following example, AWS SCT modifies the ID column data type from decimal(3,0) in Azure Synapse to the smallint data type in Amazon Redshift.

Apply the converted schema to the target Redshift data warehouse

To apply the converted schema to Amazon Redshift, select the converted schema in the right pane, right-click, and choose Apply to database.

Migrate data from Azure Synapse to Amazon Redshift using AWS SCT data extraction agents

AWS SCT extraction agents extract data from your source database and migrate it to the AWS Cloud. In this section, we configure AWS SCT extraction agents to extract data from Azure Synapse and migrate to Amazon Redshift. For this post, we install the AWS SCT extraction agent on the same Windows instance that has AWS SCT installed. For better performance, we recommend that you use a separate Linux instance to install extraction agents if possible. For very large datasets, AWS SCT supports the use of multiple data extraction agents running on several instances to maximize throughput and increase the speed of data migration.

Generate trust and key stores (optional)

You can use Secure Socket Layer (SSL) encrypted communication with AWS SCT data extractors. When you use SSL, all data passed between the applications remains private and integral. To use SSL communication, you need to generate trust and key stores using AWS SCT. You can skip this step if you don’t want to use SSL. We recommend using SSL for production workloads.

Follow these steps to generate trust and key stores:

  1. In AWS SCT, choose Settings, Global settings, and Security.
  2. Choose Generate trust and key store.

  1. Enter a name and password for the trust and key stores.
  2. Enter a location to store them.
  3. Choose Generate, then choose OK.

Install and configure the data extraction agent

In the installation package for AWS SCT, you can find a subfolder called agents (\aws-schema-conversion-tool-1.0.latest.zip\agents). Locate and install the executable file with a name like aws-schema-conversion-tool-extractor-xxxxxxxx.msi.

In the installation process, follow these steps to configure AWS SCT Data Extractor:

  1. For Service port, enter the port number the agent listens on. It is 8192 by default.
  2. For Working folder, enter the path where the AWS SCT data extraction agent will store the extracted data.

The working folder can be on a different computer from the agent, and a single working folder can be shared by multiple agents on different computers.

  1. For Enter Redshift JDBC driver file or files, enter the location where you downloaded the Redshift JDBC drivers.
  2. For Add the Amazon Redshift driver, enter YES.
  3. For Enable SSL communication, enter yes. Enter No here if you don’t want to use SSL.
  4. Choose Next.

  1. For Trust store path, enter the storage location you specified when creating the trust and key store.
  2. For Trust store password, enter the password for the trust store.
  3. For Enable client SSL authentication, enter yes.
  4. For Key store path, enter the storage location you specified when creating the trust and key store.
  5. For Key store password, enter the password for the key store.
  6. Choose Next.

Start the data extraction agent

Use the following procedure to start extraction agents. Repeat this procedure on each computer that has an extraction agent installed.

Extraction agents act as listeners. When you start an agent with this procedure, the agent starts listening for instructions. You send the agents instructions to extract data from your data warehouse in a later section.

To start the extraction agent, navigate to the AWS SCT Data Extractor Agent directory. For example, in Microsoft Windows, use C:\Program Files\AWS SCT Data Extractor Agent\StartAgent.bat.

On the computer that has the extraction agent installed, run the command for your operating system from a command prompt or terminal window. To stop an agent, run the same command but replace StartAgent with StopAgent. To restart an agent, run the RestartAgent.bat file.

Note that you should have administrator access to run those commands.

Register the data extraction agent

Follow these steps to register the data extraction agent:

  1. In AWS SCT, change the view to Data Migration view, then choose Register.
  2. Select Redshift data agent, then choose OK.

  1. For Description, enter a name to identify the agent.
  2. For Host name, if you installed the extraction agent on the same workstation as AWS SCT, enter 0.0.0.0 to indicate local host. Otherwise, enter the host name of the machine on which the AWS SCT extraction agent is installed. It is recommended to install extraction agents on Linux for better performance.
  3. For Port, enter the number you used for the listening port (default 8192) when installing the AWS SCT extraction agent.
  4. Select Use SSL to encrypt AWS SCT connection to Data Extraction Agent.

  1. If you’re using SSL, navigate to the SSL tab.
  2. For Trust store, choose the trust store you created earlier.
  3. For Key store, choose the key store you created earlier.
  4. Choose Test connection.
  5. After the connection is validated successfully, choose OK and Register.

Create a local data migration task

To migrate data from Azure Synapse Analytics to Amazon Redshift, you create, run, and monitor the local migration task from AWS SCT. This step uses the data extraction agent to migrate data by creating a task.

Follow these steps to create a local data migration task:

  1. In AWS SCT, under the schema name in the left pane, choose (right-click) the table you want to migrate (for this post, we use the table tbl_currency).
  2. Choose Create Local task.

  1. Choose from the following migration modes:
    1. Extract the source data and store it on a local PC or virtual machine where the agent runs.
    2. Extract the data and upload it to an S3 bucket.
    3. Extract the data, upload it to Amazon S3, and copy it into Amazon Redshift. (We choose this option for this post.)

  1. On the Advanced tab, provide the extraction and copy settings.

  1. On the Source server tab, make sure you are using the current connection properties.

  1. On the Amazon S3 settings tab, for Amazon S3 bucket folder, provide the bucket and folder names of the S3 bucket you created earlier.

The AWS SCT data extraction agent uploads the data in those S3 buckets and folders before copying it to Amazon Redshift.

  1. Choose Test Task.

  1. When the task is successfully validated, choose OK, then choose Create.

Start the local data migration task

To start the task, choose Start or Restart on the Tasks tab.

First, the data extraction agent extracts data from Azure Synapse. Then the agent uploads data to Amazon S3 and launches a copy command to move the data to Amazon Redshift.

At this point, AWS SCT has successfully migrated data from the source Azure Synapse table to the Redshift table.

View data in Amazon Redshift

After the data migration task is complete, you can connect to Amazon Redshift and validate the data. Complete the following steps:

  1. On the Amazon Redshift console, navigate to the Query Editor v2.
  2. Open the Redshift Serverless workgroup you created.
  3. Choose Query data.

  1. For Database, enter a name for your database.
  2. For Authentication, select Federated user.
  3. Choose Create connection.

  1. Open a new editor by choosing the plus sign.
  2. In the editor, write a query to select from the schema name and table or view name you want to verify.
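
For example, assuming the sample tbl_currency table was migrated into a schema named after your Azure Synapse schema, a query like the following verifies the data. You can paste the SELECT statement into the editor, or run it through the Amazon Redshift Data API from the AWS CLI (the workgroup name, schema name, and statement ID are placeholders):

aws redshift-data execute-statement \
  --workgroup-name my-workgroup \
  --database dev \
  --sql "SELECT * FROM my_synapse_schema.tbl_currency LIMIT 10;"

# Fetch the results using the Id returned by the previous command
aws redshift-data get-statement-result --id <statement-id>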

You can explore the data, run ad-hoc queries, and make visualizations, charts, and views.

The following screenshot is the view of the source Azure Synapse dataset we used in this post.

Clean up

Follow the steps in this section to clean up any AWS resources you created as part of this post.

Stop the EC2 instance

Follow these steps to stop the EC2 instance:

  1. On the Amazon EC2 console, in the navigation pane, choose Instances.
  2. Select the instance you created.
  3. Choose Instance state, then choose Terminate instance.
  4. Choose Terminate when prompted for confirmation.

Delete the Redshift Serverless workgroup and namespace

Follow these steps to delete the Redshift Serverless workgroup and namespace:

  1. On the Redshift Serverless Dashboard, in the Namespaces / Workgroups section, choose the workgroup you created.
  2. On the Actions menu, choose Delete workgroup.
  3. Select Delete the associated namespace.
  4. Deselect Create final snapshot.
  5. Enter delete in the confirmation text box and choose Delete.

Delete the S3 bucket

Follow these steps to delete the S3 bucket:

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Choose the bucket you created.
  3. Choose Delete.
  4. To confirm deletion, enter the name of the bucket.
  5. Choose Delete bucket.
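
If the bucket still contains migration staging data, you can also remove it from the AWS CLI; the bucket name is the placeholder used earlier, and --force deletes all objects along with the bucket:

aws s3 rb s3://uniquename-as-rs --force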

Conclusion

Migrating a data warehouse can be a challenging, complex, and yet rewarding project. AWS SCT reduces the complexity of data warehouse migrations. This post discussed how a data migration task extracts, downloads, and migrates data from Azure Synapse to Amazon Redshift. The solution we presented performs a one-time migration of database objects and data. Data changes made in Azure Synapse when the migration is in progress won’t be reflected in Amazon Redshift. When data migration is in progress, put your ETL jobs to Azure Synapse on hold or rerun the ETL jobs by pointing to Amazon Redshift after the migration. Consider using the best practices for AWS SCT.

To get started, download and install AWS SCT, sign in to the AWS Management Console, check out Redshift Serverless, and start migrating!


About the Authors

Ahmed Shehata is a Senior Analytics Specialist Solutions Architect at AWS based in Toronto. He has more than two decades of experience helping customers modernize their data platforms. Ahmed is passionate about helping customers build efficient, performant, and scalable analytic solutions.

Jagadish Kumar is a Senior Analytics Specialist Solutions Architect at AWS focused on Amazon Redshift. He is deeply passionate about Data Architecture and helps customers build analytics solutions at scale on AWS.

Anusha Challa is a Senior Analytics Specialist Solutions Architect at AWS focused on Amazon Redshift. She has helped many customers build large-scale data warehouse solutions in the cloud and on premises. Anusha is passionate about data analytics and data science and enabling customers to achieve success with their large-scale data projects.

Run Apache Hive workloads using Spark SQL with Amazon EMR on EKS

Post Syndicated from Amit Maindola original https://aws.amazon.com/blogs/big-data/run-apache-hive-workloads-using-spark-sql-with-amazon-emr-on-eks/

Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics at a massive scale. Using Spark SQL to run Hive workloads provides not only the simplicity of SQL-like queries but also taps into the exceptional speed and performance provided by Spark. Spark SQL is an Apache Spark module for structured data processing. One of its most popular use cases is to read and write Hive tables with connectivity to a persistent Hive metastore, supporting Hive SerDes and user-defined functions.

Starting from version 1.2.0, Apache Spark has supported queries written in HiveQL. HiveQL is a SQL-like language that produces data queries containing business logic, which can be converted to Spark jobs. However, this feature is only supported by YARN or standalone Spark mode. To run HiveQL-based data workloads with Spark on Kubernetes mode, engineers must embed their SQL queries into programmatic code such as PySpark, which requires additional effort to manually change code.

Amazon EMR on Amazon EKS provides a deployment option for Amazon EMR that you can use to run open-source big data frameworks on Amazon Elastic Kubernetes Service (Amazon EKS).

Amazon EMR on EKS releases 6.7.0 and later include the ability to run SparkSQL through the StartJobRun API. As a result of this enhancement, customers can now supply SQL entry-point files and run HiveQL queries directly as Spark jobs on EMR on EKS. The feature is available in all AWS Regions where EMR on EKS is available.

Use case

FINRA is one of the largest Amazon EMR customers that is running SQL-based workloads using the Hive on Spark approach. FINRA, Financial Industry Regulatory Authority, is a private sector regulator responsible for analyzing equities and option trading activity in the US. To look for fraud, market manipulation, insider trading, and abuse, FINRA’s technology group has developed a robust set of big data tools in the AWS Cloud to support these activities.

FINRA centralizes all its data in Amazon Simple Storage Service (Amazon S3) with a remote Hive metastore on Amazon Relational Database Service (Amazon RDS) to manage their metadata information. They use various AWS analytics services, such as Amazon EMR, to enable their analysts and data scientists to apply advanced analytics techniques to interactively develop and test new surveillance patterns and improve investor protection. To make these interactions more efficient and productive, FINRA modernized their hive workloads in Amazon EMR from its legacy Hive on MapReduce to Hive on Spark, which resulted in query performance gains between 50 and 80 percent.

Going forward, FINRA wants to further innovate the interactive big data platform by moving from a monolithic design pattern to a job-centric paradigm, so that it can fulfill future capacity requirements as its business grows. The capability of running Hive workloads using SparkSQL directly with EMR on EKS is one of the key enablers that helps FINRA continuously pursue that goal.

Additionally, EMR on EKS offers the following benefits to accelerate adoption:

  • Fine-grained access controls (IRSA) that are job-centric to harden customers’ security posture
  • Minimized adoption effort as it enables direct Hive query submission as a Spark job without code changes
  • Reduced run costs by consolidating multiple software versions for Hive or Spark, unifying artificial intelligence and machine learning (AI/ML) and extract, transform, and load (ETL) pipelines into a single environment
  • Simplified cluster management through multi-Availability Zone support and highly responsive autoscaling and provisioning
  • Reduced operational overhead by hosting multiple compute and storage types or CPU architectures (x86 & Arm64) in a single configuration
  • Increased application reusability and portability supported by custom docker images, which allows them to encapsulate all necessary dependencies

Running Hive SQL queries on EMR on EKS

Prerequisites

Make sure that you have AWS Command Line Interface (AWS CLI) version 1.25.70 or later installed. If you’re running AWS CLI version 2, you need version 2.7.31 or later. Use the following command to check your AWS CLI version:

aws --version

If necessary, install or update the latest version of the AWS CLI.

Solution overview

To get started, let’s look at the following diagram. It illustrates a high-level architectural design and different services that can be used in the Hive workload. To match FINRA’s use case, we chose an Amazon RDS database as the remote Hive metastore. Alternatively, you can use AWS Glue Data Catalog as the metastore for Hive if needed. For more details, see the aws-samples GitHub project.

The minimum required infrastructure is:

  • An S3 bucket to store a Hive SQL script file
  • An Amazon EKS cluster with EMR on EKS enabled
  • An Amazon RDS for MySQL database in the same virtual private cloud (VPC) as the Amazon EKS cluster
  • A standalone Hive metastore service (HMS) running on the EKS cluster or a small Amazon EMR on EC2 cluster with the Hive application installed

To have a quick start, run the sample CloudFormation deployment. The infrastructure deployment includes the following resources:

Create a Hive script file

Store a few lines of Hive queries in a single file, then upload the file to your S3 bucket, which can be found in your AWS Management Console on the AWS CloudFormation Outputs tab. Search for the key value of CODEBUCKET as shown in the preceding screenshot. For a quick start, you can skip this step and use the sample file stored in s3://<YOUR_S3BUCKET>/app_code/job/set-of-hive-queries.sql. The following is a code snippet from the sample file:

-- drop database in case switch between different hive metastore

DROP DATABASE IF EXISTS hiveonspark CASCADE;
CREATE DATABASE hiveonspark;
USE hiveonspark;

--create hive managed table
DROP TABLE IF EXISTS testtable purge;
CREATE TABLE IF NOT EXISTS testtable (`key` INT, `value` STRING) using hive;
LOAD DATA LOCAL INPATH '/usr/lib/spark/examples/src/main/resources/kv1.txt' INTO TABLE testtable;
SELECT * FROM testtable WHERE key=238;

-- test1: add column
ALTER TABLE testtable ADD COLUMNS (`arrayCol` Array<int>);
-- test2: insert
INSERT INTO testtable VALUES 
(238,'val_238',array(1,3)),
(238,'val_238',array(2,3));
SELECT * FROM testtable WHERE key=238;
-- test3: UDF
CREATE TEMPORARY FUNCTION hiveUDF AS 'org.apache.hadoop.hive.ql.udf.generic.GenericUDTFExplode';
SELECT `key`,`value`,hiveUDF(arrayCol) FROM testtable WHERE key=238;
-- test4: CTAS table with parameter
DROP TABLE IF EXISTS ctas_testtable purge;
CREATE TABLE ctas_testtable 
STORED AS ORC
AS
SELECT * FROM testtable;
SELECT * FROM ctas_testtable WHERE key=${key_ID};
-- test5: External table mapped to S3
CREATE EXTERNAL TABLE IF NOT EXISTS amazonreview
( 
  marketplace string, 
  customer_id string, 
  review_id  string, 
  product_id  string, 
  product_parent  string, 
  product_title  string, 
  star_rating  integer, 
  helpful_votes  integer, 
  total_votes  integer, 
  vine  string, 
  verified_purchase  string, 
  review_headline  string, 
  review_body  string, 
  review_date  date, 
  year  integer
) 
STORED AS PARQUET 
LOCATION 's3://${S3Bucket}/app_code/data/toy/';
SELECT count(*) FROM amazonreview;

Submit the Hive script to EMR on EKS

First, set up the required environment variables. See the shell script post-deployment.sh:

stack_name='HiveEMRonEKS'
export VIRTUAL_CLUSTER_ID=$(aws cloudformation describe-stacks --stack-name $stack_name --query "Stacks[0].Outputs[?OutputKey=='VirtualClusterId'].OutputValue" --output text)
export EMR_ROLE_ARN=$(aws cloudformation describe-stacks --stack-name $stack_name --query "Stacks[0].Outputs[?OutputKey=='EMRExecRoleARN'].OutputValue" --output text)
export S3BUCKET=$(aws cloudformation describe-stacks --stack-name $stack_name --query "Stacks[0].Outputs[?OutputKey=='CODEBUCKET'].OutputValue" --output text)

Connect to the demo EKS cluster:

echo `aws cloudformation describe-stacks --stack-name $stack_name --query "Stacks[0].Outputs[?starts_with(OutputKey,'eksclusterEKSConfig')].OutputValue" --output text` | bash
kubectl get svc

Ensure the entryPoint path is correct, then submit the set-of-hive-queries.sql to EMR on EKS.

aws emr-containers start-job-run \
--virtual-cluster-id $VIRTUAL_CLUSTER_ID \
--name sparksql-test \
--execution-role-arn $EMR_ROLE_ARN \
--release-label emr-6.8.0-latest \
--job-driver '{
  "sparkSqlJobDriver": {
      "entryPoint": "s3://'$S3BUCKET'/app_code/job/set-of-hive-queries.sql",
      "sparkSqlParameters": "-hivevar S3Bucket='$S3BUCKET' -hivevar Key_ID=238"}}' \
--configuration-overrides '{
    "applicationConfiguration": [
      {
        "classification": "spark-defaults", 
        "properties": {
          "spark.sql.warehouse.dir": "s3://'$S3BUCKET'/warehouse/",
          "spark.hive.metastore.uris": "thrift://hive-metastore:9083"
        }
      }
    ], 
    "monitoringConfiguration": {
      "s3MonitoringConfiguration": {"logUri": "s3://'$S3BUCKET'/elasticmapreduce/emr-containers"}}}'

Note that the shell script referenced the set-of-hive-queries.sql Hive script file as an entry point script. It uses the sparkSqlJobDriver attribute, not the usual sparkSubmitJobDriver designed for Spark applications. In the sparkSqlParameters section, we pass in two variables, S3Bucket and key_ID, to the Hive script.

The property "spark.hive.metastore.uris": "thrift://hive-metastore:9083" sets a connection to a Hive Metastore Service (HMS) called hive-metastore, which is running as a Kubernetes service on the demo EKS cluster as shown in the follow screenshot. If you’re running the thrift service on Amazon EMR on EC2, the URI should be thrift://<YOUR_EMR_MASTER_NODE_DNS_NAME>:9083. If you chose AWS Glue Data Catalog as your Hive metastore, replace the entire property with "spark.hadoop.hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory".

Finally, check the job status using the kubectl command line tool: kubectl get po -n emr --watch
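
Alternatively, you can poll the job state with the AWS CLI; the job name matches the one used in the start-job-run command above:

aws emr-containers list-job-runs \
  --virtual-cluster-id $VIRTUAL_CLUSTER_ID \
  --query 'jobRuns[?name==`sparksql-test`].[id,state]' \
  --output table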

Expected output

  1. Go to the Amazon EMR console.
  2. Navigate to Virtual clusters in the side menu, then select the HiveDemo cluster. You can see an entry for the SparkSQL test job.
  3. Choose the Spark UI hyperlink to monitor each query’s duration and status on a web interface.
  4. To query the Amazon RDS based Hive metastore, you need a MySQL client tool installed. To make it easier, the sample CloudFormation template has installed the query tool on the master node of a small Amazon EMR on EC2 cluster.
  5. Find the EMR master node by running the following command:
aws ec2 describe-instances --filter Name=tag:project,Values=$stack_name Name=tag:aws:elasticmapreduce:instance-group-role,Values=MASTER --query 'Reservations[].Instances[].InstanceId[]'

  1. Go to the Amazon EC2 console and connect to the master node through the Session Manager.
  2. Before querying the MySQL RDS database (the Hive metastore), run the following commands on your local machine to get the credentials:
    stack_name='HiveEMRonEKS' 
    export secret_name=$(aws cloudformation describe-stacks --stack-name $stack_name --query "Stacks[0].Outputs[?OutputKey=='HiveSecretName'].OutputValue" --output text) 
    export HOST_NAME=$(aws secretsmanager get-secret-value --secret-id $secret_name --query SecretString --output text | jq -r '.host')
    export PASSWORD=$(aws secretsmanager get-secret-value --secret-id $secret_name --query SecretString --output text | jq -r '.password')
    export DB_NAME=$(aws secretsmanager get-secret-value --secret-id $secret_name --query SecretString --output text | jq -r '.dbname')
    export USER_NAME=$(aws secretsmanager get-secret-value --secret-id $secret_name --query SecretString --output text | jq -r '.username')
    echo -e "\n host: $HOST_NAME\n DB: $DB_NAME\n password: $PASSWORD\n username: $USER_NAME\n"
    

  3. After connecting through Session Manager, query the Hive metastore from your Amazon EMR master node.
    mysql -u admin -P 3306 -p -h <YOUR_HOST_NAME>
    Enter password:<YOUR_PASSWORD>
    
    # Query the metastore
    MySQL[(none)]> Use HiveEMRonEKS;
    MySQL[HiveEMRonEKS]> select * from DBS;
    MySQL[HiveEMRonEKS]> select * from TBLS;
    MySQL[HiveEMRonEKS]> exit;

  4. Validate the Hive tables (created by set-of-hive-queries.sql) through the interactive Hive CLI tool.

Note: Your query environment must have the Hive client tool installed and a connection to your Hive metastore or AWS Glue Data Catalog. For testing purposes, you can connect to the same Amazon EMR on EC2 master node and query your Hive tables. The EMR cluster has been preconfigured with the required setup.

sudo su
hive
hive> show databases;
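
Assuming the sample set-of-hive-queries.sql script ran unmodified, you can continue in the same Hive session with a few quick checks; the database and table names below come from that script:

hive> use hiveonspark;
hive> show tables;
hive> select count(*) from testtable;
hive> quit;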

Clean up

To avoid incurring future charges, delete the resources generated if you don’t need the solution anymore. Run the following cleanup script.

curl https://raw.githubusercontent.com/aws-samples/hive-emr-on-eks/main/deployment/app_code/delete_all.sh | bash

Go to the CloudFormation console and manually delete the remaining resources if needed.

Conclusion

Amazon EMR on EKS releases 6.7.0 and higher include a Spark SQL job driver so that you can directly run Spark SQL scripts via the StartJobRun API. Without any modifications to your existing Hive scripts, you can directly execute them as a SparkSQL job on Amazon EMR on EKS.

FINRA is one of the largest Amazon EMR customers. It runs over 400 Hive clusters for its analysts who need to interactively query multi-petabyte data sets. Modernizing its Hive workloads with SparkSQL gives FINRA a 50 to 80 percent query performance improvement. The support to run Spark SQL through the StartJobRun API in EMR on EKS has further enabled FINRA’s innovation in data analytics.

In this post, we demonstrated how to submit a Hive script to Amazon EMR on EKS and run it as a SparkSQL job. We encourage you to give it a try and are keen to hear your feedback and suggestions.


About the authors

Amit Maindola is a Senior Data Architect focused on big data and analytics at Amazon Web Services. He helps customers in their digital transformation journey and enables them to build highly scalable, robust, and secure cloud-based analytical solutions on AWS to gain timely insights and make critical business decisions.

Melody Yang is a Senior Big Data Solutions Architect for Amazon EMR at AWS. She is an experienced analytics leader working with AWS customers to provide best practice guidance and technical advice in order to assist their success in data transformation. Her areas of interests are open-source frameworks and automation, data engineering, and DataOps.

Unleash the power of Snapshot Management to take automated snapshots using Amazon OpenSearch Service

Post Syndicated from Prashant Agrawal original https://aws.amazon.com/blogs/big-data/unleash-the-power-of-snapshot-management-to-take-automated-snapshots-using-amazon-opensearch-service/

Data is the lifeblood of any organization, and the importance of protecting it cannot be overstated. Starting with OpenSearch v2.5 in Amazon OpenSearch Service, we introduced Snapshot Management, which automates the process of taking snapshots of your domain. Snapshot Management helps you create point-in-time backups of your domain using OpenSearch Dashboards, including both data and configuration settings (for visualizations and dashboards). You can use these snapshots to restore your cluster to a specific state, recover from potential failures, and even clone environments for testing or development purposes.

Before this release, to automate the process of taking snapshots, you needed to use the snapshot action of OpenSearch’s Index State Management (ISM) feature. With ISM, you could only back up a particular index. Automating backup for multiple indexes required you to write custom scripts or use external management tools. With Snapshot Management, you can automate snapshotting across multiple indexes to safeguard your data and ensure its durability and recoverability.

In this post, we share how to use Snapshot Management to take automated snapshots using OpenSearch Service.

Solution overview

We demonstrate the following high-level steps:

  1. Register a snapshot repository in OpenSearch Service (a one-time process).
  2. Configure a sample ISM policy to migrate the indexes from hot storage to the UltraWarm storage tier after the indexes meet a specific condition.
  3. Create a Snapshot Management policy to take an automated snapshot for all indexes present across different storage tiers within a domain.

As of this writing, Snapshot Management doesn’t support single snapshot creation for all indexes present across different storage tiers within OpenSearch Service. For example, if you try to create a snapshot on multiple indexes with * and some indexes are in the warm tier, the snapshot creation will fail.

To overcome this limitation, you can use index aliases, with one index alias for each type of storage tier. For example, every new index created in the cluster will belong to the hot alias. When the index is moved to the UltraWarm tier via ISM, the alias for the index will be modified to warm, and the index will be removed from the hot alias.

Register a manual snapshot repository

To register a manual snapshot repository, you must create and configure an Amazon Simple Storage Service (Amazon S3) bucket and AWS Identity and Access Management (IAM) roles. For more information, refer to Prerequisites. Complete the following steps:

  1. Create an S3 bucket to store snapshots for your OpenSearch Service domain.
  2. Create an IAM role called SnapshotRole with the following IAM policy to delegate permissions to OpenSearch Service (provide the name of your S3 bucket):
{
    "Version": "2012-10-17",
    "Statement": [{
        "Action": ["s3:ListBucket"],
        "Effect": "Allow",
        "Resource": ["arn:aws:s3:::<s3-bucket-name>"]
    }, {
        "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
        "Effect": "Allow",
        "Resource": ["arn:aws:s3:::<s3-bucket-name>/*"]
    }]
}
  1. Set the trust relationship for SnapshotRole as follows:
{
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "",
        "Effect": "Allow",
        "Principal": {
            "Service": "es.amazonaws.com"
        },
        "Action": "sts:AssumeRole"
    }]
}
  1. Create a new IAM role called RegisterSnapshotRepo, which delegates iam:PassRole and es:ESHttpPut (provide your AWS account and domain name):
{
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "iam:PassRole",
        "Resource": "arn:aws:iam::<aws account id>:role/SnapshotRole"
    }, {
        "Effect": "Allow",
        "Action": "es:ESHttpPut",
        "Resource": "arn:aws:es:region:<aws account id>:domain/<domain-name>/*"
    }]
}
  1. If you have enabled fine-grained access control for your domain, map the snapshot role manage_snapshots to your RegisterSnapshotRepo IAM role in OpenSearch Service.
  2. Now you can use Python code like the following example to register the S3 bucket you created as a snapshot repository for your domain. Provide your host name, Region, snapshot repo name, and S3 bucket. Replace "arn:aws:iam::123456789012:role/SnapshotRole" with the ARN of your SnapshotRole. The Boto3 session should use the RegisterSnapshotRepo IAM role.
import boto3
import requests
from requests_aws4auth import AWS4Auth
 
host = '<host>' # domain endpoint with trailing /
region = '<region>' # e.g. us-west-1
service = 'es'
credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key, region, service, session_token=credentials.token)
 
# To Register the repository

path = '_snapshot/<snapshot-repo-name>'
url = host + path
 
payload = {
  "type": "s3",
  "settings": {
    "bucket": "<s3-bucket-name>",
    "region": "<region>",
    "role_arn": "arn:aws:iam::123456789012:role/SnapshotRole"
  }
}
 
headers = {"Content-Type": "application/json"}
 
r = requests.put(url, auth=awsauth, json=payload, headers=headers, timeout=300)
 
print(r.status_code)
print(r.text)

The S3 bucket used as a repository must be in the same Region as your domain. You can run the preceding code in any compute instance that has connectivity to your OpenSearch Service domain, such as Amazon Elastic Compute Cloud (Amazon EC2), AWS Cloud9, or AWS Lambda. If your domain is within a VPC, then the compute instance should be running inside the same VPC or have connectivity to the VPC.

If you have to register the snapshot repository from a local machine, replace the following lines in the preceding code:

credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key, region, service, session_token=credentials.token)

Replace this code with the following and assume the RegisterSnapshotRepo role (make sure the IAM entity that you are using has appropriate permission to assume the RegisterSnapshotRepo role). Modify "arn:aws:iam::123456789012:role/RegisterSnapshotRepo" with the ARN of your RegisterSnapshotRepo role.

sts = boto3.Session().client("sts")
response = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/RegisterSnapshotRepo",
    RoleSessionName="Snapshot-Session"
)

awsauth = AWS4Auth(response['Credentials']['AccessKeyId'], response['Credentials']['SecretAccessKey'], region, service, session_token=response['Credentials']['SessionToken'])

Upon successfully running the sample code, you should receive output such as the following:

200
{"acknowledged":true}

You can also verify that you have successfully registered the snapshot repository by accessing OpenSearch Dashboards in your domain and navigating to Snapshot Management, Repositories.
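
If your domain uses fine-grained access control with an internal user database, you can also confirm the repository from the command line; the endpoint, credentials, and repository name below are placeholders:

curl -s -u 'master-user:master-password' \
  "https://<your-domain-endpoint>/_snapshot/<snapshot-repo-name>?pretty"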

Create an index template

Index templates enable you to automatically assign configuration when new indexes are created that match a wildcard index pattern.

  1. Navigate to your domain’s OpenSearch Dashboards and choose the Dev Tools tab.
  2. Enter the following text in the left pane and choose the play icon to run it.

The index template applies to newly created indexes that match the pattern of "log*". These indexes are attached to the alias named hot and the replica count is set to 1.

PUT _index_template/snapshot_template
{
  "index_patterns" : ["log*"],
  "template": {
      "settings": {
      "number_of_replicas": 1
    },
    "aliases" : {
        "hot" : {}
    }
  }
}

Note that in the preceding example, you assign the template to indexes that match "log*". You can modify index_patterns for your use case.

Create an ISM policy

In this step, you create the ISM policy that updates the alias of an index before it is migrated to UltraWarm. The policy also performs hot to warm migration. This is to overcome the limitations where a snapshot can’t be taken across two storage tiers (hot and UltraWarm). Complete the following steps:

  1. Navigate to the Dev Tools page of OpenSearch Dashboards.
  2. Create the ISM policy using the following command (modify the values of index_patterns and min_index_age accordingly):
PUT /_plugins/_ism/policies/alias_policy
{
    "policy": {
        "policy_id": "alias_policy",
        "description": "Example Policy for changing the alias and performing the warm migration",
        "default_state": "hot_alias",
        "states": [{
            "name": "hot_alias",
            "actions": [],
            "transitions": [{
                "state_name": "warm",
                "conditions": {
                    "min_index_age": "30d"
                }
            }]
        }, {
            "name": "warm",
            "actions": [{
                "alias": {
                    "actions": [{
                        "remove": {
                            "aliases": ["hot"]
                        }
                    }, {
                        "add": {
                            "aliases": ["warm"]
                        }
                    }]
                }
            }, {
                "retry": {
                    "count": 5,
                    "backoff": "exponential",
                    "delay": "1h"
                },
                "warm_migration": {}
            }],
            "transitions": []
        }],
        "ism_template": [{
            "index_patterns": ["log*"],
            "priority": 100
        }]
    }
}

Create a snapshot policy

In this step, you create a snapshot policy, which takes a snapshot for all the indexes aliased as hot and stores them to your repository at the scheduled time in the cron expression (midnight UTC). Complete the following steps:

  1. Navigate to the Dev Tools page of OpenSearch Dashboards.
  2. Create the snapshot policy using the following command (modify the value of snapshot-repo-name to the name of the snapshot repository you registered previously):
POST _plugins/_sm/policies/daily-snapshot-for-manual-repo
{
    "description": "Policy for Daily Snapshot in the Manual repo",
    "creation": {
      "schedule": {
        "cron": {
          "expression": "0 0 * * *",
          "timezone": "UTC"
        }
      }
    },
    "deletion": {
      "schedule": {
        "cron": {
          "expression": "0 1 * * *",
          "timezone": "UTC"
        }
      },
      "condition": {
        "min_count": 1,
        "max_count": 40
      }
    },
    "snapshot_config": {
      "indices": "hot",
      "repository": "snapshot-repo-name"
    }
}
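
After the policy has run at least once, you can confirm it and list the snapshots it created; as before, the endpoint, credentials, and repository name are placeholders and assume fine-grained access control with an internal user database:

# Show the snapshot policy you just created
curl -s -u 'master-user:master-password' \
  "https://<your-domain-endpoint>/_plugins/_sm/policies/daily-snapshot-for-manual-repo?pretty"

# List the snapshots stored in the registered repository
curl -s -u 'master-user:master-password' \
  "https://<your-domain-endpoint>/_cat/snapshots/<snapshot-repo-name>?v"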

Clean up

Snapshots that you create incur cost in the S3 bucket used as the repository. With the new Snapshot Management feature, you can list and delete unwanted snapshots directly from OpenSearch Dashboards; you can also delete the snapshot policy and the ISM policy created in this post to stop further automated snapshots and migrations.

Conclusion

With the new Snapshot Management capabilities of OpenSearch Service, you can create regular backups and ensure the availability of your data even in the event of unexpected events or disasters. In this post, we discussed essential concepts such as snapshot repositories, automated snapshot lifecycle policies, and Snapshot Management options, enabling you to make informed decisions when it comes to managing your data backups effectively. As you continue to explore and harness the potential of OpenSearch Service, incorporating Snapshot Management into your data protection strategy will undoubtedly provide you with the resilience and reliability needed to ensure business continuity.

If you have feedback about this post, share it in the comments section. If you have questions about this post, start a new thread on the Amazon OpenSearch Service forum or contact AWS Support.


About the authors

Prashant Agrawal is a Sr. Search Specialist Solutions Architect with Amazon OpenSearch Service. He works closely with customers to help them migrate their workloads to the cloud and helps existing customers fine-tune their clusters to achieve better performance and save on cost. Before joining AWS, he helped various customers use OpenSearch and Elasticsearch for their search and log analytics use cases. When not working, you can find him traveling and exploring new places. In short, he likes doing Eat → Travel → Repeat.

Hendy Wijaya is a Senior OpenSearch Specialist Solutions Architect at Amazon Web Services. Hendy enables customers to leverage AWS services to achieve their business objectives and gain competitive advantages. He is passionate about collaborating with customers to get the best out of OpenSearch and Amazon OpenSearch Service.

Utkarsh Agarwal is a Cloud Support Engineer in the Support Engineering team at Amazon Web Services. He specializes in Amazon OpenSearch Service. He provides guidance and technical assistance to customers, enabling them to build scalable, highly available, and secure solutions in the AWS Cloud. In his free time, he enjoys watching movies, TV series, and of course cricket! Lately, he is also attempting to master the art of cooking – the taste buds are excited, but the kitchen might disagree.

Training machine learning models on premises for data residency with AWS Outposts rack

Post Syndicated from Macey Neff original https://aws.amazon.com/blogs/compute/training-machine-learning-models-on-premises-for-data-residency-with-aws-outposts-rack/

This post is written by Sumit Menaria, Senior Hybrid Solutions Architect, and Boris Alexandrov, Senior Product Manager-Tech. 

In this post, you will learn how to train machine learning (ML) models on premises using AWS Outposts rack and datasets stored locally in Amazon S3 on Outposts. With the rise in data sovereignty and privacy regulations, organizations are seeking flexible solutions that balance compliance with the agility of cloud services. Healthcare and financial sectors, for instance, harness machine learning for enhanced patient care and transaction safety, all while upholding strict confidentiality. Outposts rack provides a seamless hybrid solution by extending AWS capabilities to any on-premises or edge location, giving you the flexibility to store and process data wherever you choose. Data sovereignty regulations are highly nuanced and vary by country. This blog post addresses data sovereignty scenarios where training datasets need to be stored and processed in a geographic location without an AWS Region.

Amazon S3 on Outposts

As you prepare datasets for ML model training, a key component to consider is the storage and retrieval of your data, especially when adhering to data residency and regulatory requirements.

You can store training datasets as object data in local buckets with Amazon S3 on Outposts. In order to access S3 on Outposts buckets for data operations, you need to create access points and route the requests via an S3 on Outposts endpoint associated with your VPC. These endpoints are accessible both from within the VPC as well as on premises via the local gateway.

S3 on Outposts connectivity options

Solution overview

Using this sample architecture, you are going to train a YOLOv5 model on a subset of categories of the Common Objects in Context (COCO) dataset. The COCO dataset is a popular choice for object detection tasks offering a wide variety of image categories with rich annotations. It is also available under the AWS Open Data Sponsorship Program via fast.ai datasets.

Architecture for ML training on Outposts rack

This example is based on an architecture using an Amazon Elastic Compute Cloud (Amazon EC2) g4dn.8xlarge instance for model training on the Outposts rack. Depending on your Outposts rack compute configuration, you can use different instance sizes or types and make adjustments to training parameters, such as learning rate, augmentation, or model architecture accordingly. You will be using the AWS Deep Learning AMI to launch your EC2 instance, which comes with frameworks, dependencies, and tools to accelerate deep learning in the cloud.

For the training dataset storage, you are going to use an S3 on Outposts bucket and connect to it from your on-premises location via the Outposts local gateway. The local gateway routing mode can be direct VPC routing or Customer-owned IP (CoIP) depending on your workload’s requirements. Your local gateway routing mode will determine the S3 on Outposts endpoint configuration that you need to use.

1. Download and populate training dataset

You can download the training dataset to your local client machine using the following AWS CLI command:

aws s3 sync s3://fast-ai-coco/ .

After downloading, unzip annotations_trainval2017.zip, val2017.zip and train2017.zip files.

$ unzip annotations_trainval2017.zip
$ unzip val2017.zip
$ unzip train2017.zip

In the annotations folder, the files which you need to use are instances_train2017.json and instances_val2017.json, which contain the annotations corresponding to the images in the training and validation folders.

2. Filtering and preparing training dataset

You are going to use the training, validation, and annotation files from the COCO dataset. The dataset contains over 100K images across 80 categories, but to keep the training simple, you can focus on 10 specific categories of popular food items on supermarket shelves: banana, apple, sandwich, orange, broccoli, carrot, hot dog, pizza, donut, and cake. (Because who doesn’t like a bite after a model training.) Applications for training such models could be self-stock monitoring, automatic checkouts, or product placement optimization using computer vision in retail stores. Because YOLOv5 uses its own annotation (label) format, you need to convert the COCO annotations to that target format.

3. Load training dataset to S3 on Outposts bucket

To load the training data on S3 on Outposts, first create a new bucket using the AWS Management Console or the AWS CLI, along with an access point and an endpoint for the VPC. You can then use the bucket-style access point alias to load the data with the following CLI command:

$ cd /your/local/target/upload/path/
$ aws s3 sync . s3://trainingdata-o0a2b3c4d5e6d7f8g9h10f--op-s3

Replace the alias in the preceding CLI command with the corresponding bucket alias name for your environment. The s3 sync command uploads the folders while preserving the structure of the images and labels for the training and validation data, which you will later load onto the EC2 instance for model training.

4. Launch the EC2 instance

You can launch the EC2 instance with the Deep Learning AMI based on this getting started tutorial. For this exercise, the Deep Learning AMI GPU PyTorch 2.0.1 (Ubuntu 20.04) has been used.

5. Download YOLOv5 and install dependencies

Once you ssh into the EC2 instance, activate the pre-configured PyTorch environment and clone the YOLOv5 repository.

$ ssh -i /path/key-pair-name.pem ubuntu@instance-ip-address
$ conda activate pytorch
$ git clone https://github.com/ultralytics/yolov5.git
$ cd yolov5

Then, install its necessary dependencies.

$ pip install -U -r requirements.txt

To ensure compatibility between packages, you may need to modify existing packages on your instance running the AWS Deep Learning AMI.

6. Load the training dataset from S3 on Outposts to the EC2 instance

For copying the training dataset to the EC2 instance, use the s3 sync CLI command and point it to your local workspace.

aws s3 sync s3://trainingdata-o0a2b3c4d5e6d7f8g9h10f--op-s3 .

7. Prepare the configuration files

Create the data configuration files to reflect your dataset’s structure, categories, and other parameters.
data.yml

train: /your/ec2/path/to/data/images/train 
val: /your/ec2/path/to/data/images/val 
nc: 10 # Number of classes in your dataset 
names: ['banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake']

Create the model training parameter file using the sample configuration file from the YOLOv5 repository. You will need to update the number of classes to 10, but you can also change other parameters as you fine-tune the model for performance.

parameters.yml:

# Parameters
nc: 10 # number of classes in your dataset
depth_multiple: 0.33 # model depth multiple
width_multiple: 0.50 # layer channel multiple
anchors:
- [10,13, 16,30, 33,23] # P3/8
- [30,61, 62,45, 59,119] # P4/16
- [116,90, 156,198, 373,326] # P5/32

# Backbone
backbone:
[[-1, 1, Conv, [64, 6, 2, 2]], # 0-P1/2
[-1, 1, Conv, [128, 3, 2]], # 1-P2/4
[-1, 3, C3, [128]],
[-1, 1, Conv, [256, 3, 2]], # 3-P3/8
[-1, 6, C3, [256]],
[-1, 1, Conv, [512, 3, 2]], # 5-P4/16
[-1, 9, C3, [512]],
[-1, 1, Conv, [1024, 3, 2]], # 7-P5/32
[-1, 3, C3, [1024]],
[-1, 1, SPPF, [1024, 5]], # 9
]

# Head
head:
[[-1, 1, Conv, [512, 1, 1]],
[-1, 1, nn.Upsample, [None, 2, 'nearest']],
[[-1, 6], 1, Concat, [1]], # cat backbone P4
[-1, 3, C3, [512, False]], # 13

[-1, 1, Conv, [256, 1, 1]],
[-1, 1, nn.Upsample, [None, 2, 'nearest']],
[[-1, 4], 1, Concat, [1]], # cat backbone P3
[-1, 3, C3, [256, False]], # 17 (P3/8-small)

[-1, 1, Conv, [256, 3, 2]],
[[-1, 14], 1, Concat, [1]], # cat head P4
[-1, 3, C3, [512, False]], # 20 (P4/16-medium)

[-1, 1, Conv, [512, 3, 2]],
[[-1, 10], 1, Concat, [1]], # cat head P5
[-1, 3, C3, [1024, False]], # 23 (P5/32-large)

[[17, 20, 23], 1, Detect, [nc, anchors]], # Detect(P3, P4, P5)
]

At this stage, the directory structure should look like below:

Directory tree showing training dataset and model configuration structure

8. Train the model

You can run the following command to train the model. The batch size and number of epochs can vary depending on your vCPU and GPU configuration, and you can further modify these values or add weights as you run additional rounds of training.

$ python3 train.py --img-size 640 --batch-size 32 --epochs 50 --data /your/path/to/configuration_files/data.yml --cfg /your/path/to/configuration_files/parameters.yml

You can monitor the model performance as it iterates through each epoch.

Starting training for 50 epochs...

Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
0/49 6.7G 0.08403 0.05 0.04359 129 640: 100%|██████████| 455/455 [06:14<00:00,
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 9/9 [00:05<0
all 575 2114 0.216 0.155 0.0995 0.0338

Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
1/49 8.95G 0.07131 0.05091 0.02365 179 640: 100%|██████████| 455/455 [06:00<00:00,
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 9/9 [00:04<00:00, 1.97it/s]
all 575 2114 0.242 0.144 0.11 0.04

Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
2/49 8.96G 0.07068 0.05331 0.02712 154 640: 100%|██████████| 455/455 [06:01<00:00, 1.26it/s]
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 9/9 [00:04<00:00, 2.23it/s]
all 575 2114 0.185 0.124 0.0732 0.0273

Once the model training finishes, you can see the validation results against the validation dataset and evaluate the model’s performance using standard metrics.

Validating runs/train/exp/weights/best.pt...
Fusing layers... 
YOLOv5 summary: 157 layers, 7037095 parameters, 0 gradients, 15.8 GFLOPs
                 Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|██████████| 9/9 [00:06<00:00,  1.48it/s]
                   all        575       2114      0.282      0.222       0.16     0.0653
                banana        575        280      0.189      0.143     0.0759      0.024
                 apple        575        186      0.206      0.085     0.0418     0.0151
              sandwich        575        146      0.368      0.404      0.343      0.146
                orange        575        188      0.265      0.149     0.0863     0.0362
              broccoli        575        226      0.239      0.226      0.138     0.0417
                carrot        575        310      0.182      0.203     0.0971     0.0267
               hot dog        575        108      0.242      0.111     0.0929     0.0311
                 pizza        575        208      0.405      0.418      0.333       0.15
                 donut        575        228      0.352      0.241       0.19     0.0973
                  cake        575        234      0.369      0.235      0.203     0.0853
Results saved to runs/train/exp

Use the model for inference

To test the model’s performance, pass it a new image of a supermarket shelf that contains some of the objects you trained the model on.

Sample inference image with 1 cake, 6 oranges, and 4 apples

(pytorch) ubuntu@ip-172-31-48-165:~/workspace/source/yolov5$ python3 detect.py --weights /home/ubuntu/workspace/source/yolov5/runs/train/exp/weights/best.pt --source /home/ubuntu/workspace/inference/Inference-image.jpg
<<omitted output>>
Fusing layers...
YOLOv5 summary: 157 layers, 7037095 parameters, 0 gradients, 15.8 GFLOPs
image 1/1 /home/ubuntu/workspace/inference/Inference-image.jpg: 640x640 4 apples, 6 oranges, 1 cake, 5.3ms
Speed: 0.6ms pre-process, 5.3ms inference, 1.1ms NMS per image at shape (1, 3, 640, 640)
Results saved to runs/detect/exp7

The response from the preceding model inference indicates that it predicted 4 apples, 6 oranges, and 1 cake in the image. The prediction may differ based on the image type used, and while a single sample image can give you a sense of the model’s performance, it will not provide a comprehensive understanding. For a more complete evaluation, it’s always recommended to test the model on a larger and more diverse set of validation images. Additional training and tuning of your parameters or datasets may be required to achieve better prediction.

Clean Up

After you have successfully trained and tested the model, you can clean up the resources used in this tutorial, such as the EC2 instance and the S3 on Outposts bucket, access point, and endpoint.

Conclusion

The seamless integration of compute on AWS Outposts with S3 on Outposts, coupled with on-premises ML model training capabilities, offers organizations a robust solution for tackling data residency requirements. By setting up this environment, you can ensure that your datasets remain within the desired geographies while still utilizing advanced machine learning models and cloud infrastructure. In addition, it remains essential to diligently review and fine-tune your implementation strategies and the guardrails you have in place to ensure your data remains within the boundaries of your regulatory requirements. You can read more about architecting for data residency in this blog post.


Accelerate your data warehouse migration to Amazon Redshift – Part 7

Post Syndicated from Mykhailo Kondak original https://aws.amazon.com/blogs/big-data/accelerate-your-data-warehouse-migration-to-amazon-redshift-part-7/

Tens of thousands of customers use Amazon Redshift to gain business insights from their data. With Amazon Redshift, you can use standard SQL to query data across your data warehouse, operational data stores, and data lake. You can also integrate other AWS services such as Amazon EMR, Amazon Athena, Amazon SageMaker, AWS Glue, AWS Lake Formation, and Amazon Kinesis to use all the analytic capabilities in AWS.

Migrating a data warehouse can be complex. You have to migrate terabytes or petabytes of data from your legacy system while not disrupting your production workload. You also need to ensure that the new target data warehouse is consistent with upstream data changes so that business reporting can continue uninterrupted when you cut over to the new platform.

Previously, there were two main strategies to maintain data consistency after the initial bulk load during a migration to Amazon Redshift. You could identify the changed rows, perhaps using a filter on update timestamps, or you could modify your extract, transform, and load (ETL) process to write to both the source and target databases. Both of these options require manual effort to implement and increase the cost and risk of the migration project.

AWS Schema Conversion Tool (AWS SCT) can help you with the initial bulk load from Azure Synapse Analytics, BigQuery, Greenplum Database, IBM Netezza, Microsoft SQL Server, Oracle, Snowflake, Teradata, and Vertica. Now, we’re happy to share that AWS SCT has automated maintaining data consistency for you. If you’re migrating from an IBM Netezza data warehouse to Amazon Redshift, the AWS SCT data extractors will automatically capture changes from the source and apply them on the target. You configure a change data capture (CDC) migration task in AWS SCT, and it extracts the relevant data changes from IBM Netezza and applies them in a transactionally consistent order on Amazon Redshift. You only need to configure the required resources on IBM Netezza and start the data migration; the source database remains fully operational during the migration and replication.

In this post, we describe at a high-level how CDC tasks work in AWS SCT. Then we deep dive into an example of how to configure, start, and manage a CDC migration task. We look briefly at performance and how you can tune a CDC migration, and then conclude with some information about how you can get started on your own migration.


Solution overview

The following diagram shows the data migration and replication workflow with AWS SCT.

In the first step, your AWS SCT data extraction agent completes the full load of your source data to Amazon Redshift. Then the AWS SCT data extraction agent uses a history database in Netezza. The history database captures information about user activity such as queries, query plans, table access, column access, session creation, and failed authentication requests. The data extraction agent extracts information about transactions that you run in your source Netezza database and replicates them to your target Redshift database.

You can start ongoing replication automatically after you complete the full load. Alternatively, you can start CDC at a later time or on a schedule. For more information, refer to Configuring ongoing data replication.

At a high level, the ongoing replication flow is as follows.

At the start of the replication, the data extraction agent captures the last transaction identifier in the history table. The data extraction agent stores this value in the max_createxid variable. To capture the transaction ID, the agent runs the following query:

SELECT max(XID) AS max_createxid 
    FROM <DB_NAME>.<SCHEMA_NAME>."$hist_plan_prolog_n";

If this transaction ID value is different from the CDC start point, then the agent identifies the delta to replicate. This delta includes all transactions for the selected tables that happened after full load or after the previous replication. The data extraction agent selects the updated data from your source table.

From this updated data, AWS SCT creates two temporary tables. The first table includes all rows that you deleted from your source database and the old data of the rows that you updated. The second table includes all rows that you inserted and the new data of the rows that you updated. AWS SCT then uses these tables in JOIN clauses to replicate the changes to your target Redshift database.

Next, AWS SCT copies these tables to your Amazon Simple Storage Service (Amazon S3) bucket and uses this data to update your target Redshift cluster.

After updating your target database, AWS SCT deletes these temporary tables. Next, your data extraction agent sets the value of the CDC start point equal to the captured transaction ID (max_createxid). During the next data replication run, your agent will determine the delta to replicate using this updated CDC start point.

All changes that happen to your source database during the replication run will be captured in the next replication run. Make sure that you repeat the replication steps until the delta is equal to zero for each table that you included in the migration scope. At this point, you can cut over to your new databases.

Configure your source database

In your source Netezza database, create a history database and configure the history logging. Next, grant read permissions for all tables in the history database to the user that you use in the AWS SCT project. This user has the minimal permissions that are required to convert your source database schemas to Amazon Redshift.

Configure AWS SCT

Before you start data migration and replication with AWS SCT, make sure that you download the Netezza and Amazon Redshift drivers. Take note of the path to the folder where you saved these files. You will specify this path in the AWS SCT and data extraction agent settings.

To make sure the data extraction agents work properly, install the latest version of Amazon Corretto 11.

To configure AWS SCT, complete the following steps:

  1. After you create an AWS SCT project, connect to your source and target databases and set up the mapping rules. A mapping rule describes a source-target pair that defines the migration target for your source database schema.
  2. Convert your database schemas and apply them to Amazon Redshift if you haven’t done this yet. Make sure that the target tables exist in your Redshift database before you start data migration.
  3. Now, install the data extraction agent. The AWS SCT installer includes the installation files for data extraction agents in the agents folder. Configure your data extraction agents by adding the listening port number and the path to the source and target database drivers. For the listening port, you can proceed with the default value. For database drivers, enter the path that you noted before.

The following diagram shows how the AWS SCT data extraction agents work.

After you install the data extraction agent, register it in AWS SCT.

  1. Open the data migration view in AWS SCT and choose Register.
  2. Enter the name of your agent, the host name, and the port that you configured in the previous step. For the host name, you can use localhost (0.0.0.0) if you run AWS SCT on the same machine where you installed the data extraction agent.

Create and run a CDC task

Now you can create and manage your data migration and replication tasks. To do so, complete the following steps:

  1. Select the tables in your source database to migrate, open the context (right-click) menu, and choose Create local task.
  2. Choose your data migration mode (for this post, choose Extract, upload and copy to replicate data changes from your source database):
    1. Extract only – Extract your data and save it to your local working folders.
    2. Extract and upload – Extract your data and upload it to Amazon S3.
    3. Extract, upload and copy – Extract your data, upload it to Amazon S3, and copy it into your Redshift data warehouse.
  3. Choose your encryption type. Make sure that you configure encryption for safe and secure data migrations.
  4. Select Enable CDC.

  1. After this, you can switch to the CDC settings tab.
  2. For CDC mode, you can choose from the following options:
    1. Migrate existing data and replicate ongoing changes – Migrate all existing source data and then start the replication. This is the default option.
    2. Replicate data changes only – Start data replication immediately.

Sometimes you don’t need to migrate all existing source data. For example, if you have already migrated your data, you can start the data replication by choosing Replicate data changes only.

  1. If you choose Replicate data changes only, you can also set the Last CDC point to configure the replication to start from this point. If it is not set, AWS SCT data extraction agents replicate all changes that occur after your replication task is started.

If your replication task failed, you can restart the replication from the point of failure. You can find the identifier of the last migrated CDC point on the CDC processing details tab in AWS SCT, set Last CDC point and start the task again. This will allow AWS SCT data extraction agents to replicate all changes in your source tables to your target database without gaps.

  1. You can also configure when you want to schedule the CDC runs to begin.

If you select Immediately, the first replication runs immediately after your agent completes the full load. Alternatively, you can specify the time and date when you want to start the replication.

  1. Also, you can schedule when to run the replication again. You can enter the number of days, hours, or minutes when to repeat the replication runs. Set these values depending on the intensity of data changes in your source database.
  2. Finally, you can set the end date when AWS SCT will stop running the replication.

  1. On the Amazon S3 settings tab, you can connect your AWS SCT data extraction agent with your Amazon S3 bucket.

You don’t need to do this step if you have configured the AWS service profile in the global application settings.

  1. After you have configured all settings, choose Create to create a CDC task.

  1. Start this task in the AWS SCT user interface.

The following screenshots show examples of the AWS SCT user interface once you started tasks.

You can run multiple CDC tasks in parallel at the same time. For example, you can include different sets of source tables in each task to replicate the changes to different target Redshift clusters. AWS SCT handles these replication tasks and distributes resources correctly to minimize the replication time.

Data replication limitations

There are a few limitations in AWS SCT data replication:

  • Changes in your source database don’t trigger the replication run because AWS SCT isn’t able to automate these runs (as of this writing). You can instead run the data replication tasks on a predefined schedule.
  • AWS SCT doesn’t replicate TRUNCATE and DDL statements. If you change the structure of your source table or truncate it, then you must run the same statements in your target database. You should make these changes manually because AWS SCT isn’t aware of structure updates.

End-to-end example

Now that you know how to create a local replication task in AWS SCT, we deep dive and show how AWS SCT performs the extract and load processes.

  1. First, we run the following code to check that we correctly configured our source Netezza database. To use this code example, change the name of your history database.
SELECT COUNT(*) FROM HISTDB.DBE."$hist_column_access_1";

If you configured your database correctly, then the output of this command includes a value that is different from zero. In our case, the result is as follows:

 COUNT |
-------+
2106717|

  1. Now we create a table on Netezza to use in the example. The table has three columns and a primary key.
DROP TABLE NZ_DATA4EXTRACTOR.CDC_DEMO IF EXISTS; 
CREATE TABLE NZ_DATA4EXTRACTOR.CDC_DEMO 
(
    ID INTEGER NOT NULL, 
    TS TIMESTAMP DEFAULT CURRENT_TIMESTAMP, 
    CMNT CHARACTER(16)
)
DISTRIBUTE ON RANDOM; 

ALTER TABLE NZ_DATA4EXTRACTOR.CDC_DEMO 
    ADD CONSTRAINT CDC_DEMO_PK PRIMARY KEY (ID); 

SELECT * 
    FROM NZ_DATA4EXTRACTOR.CDC_DEMO 
    ORDER BY ID;
  1. The SELECT statement returns an empty table:
ID|TS|CMNT|
--+--+----+
  1. Before we start the replication, we run the following query on Netezza to get the latest transaction identifier in the history table:
SELECT MAX(XID) AS XID 
    FROM HISTDB.DBE."$hist_plan_prolog_1";

For our test table, the script prints the last transaction identifier, which is 2691798:

XID    |
-------+
2691798|
  1. To make sure that our table doesn’t include new transactions, AWS SCT runs the following script. If you want to run this script manually, replace 2691798 with the last transaction identifier in your history table.
SELECT MAX(CREATEXID) AS CREATEXID
    FROM NZ_DATA4EXTRACTOR.CDC_DEMO 
    WHERE CREATEXID > 2691798;

As expected, the script doesn’t return any values.

CREATEXID|
---------+
         |

CREATEXID and DELETEXID are hidden system columns that exist in every Netezza table. CREATEXID identifies the transaction ID that created the row, and DELETEXID identifies the transaction ID that deleted the row. AWS SCT uses them to find changes in the source data.

Now we’re ready to start the replication.

  1. We assume you’ve used AWS SCT to convert the example table and build it on the target Amazon Redshift. AWS SCT runs the following statement on Amazon Redshift:
DROP TABLE IF EXISTS nz_data4extractor_nz_data4extractor.cdc_demo;

CREATE TABLE nz_data4extractor_nz_data4extractor.cdc_demo
(
    id INTEGER ENCODE AZ64 NOT NULL,
    ts TIMESTAMP WITHOUT TIME ZONE ENCODE AZ64 DEFAULT SYSDATE::TIMESTAMP,
    cmnt CHARACTER VARYING(48) ENCODE LZO
)
DISTSTYLE AUTO;

ALTER TABLE nz_data4extractor_nz_data4extractor.cdc_demo
    ADD CONSTRAINT cdc_demo_pk PRIMARY KEY (id);
  1. AWS SCT also creates a staging table that holds replication changes until they can be applied on the actual target table:
CREATE TABLE IF NOT EXISTS "nz_data4extractor_nz_data4extractor"."_cdc_unit"
    (LIKE "nz_data4extractor_nz_data4extractor"."cdc_demo" INCLUDING  DEFAULTS);
ALTER TABLE "nz_data4extractor_nz_data4extractor"."_cdc_unit" 
    ADD COLUMN deletexid_ BIGINT;
  1. AWS SCT runs the following query to capture all changes that happened after the last transaction identifier:
SET show_deleted_records = true;

SELECT 
    ID, 
    TS, 
    CMNT
FROM NZ_DATA4EXTRACTOR.CDC_DEMO
WHERE CREATEXID <= 2691798
    AND (DELETEXID = 0 OR DELETEXID > 2691798)

This script returns an empty table:

ID|TS|CMNT|
--+--+----+
  1. Now, we change data on Netezza and see how it gets replicated to Amazon Redshift:
INSERT INTO NZ_DATA4EXTRACTOR.CDC_DEMO (ID, CMNT) VALUES (1, 'One');
INSERT INTO NZ_DATA4EXTRACTOR.CDC_DEMO (ID, CMNT) VALUES (2, 'Two');
INSERT INTO NZ_DATA4EXTRACTOR.CDC_DEMO (ID, CMNT) VALUES (3, 'Three');
INSERT INTO NZ_DATA4EXTRACTOR.CDC_DEMO (ID, CMNT) VALUES (4, 'Four');
INSERT INTO NZ_DATA4EXTRACTOR.CDC_DEMO (ID, CMNT) VALUES (5, 'Five');

SELECT * FROM NZ_DATA4EXTRACTOR.CDC_DEMO
    ORDER BY ID;

The preceding script returns the following result:

ID|TS                     |CMNT            |
--+-----------------------+----------------+
 1|2023-03-07 14:05:11.000|One             |
 2|2023-03-07 14:05:11.000|Two             |
 3|2023-03-07 14:05:11.000|Three           |
 4|2023-03-07 14:05:11.000|Four            |
 5|2023-03-07 14:05:11.000|Five            |
  1. AWS SCT checks for data changes starting from the last transaction ID using the following query:
SET show_deleted_records = true;

SELECT MAX(CREATEXID) AS CREATEXID
    FROM NZ_DATA4EXTRACTOR.CDC_DEMO
    WHERE createxid > 2691798

The script returns a result that is different from zero:

CREATEXID|
---------+
  2691824|

Because the new transaction ID is greater than the last transaction ID, the history table contains new data to be replicated.

  1. AWS SCT runs the following query to extract the changes. The application detects all rows that were inserted, deleted, or updated within the scope of transactions with IDs in range from 2691798 + 1 to 2691824.
SELECT
    ID,
    TS,
    CMNT,
    deletexid_
FROM (
    SELECT
        createxid,
        rowid,
        deletexid,
        2691798 AS min_CDC_trx,
        2691824 AS max_CDC_trx,
        CASE WHEN deletexid > max_CDC_trx
            THEN 0
            ELSE deletexid
            END AS deletexid_,
        MIN(createxid) OVER (PARTITION BY rowid) AS min_trx,
        COUNT(1) OVER (PARTITION BY rowid) AS rowid_cnt,
        ID,
        TS,
        CMNT
    FROM NZ_DATA4EXTRACTOR.CDC_DEMO AS t
    WHERE deletexid <> 1
        AND (CREATEXID > min_CDC_trx OR deletexid_ > min_CDC_trx) -- Prior run max trx
        AND CREATEXID <= max_CDC_trx -- Current run max trx
    ) AS r
WHERE (min_trx = createxid OR deletexid_ = 0)
    AND NOT (
        CREATEXID > min_CDC_trx 
        AND deletexid <= max_CDC_trx 
        AND rowid_cnt = 1 
        AND deletexid > 0
        )

The extracted data is as follows:

ID|TS                     |CMNT            |DELETEXID_|
--+-----------------------+----------------+----------+
 1|2023-03-07 14:05:11.000|One             |         0|
 2|2023-03-07 14:05:11.000|Two             |         0|
 3|2023-03-07 14:05:11.000|Three           |         0|
 4|2023-03-07 14:05:11.000|Four            |         0|
 5|2023-03-07 14:05:11.000|Five            |         0|
  1. Next, AWS SCT compresses the data and uploads it to Amazon S3. Then AWS SCT runs the following command to copy the data into the staging table on Amazon Redshift:
TRUNCATE TABLE "nz_data4extractor_nz_data4extractor"."_cdc_unit"; 

COPY "nz_data4extractor_nz_data4extractor"."_cdc_unit" 
    ("id", "ts", "cmnt", "deletexid_")
    FROM 's3://bucket/folder/unit_1.manifest' MANIFEST
    CREDENTIALS '...'
    REGION '...'
    REMOVEQUOTES
    IGNOREHEADER 1
    GZIP
    DELIMITER '|';
  1. From the staging table, AWS SCT applies the changes to the actual target table. For this iteration, we insert new rows into the Redshift table:
INSERT INTO "nz_data4extractor_nz_data4extractor"."cdc_demo"("id", "ts", "cmnt")
SELECT 
        "id", 
        "ts", 
        "cmnt" 
    FROM "nz_data4extractor_nz_data4extractor"."_cdc_unit" t2
    WHERE t2.deletexid_ = 0;
  1. Let’s run another script that not only inserts, but also deletes and updates data in the source table:
INSERT INTO NZ_DATA4EXTRACTOR.CDC_DEMO (ID, CMNT) VALUES (6, 'Six');
INSERT INTO NZ_DATA4EXTRACTOR.CDC_DEMO (ID, CMNT) VALUES (7, 'Seven');
INSERT INTO NZ_DATA4EXTRACTOR.CDC_DEMO (ID, CMNT) VALUES (8, 'Eight');

DELETE FROM NZ_DATA4EXTRACTOR.CDC_DEMO WHERE ID = 1;
DELETE FROM NZ_DATA4EXTRACTOR.CDC_DEMO WHERE ID = 7;

UPDATE NZ_DATA4EXTRACTOR.CDC_DEMO SET CMNT = 'Updated Two' WHERE ID = 2;
UPDATE NZ_DATA4EXTRACTOR.CDC_DEMO SET CMNT = 'Updated Four' WHERE ID = 4;
UPDATE NZ_DATA4EXTRACTOR.CDC_DEMO SET CMNT = 'Replaced Four' WHERE ID = 4;
UPDATE NZ_DATA4EXTRACTOR.CDC_DEMO SET CMNT = 'Yet Again Four' WHERE ID = 4;
UPDATE NZ_DATA4EXTRACTOR.CDC_DEMO SET CMNT = 'Updated Five' WHERE ID = 5;
UPDATE NZ_DATA4EXTRACTOR.CDC_DEMO SET CMNT = 'Updated Eight' WHERE ID = 8;

DELETE FROM NZ_DATA4EXTRACTOR.CDC_DEMO WHERE ID = 5;

SELECT * FROM NZ_DATA4EXTRACTOR.CDC_DEMO
    ORDER BY ID;

The Netezza table contains the following rows:

ID|TS                     |CMNT            |
--+-----------------------+----------------+
 2|2023-03-07 14:05:11.000|Updated Two     |
 3|2023-03-07 14:05:11.000|Three           |
 4|2023-03-07 14:05:11.000|Yet Again Four  |
 6|2023-03-07 14:07:09.000|Six             |
 8|2023-03-07 14:07:10.000|Updated Eight   |
  1. AWS SCT detects the changes as before using the new transaction ID:
SET show_deleted_records = true;

SELECT MAX(CREATEXID) AS CREATEXID
    FROM NZ_DATA4EXTRACTOR.CDC_DEMO
    WHERE createxid > 2691824
CREATEXID|
---------+
  2691872|
SELECT
    ID,
    TS,
    CMNT,
    deletexid_
FROM (
    SELECT
        createxid,
        rowid,
        deletexid,
        2691824 AS min_CDC_trx,
        2691872 AS max_CDC_trx,
        CASE WHEN deletexid > max_CDC_trx
            THEN 0
            ELSE deletexid
            END AS deletexid_,
        MIN(createxid) OVER (PARTITION BY rowid) AS min_trx,
        COUNT(1) OVER (PARTITION BY rowid) AS rowid_cnt,
        ID,
        TS,
        CMNT
    FROM NZ_DATA4EXTRACTOR.CDC_DEMO AS t
    WHERE deletexid <> 1
        AND (CREATEXID > min_CDC_trx OR deletexid_ > min_CDC_trx) -- Prior run max trx
        AND CREATEXID <= max_CDC_trx -- Current run max trx
    ) AS r
WHERE (min_trx = createxid OR deletexid_ = 0)
    AND NOT (
        CREATEXID > min_CDC_trx 
        AND deletexid <= max_CDC_trx 
        AND rowid_cnt = 1 
        AND deletexid > 0
        )

The extracted changes appear as follows:

ID|TS                     |CMNT          |DELETEXID_|
--+-----------------------+--------------+----------+
 1|2023-03-07 14:05:11.000|One           |   2691856|
 2|2023-03-07 14:05:11.000|Two           |   2691860|
 2|2023-03-07 14:05:11.000|Updated Two   |         0|
 4|2023-03-07 14:05:11.000|Four          |   2691862|
 4|2023-03-07 14:05:11.000|Yet Again Four|         0|
 5|2023-03-07 14:05:11.000|Five          |   2691868|
 6|2023-03-07 14:07:09.000|Six           |         0|
 8|2023-03-07 14:07:10.000|Eight         |   2691870|
 8|2023-03-07 14:07:10.000|Updated Eight |         0|

Notice that we inserted a new row with ID 7 and then deleted this row. Therefore, we can ignore the row with ID 7 in our delta.

Also, we made several updates of the row with ID 4. In our delta, we include the original and the most recent versions of the row. We ignore all intermediate versions in our delta.

We updated the row with ID 5 and then deleted this row. We don’t include the updated row in our delta.

This way, AWS SCT optimizes the migrated data, reducing the migration time and the network traffic.

  1. Now, as before, AWS SCT compresses, uploads to Amazon S3, and copies the data into the staging Redshift table:
TRUNCATE TABLE "nz_data4extractor_nz_data4extractor"."_cdc_unit";

COPY "nz_data4extractor_nz_data4extractor"."_cdc_unit" 
    ("id", "ts", "cmnt", "deletexid_")
    FROM 's3://bucket/folder/unit_2.manifest' MANIFEST
    CREDENTIALS '...'
    REGION '...'
    REMOVEQUOTES
    IGNOREHEADER 1
    GZIP
    DELIMITER '|';  
  1. Then, AWS SCT applies the changes to the target table. AWS SCT removes the deleted rows, removes the old version of updated rows, and then inserts new rows and the most recent version of any updated rows:
DELETE 
    FROM  "nz_data4extractor_nz_data4extractor"."cdc_demo"
    USING "nz_data4extractor_nz_data4extractor"."_cdc_unit" t2
    WHERE "nz_data4extractor_nz_data4extractor"."cdc_demo"."id" = t2."id"
        AND COALESCE(CAST("nz_data4extractor_nz_data4extractor"."cdc_demo"."ts" AS VARCHAR),'?#-*') = COALESCE(CAST(t2."ts" AS VARCHAR),'?#-*')
        AND COALESCE(CAST("nz_data4extractor_nz_data4extractor"."cdc_demo"."cmnt" AS VARCHAR),'?#-*') = COALESCE(CAST(t2."cmnt" AS VARCHAR),'?#-*')
        AND t2.deletexid_ > 0;
  
INSERT INTO "nz_data4extractor_nz_data4extractor"."cdc_demo"("id", "ts", "cmnt")
SELECT 
    "id", 
    "ts", 
    "cmnt"
FROM "nz_data4extractor_nz_data4extractor"."_cdc_unit" t2
WHERE t2.deletexid_ = 0;
  1. You can compare the data on the source and target to verify that AWS SCT captured all changes correctly:
SELECT * FROM NZ_DATA4EXTRACTOR.CDC_DEMO
    ORDER BY ID;
ID|TS                     |CMNT            |
--+-----------------------+----------------+
 2|2023-03-07 14:05:11.000|Updated Two     |
 3|2023-03-07 14:05:11.000|Three           |
 4|2023-03-07 14:05:11.000|Yet Again Four  |
 6|2023-03-07 14:07:09.000|Six             |
 8|2023-03-07 14:07:10.000|Updated Eight   |

The data on Amazon Redshift matches exactly:

id|ts                        |cmnt
 2|2023-03-07 14:05:11.000000|Updated Two
 3|2023-03-07 14:05:11.000000|Three
 4|2023-03-07 14:05:11.000000|Yet Again Four
 6|2023-03-07 14:07:09.000000|Six
 8|2023-03-07 14:07:10.000000|Updated Eight

In the previous examples, we showed how to run full load and CDC tasks. You can also create a CDC migration task without the full load. The process is the same—you provide AWS SCT with the transaction ID to start the replication from.

The CDC process does not have a significant impact on the source side. AWS SCT runs only SELECT statements there, using the transaction ID as boundaries for the WHERE clause. The performance impact of these statements is always smaller than the impact of DML statements generated by customer’s applications. For machines where AWS SCT data extraction agents are running, the CDC-related workload is always smaller than the full load workload because the volume of transferred data is smaller.

On the target side, the CDC process can generate considerable additional workload for Amazon Redshift. The reason is that this process issues INSERT and DELETE statements, and these can result in overhead for massively parallel processing (MPP) systems such as Amazon Redshift. Refer to Top 10 performance tuning techniques for Amazon Redshift for best practices and tips on how to boost the performance of your Redshift cluster.

Conclusion

In this post, we showed how to configure ongoing data replication for Netezza database migration to Amazon Redshift. You can use the described approach to automate data migration and replication from your IBM Netezza database to Amazon Redshift. Or, if you’re considering a migration of your existing Netezza workloads to Amazon Redshift, you can use AWS SCT to automatically convert your database schemas and migrate data. Download the latest version of AWS SCT and give it a try!

We’re happy to share these updates to help you in your data warehouse migration projects. In the meantime, you can learn more about Amazon Redshift and AWS SCT. Happy migrating!


About the Authors

Mykhailo Kondak is a Database Engineer in the AWS Database Migration Service team at AWS. He uses his experience with different database technologies to help Amazon customers to move their on-premises data warehouses and big data workloads to the AWS Cloud. In his spare time, he plays soccer.

Illia Kravtsov is a Database Engineer on the AWS Database Migration Service team. He has over 10 years of experience in data warehouse development with Teradata and other massively parallel processing (MPP) databases.

Michael Soo is a Principal Database Engineer in the AWS Database Migration Service. He builds products and services that help customers migrate their database workloads to the AWS Cloud.

Automate marketing campaigns with real-time customer data using Amazon Pinpoint

Post Syndicated from Rushabh Lokhande original https://aws.amazon.com/blogs/messaging-and-targeting/automate-marketing-campaigns-with-real-time-customer-data-using-amazon-pinpoint/

Amazon Pinpoint offers marketers and developers one customizable tool to deliver customer communications across channels, segments, and campaigns at scale. Amazon Pinpoint makes it easy to run targeted campaigns and drive customer communications across different channels: email, SMS, push notifications, in-app messaging, or custom channels. Amazon Pinpoint campaigns enable you to define which users to target, determine which messages to send, schedule the best time to deliver the messages, and then track the results of your campaign.

In many cases, the customer data resides in a third-party system such as a CRM, customer data platform, point of sale system, database, or data warehouse. This customer data represents a valuable asset for your organization, and your marketing team needs to leverage each piece of it to elevate the customer experience.

In this blog post, we demonstrate how you can leverage users’ clickstream data stored in a database to build user segments and launch campaigns using Amazon Pinpoint. We also showcase the full architecture of the data pipeline, which includes other AWS services such as Amazon RDS, AWS Database Migration Service, Amazon Kinesis, and AWS Lambda.

Let us understand the case study with an example: a customer has digital touchpoints such as a website and a mobile app that collect users’ clickstream and behavioral data, which is stored in a MySQL database. The marketing team wants to use the collected data to deliver a personalized experience by leveraging Amazon Pinpoint capabilities.

You can find below the detail of a specific use case covered by the proposed solution:

  • All the clickstream and customer data are stored in MySQL DB
  • Your marketing team wants to create personalized Amazon Pinpoint campaigns based on the user’s status and experience. For example:
    • Customers who are interested in a specific offering receive a campaign tailored to that interest
    • Communications are sent in each user’s preferred language

Note that this example is used to showcase the capabilities of the proposed solution. The solution is not limited to this specific use case, because you can leverage any collected customer dimension or attribute to create a campaign for your own marketing use case.

In this post, we provide a guided journey on how marketers can collect, segment, and activate audience segments in real-time to increase their agility in managing campaigns.

Overview of solution

The use case covered in this post focuses on demonstrating the flexibility offered by Amazon Pinpoint in both the inbound (ingestion) and outbound (activation) streams of customer data. For the inbound stream, Amazon Pinpoint gives you a variety of ways to import your customer data, including:

  1. CSV/JSON import from the AWS console
  2. API operation to create a single or multiple endpoints
  3. Programmatically create and execute import jobs (see the sketch after this list)
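As an illustration of the third option, the following is a minimal boto3 sketch, assuming a placeholder Pinpoint project ID, S3 URL, and IAM role ARN that allows Amazon Pinpoint to read the file.

import boto3

pinpoint = boto3.client("pinpoint", region_name="us-east-1")

# Placeholder identifiers; replace with your project ID, S3 location, and role
response = pinpoint.create_import_job(
    ApplicationId="your-pinpoint-project-id",
    ImportJobRequest={
        "Format": "CSV",                                   # or JSON
        "S3Url": "s3://your-bucket/endpoints/import.csv",  # file with endpoint definitions
        "RoleArn": "arn:aws:iam::111122223333:role/PinpointS3ImportRole",
        "RegisterEndpoints": True,                         # create or update endpoints in the project
    },
)
print("Import job ID:", response["ImportJobResponse"]["Id"])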

We will focus on building a real-time inbound stream of customer data available within an Amazon RDS MySQL database. It is important to mention that a similar approach can be implemented to ingest data from third-party systems, if needed.

For the outbound stream, activating customer data using Amazon Pinpoint can be achieved using the following two methods:

  1. Campaign: a campaign is a messaging initiative that engages a specific audience segment.
  2. Journey: a journey is a customized, multi-step engagement experience.

Customer data activation cannot be completed without specifying the target channel. A channel represents the platform through which you engage your audience segment with messages. For example, Amazon Pinpoint customers can optimize how they target notifications to prospective customers through LINE messages and email, delivering notifications about relevant product information such as sales and new products to the appropriate audience.

Amazon Pinpoint supports the following channels:

  • Push notifications
  • Email
  • SMS
  • Voice
  • In-app messages

In addition to these channels, you can extend the capabilities to meet your specific use case by creating custom channels. You can use custom channels to send messages to your customers through any service that has an API, including third-party services such as WhatsApp or Facebook Messenger. We will focus on developing an Amazon Pinpoint connector that uses a custom channel to target your customers on third-party services through an API, as sketched below.
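The following is a minimal sketch of what such a custom channel Lambda handler could look like, assuming that Amazon Pinpoint passes an Endpoints map in the invocation payload and that the third-party API is represented by a placeholder webhook URL; it is an illustration rather than the exact function deployed by the CDK stack in this post.

import json
import urllib.request

# Placeholder third-party endpoint (for example, a Webhook.site URL used for testing)
THIRD_PARTY_URL = "https://webhook.site/your-unique-id"

def handler(event, context):
    # Amazon Pinpoint invokes the custom channel function with a payload that
    # includes an "Endpoints" map of endpoint ID to endpoint definition.
    endpoints = event.get("Endpoints", {})
    for endpoint_id, endpoint in endpoints.items():
        payload = {
            "endpointId": endpoint_id,
            "address": endpoint.get("Address"),
            "attributes": endpoint.get("Attributes", {}),
            "user": endpoint.get("User", {}),
        }
        request = urllib.request.Request(
            THIRD_PARTY_URL,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        # Deliver the message to the third-party service
        with urllib.request.urlopen(request) as response:
            print(f"Sent endpoint {endpoint_id}, HTTP status {response.status}")
    return {"endpointsProcessed": len(endpoints)}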

Solution Architecture

The below diagram illustrates the proposed architecture to address the use case. Moving from left to right:

Fig 1: Architecture Diagram for the Solution


  1. Amazon RDS: This hosts the customer database, which can have one or many tables containing customer data.
  2. AWS Database Migration Service (AWS DMS): This acts as the glue between Amazon RDS for MySQL and the downstream services by replicating any record-level change that happens in the configured customer tables.
  3. Amazon Kinesis Data Streams: This is the target endpoint for AWS DMS. It carries all the changed records to the next stage of the pipeline.
  4. AWS Lambda (inbound): The inbound AWS Lambda function is triggered by the Kinesis data stream, processes the mutated records, and ingests them into Amazon Pinpoint (see the sketch after this list).
  5. Amazon Pinpoint: This acts as the centralized place to define customer segments and launch campaigns.
  6. AWS Lambda (outbound): This acts as the custom channel destination for the campaigns activated from Amazon Pinpoint.
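As an illustration of the inbound function, the following is a minimal sketch, assuming the default AWS DMS JSON envelope on the Kinesis record (a data section with the row values and a metadata section) and a placeholder Pinpoint project ID supplied through an environment variable; the actual function deployed by the CDK stack may differ.

import base64
import json
import os

import boto3

pinpoint = boto3.client("pinpoint")
APPLICATION_ID = os.environ["PINPOINT_PROJECT_ID"]  # placeholder environment variable

def handler(event, context):
    for record in event["Records"]:
        # Kinesis payloads are base64 encoded; DMS writes a JSON envelope
        # with "data" (row values) and "metadata" (operation details) sections.
        message = json.loads(base64.b64decode(record["kinesis"]["data"]))
        row = message.get("data", {})
        if not row.get("userid"):
            continue  # skip control records
        pinpoint.update_endpoint(
            ApplicationId=APPLICATION_ID,
            EndpointId=str(row["userid"]),
            EndpointRequest={
                "ChannelType": "CUSTOM",
                "Address": row.get("email", ""),
                "Attributes": {
                    "Language": [row.get("language") or ""],
                    "Favourites": [row.get("favourites") or ""],
                },
                "User": {"UserId": str(row["userid"])},
            },
        )
    return {"recordsProcessed": len(event["Records"])}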

To illustrate how to set up this architecture, we’ll walk you through the following steps:

  1. Deploy an AWS CDK stack to provision the required AWS resources.
  2. Validate the deployment.
  3. Test the solution with a sample end-to-end data flow.
  4. Clean up your resources.

Prerequisites

Make sure that you complete the following steps as prerequisites:

The Solution

Launching your AWS CDK Stack

Step 1a: Open your device’s command line or Terminal.

Step 1b: Check out the Git repository to a local directory on your device:

git clone https://github.com/aws-samples/amazon-pinpoint-realtime-campaign-optimization-example.git

Step 2: Change directories to the new directory code location:

cd amazon-pinpoint-realtime-campaign-optimization-example

Step 3: Update your AWS account number and region:

  1. Edit config.py with your tool of choice or from the command line
  2. Look for the section “Account Setup” and update your account number and Region

    Fig 2: Configuring config.py for account-id and region


  3. Look for the section “VPC Parameters” and update your VPC and subnet information

    Fig 3: Configuring config.py for VPC and subnet information


Step 4: Verify if you are in the directory where app.py file is located:

ls -ltr app.py

Step 5: Create a virtual environment:

macOS/Linux:

python3 -m venv .env

Windows:

python -m venv .env

Step 6: Activate the virtual environment after the init process completes and the virtual environment is created:

macOS/Linux:

source .env/bin/activate

Windows:

.env\Scripts\activate.bat

Step 7: Install the required dependencies:

pip3 install -r requirements.txt

Step 8: Bootstrap the cdk app using the following command:

cdk bootstrap aws://<AWS_ACCOUNTID>/<AWS_REGION>

Replace the placeholders AWS_ACCOUNTID and AWS_REGION with your AWS account ID and the Region to be deployed.
This step provisions the initial resources, including an Amazon S3 bucket for storing files and IAM roles that grant permissions needed to perform deployments.

Fig 4: Bootstrapping CDK environment


Note that if you have already bootstrapped the same account and Region previously, you don’t need to bootstrap it again; in that case, skip this step or use a new AWS account.

Step 9: Make sure that your AWS profile is setup along with the region that you want to deploy as mentioned in the prerequisite. Synthesize the templates. AWS CDK apps use code to define the infrastructure, and when run they produce or “synthesize” a CloudFormation template for each stack defined in the application:

cdk synthesize

Step 10: Deploy the solution. By default, some actions that could potentially make security changes require approval. In this deployment, you’re creating an IAM role. The following command overrides the approval prompts, but if you would like to manually accept the prompts, then omit the --require-approval never flag:

cdk deploy "*" --require-approval never

While the AWS CDK deploys the CloudFormation stacks, you can follow the deployment progress in your terminal.

Fig 5: AWS CDK Deployment progress in terminal


Once the deployment is successful, you’ll see the successful status as follows:

Fig 6: AWS CDK Deployment completion success


Step 11: Log in to the AWS Console, go to CloudFormation, and see the output of the ApplicationStack:

Fig 7: AWS CloudFormation stack output


Note the values of the PinpointProjectId, PinpointProjectName, and RDSSecretName variables. We’ll use them in the following steps.

Testing The Solution

In this section we will create a full data flow using the below steps:

  1. Ingest data into the customer_tb table within the Amazon RDS MySQL DB instance
  2. Validate that the AWS DMS replication task is replicating the changes to the Amazon Kinesis data stream
  3. Validate that endpoints are created within Amazon Pinpoint
  4. Create an Amazon Pinpoint segment and campaign and activate data to the Webhook.site endpoint URL

Step 1: Connect to MySQL DB instance and create customer database

    1. Sign in to the AWS Management Console and open the AWS Cloud9 console at https://console.aws.amazon.com/cloud9 
    2. Click Create environment
      • Name: mysql-cloud9-01 (for example)
      • Click Next
      • Environment type: Create a new EC2 instance for environment (direct access)
      • Instance type: t2.micro
      • Timeout: 30 minutes
      • Platform: Amazon Linux 2
      • Under Network settings (VPC settings), select the same VPC where the MySQL DB instance was created (this is the same VPC and subnet from step 3.3)
      • Click Next
      • Review and click Create environment
    3. Select the created AWS Cloud9 from the AWS Cloud9 console at https://console.aws.amazon.com/cloud9  and click Open in Cloud9. You will have access to AWS Cloud9 Linux shell.
    4. From the Linux shell, update the operating system:
      sudo yum update -y
    5. From the Linux shell, install the MySQL client:
      sudo yum install -y mysql
    6. To connect to the created MySQL RDS DB instance, use the below command in the AWS Cloud9 Linux shell:
      mysql -h <<host>> -P 3308 --user=<<username>> --password=<<password>>
      • To get the values for the host (dbInstanceIdentifier), username, and password:
        • Navigate to the AWS Secrets Manager service
        • Open the secret with the name created by the CDK application
        • Select ‘Reveal secret value’ and copy the respective values and replace in your command
      • After you enter the password for the user, you should see output similar to the following.
        Welcome to the MariaDB monitor.  Commands end with ; or \g.
        Your MySQL connection id is 27
        Server version: 8.0.32 Source distribution
        Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
        Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
        MySQL [(none)]>

Step 2: Ingest data into the customer_tb table within the Amazon RDS MySQL DB instance

Once the connection to the MySQL DB instance is established, execute the following commands using the same AWS Cloud9 Linux shell connected to the MySQL RDS DB instance.

  • Create database pinpoint-test-db:
    CREATE DATABASE `pinpoint-test-db`;
  • Create the customer_tb table:
    Use `pinpoint-test-db`;
    CREATE TABLE `customer_tb` (`userid` int NOT NULL,
                                `email` varchar(150) DEFAULT NULL,
                                `language` varchar(45) DEFAULT NULL,
                                `favourites` varchar(250) DEFAULT NULL,
                                PRIMARY KEY (`userid`));
  • You can verify the schema using the below SQL command:
    DESCRIBE `pinpoint-test-db`.customer_tb;
    Fig 8: Verify schema for customer_db table


    • Insert records in customer_tb table:
    Use `pinpoint-test-db`;
    insert into customer_tb values (1,'[email protected]','english','football');
    insert into customer_tb values (2,'[email protected]','english','basketball');
    insert into customer_tb values (3,'[email protected]','french','football');
    insert into customer_tb values (4,'[email protected]','french','football');
    insert into customer_tb values (5,'[email protected]','french','basketball');
    insert into customer_tb values (6,'[email protected]','french','football');
    insert into customer_tb values (7,'[email protected]','french',null);
    insert into customer_tb values (8,'[email protected]','english','football');
    insert into customer_tb values (9,'[email protected]','english','football');
    insert into customer_tb values (10,'[email protected]','english',null);
    • Verify records in customer_tb table:
    select * from `pinpoint-test-db`.`customer_tb`;
    Fig 9: Verify data for customer_db table


    Step 3: Validate that the AWS DMS replication task is replicating the changes to the Amazon Kinesis data stream

      1. Sign in to the AWS Management Console and open the AWS DMS console at https://console.aws.amazon.com/dms/v2 
      2. From the navigation panel, choose Database migration tasks.
      3. Click on the replication task created by the CDK code, ‘dmsreplicationtask-*’
      4. Start the replication task

        Fig 10: Starting AWS DMS Replication Task


      5. Make sure that Status is Replication ongoing

        Fig 11: AWS DMS Replication statistics


      6. Navigate to Table statistics and make sure that the number of Inserts is equal to 10 and the Load state is Table completed.

        Fig 12: AWS DMS Replication statistics


    Step 4: Validate that endpoints are created within Amazon Pinpoint

    1. Sign in to the AWS Management Console and open the Amazon Pinpoint console at https://console.aws.amazon.com/pinpoint/ 
    2. Click on Amazon Pinpoint Project Demo created by CDK stack “dev-pinpoint-project”
    3. From the left menu, click on Analytics and validate that the Active targetable endpoints are equal to 10 as shown below:
    Fig 13: Amazon Pinpoint endpoint summary


    Step 5: Create Amazon Pinpoint Segment and Campaign

    Step 5.1: Create Amazon Pinpoint Segment

    • Sign in to the AWS Management Console and open the Amazon Pinpoint console at https://console.aws.amazon.com/pinpoint/ 
    • Click on Amazon Pinpoint Project Demo created by CDK stack “dev-pinpoint-project”
    • From the left menu, click on Segments and click Create a segment
    • Create the segment using the following configuration:
      • Name: English Speakers
      • Under criteria:
        • Attribute: Language
        • Operator: Contains
        • Value: english
    Fig 14: Amazon Pinpoint segment summary


    • Click create segment

    Step 5.2: Create Amazon Pinpoint Campaign

    • From the left menu, click on Campaigns and click Create a campaign
    • Set the Campaign name to test campaign and select the Custom option for Channel as shown below:
    Fig 15: Amazon Pinpoint create campaign


    • Click Next
    • Select English Speakers from Segment as shown below and click Next:
    Fig 16: Amazon Pinpoint segment


    • Choose the Lambda function channel type and select the outbound Lambda function with the name pattern ApplicationStack-lambdaoutboundfunction* from the dropdown as shown below:
    Fig 17: Amazon Pinpoint message creation


    • Click Next
    • Choose the At a specific time option and Immediately to send the campaign as shown below:
    Fig 18: Amazon Pinpoint campaign scheduling


    If you push more messages or records into Amazon RDS (step 2), you will need to create a new campaign (step 5.2) to process the new messages.

    • Click Next, review the configuration and click Launch campaign.
    • Navigate to dev-pinpoint-project and select the campaign created in the previous step. You should see the status ‘Complete’.
    Fig 19: Amazon Pinpoint campaign processing status


    • Navigate to the dev-pinpoint-project dashboard and select your campaign in the ‘Campaign metrics’ dashboard; you will see the statistics for the processing.
    Fig 20: Amazon Pinpoint campaign metrics


    Accomplishments

    This is a quick summary of what we accomplished:

    1. Created an Amazon RDS MySQL DB instance and defined the customer_tb table schema
    2. Created an Amazon Kinesis data stream
    3. Replicated database changes from the Amazon RDS MySQL DB to the Amazon Kinesis data stream
    4. Created an AWS Lambda function triggered by the Amazon Kinesis data stream to ingest database records into Amazon Pinpoint as user endpoints using the AWS SDK
    5. Created an Amazon Pinpoint project, segment, and campaign
    6. Created an AWS Lambda function as a custom channel for the Amazon Pinpoint campaign
    7. Tested the end-to-end data flow from the Amazon RDS MySQL DB instance to a third-party endpoint

    Next Steps

    You have now gained a good understanding of the Amazon Pinpoint agnostic data flow, but there are still many areas left for exploration. What this post hasn’t covered is the operation of other communication channels such as email, SMS, push notifications, and outbound voice. You can enable the channels that are pertinent to your use case and send messages using campaigns or journeys.

    Clean Up

    Make sure that you clean up all of the other AWS resources that you created in the AWS CDK Stack deployment. You can delete these resources via the AWS CDK Destroy command as follows or the CloudFormation console.

    To destroy the resources using AWS CDK, follow these steps:

    • Follow Steps 1-5 from the ‘Launching your CDK Stack’ section.
    • Destroy the app by executing the following command:
    cdk destroy

    Summary

    In this post, you have gained a good understanding of the flexible real-time data flow of Amazon Pinpoint. By implementing the steps detailed in this blog post, you can achieve a seamless integration of your customer data from an Amazon RDS MySQL database to Amazon Pinpoint, where you can leverage segments and campaigns to activate data using custom channels to third-party services via API. The demonstrated use case focuses on an Amazon RDS MySQL database as the data source, but there are still many areas left for exploration. What this post hasn’t covered is integrating customer data from other types of data sources such as MongoDB, Microsoft SQL Server, or Google Cloud. Also, other communication channels such as email, SMS, push notifications, and outbound voice can be used in the activation layer. You can enable the channels that are pertinent to your use case, send messages using campaigns or journeys, get a complete view of your customers across all touchpoints, and deliver more relevant marketing campaigns.

    About the Authors

    Bret Pontillo is a Senior Data Architect with AWS Professional Services Analytics Practice. He helps customers implement big data and analytics solutions. Outside of work, he enjoys spending time with family, traveling, and trying new food.

    Rushabh Lokhande is a Data & ML Engineer with AWS Professional Services Analytics Practice. He helps customers implement big data, machine learning, and analytics solutions. Outside of work, he enjoys spending time with family, reading, running, and golf.

    Ghandi Nader is a Senior Partner Solution Architect focusing on the Adtech and Martech industry. He helps customers and partners innovate and align with the market trends related to the industry. Outside of work, he enjoys spending time cycling and watching formula one.

Orchestrate Amazon EMR Serverless jobs with AWS Step Functions

Post Syndicated from Naveen Balaraman original https://aws.amazon.com/blogs/big-data/orchestrate-amazon-emr-serverless-jobs-with-aws-step-functions/

Amazon EMR Serverless provides a serverless runtime environment that simplifies the operation of analytics applications that use the latest open source frameworks, such as Apache Spark and Apache Hive. With EMR Serverless, you don’t have to configure, optimize, secure, or operate clusters to run applications with these frameworks. You can run analytics workloads at any scale with automatic scaling that resizes resources in seconds to meet changing data volumes and processing requirements. EMR Serverless automatically scales resources up and down to provide just the right amount of capacity for your application, and you only pay for what you use.

AWS Step Functions is a serverless orchestration service that enables developers to build visual workflows for applications as a series of event-driven steps. Step Functions ensures that the steps in the serverless workflow are followed reliably, that information is passed between stages, and that errors are handled automatically.

The integration between AWS Step Functions and Amazon EMR Serverless makes it easier to manage and orchestrate big data workflows. Before this integration, you had to manually poll for job statuses or implement waiting mechanisms through API calls. Now, with the support for “Run a Job (.sync)” integration, you can more efficiently manage your EMR Serverless jobs. Using .sync allows your Step Functions workflow to wait for the EMR Serverless job to complete before moving on to the next step, effectively making job execution part of your state machine. Similarly, the “Request Response” pattern can be useful for triggering a job and immediately getting a response back, all within the confines of your Step Functions workflow. This integration simplifies your architecture by eliminating the need for additional steps to monitor job status, making the whole system more efficient and easier to manage.
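
For comparison, the following is a minimal boto3 sketch of the manual polling pattern that the .sync integration removes; the application ID, role ARN, and script location are placeholders rather than values from this post:

import time
import boto3

emr = boto3.client("emr-serverless")

# Submit a job run to an existing EMR Serverless application (IDs and paths are placeholders)
job = emr.start_job_run(
    applicationId="00example-app-id",
    executionRoleArn="arn:aws:iam::111122223333:role/EMR-Serverless-Role",
    jobDriver={"sparkSubmit": {"entryPoint": "s3://my-bucket/scripts/job.py"}},
)

# Poll until the job reaches a terminal state -- the loop that Step Functions now handles for you
while True:
    state = emr.get_job_run(
        applicationId="00example-app-id", jobRunId=job["jobRunId"]
    )["jobRun"]["state"]
    if state in ("SUCCESS", "FAILED", "CANCELLED"):
        print(f"Job finished with state {state}")
        break
    time.sleep(30)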

In this post, we explain how you can orchestrate a PySpark application using Amazon EMR Serverless and AWS Step Functions. We run a Spark job on EMR Serverless that processes Citi Bike dataset data in an Amazon Simple Storage Service (Amazon S3) bucket and stores the aggregated results in Amazon S3.

Solution Overview

We demonstrate this solution with an example using the Citi Bike dataset. This dataset includes numerous parameters such as Rideable type, Start station, Started at, End station, Ended at, and various other details about Citi Bike rides. Our objective is to find the minimum, maximum, and average bike trip duration in a given month.

In this solution, the input data is read from the S3 input path, transformations and aggregations are applied with the PySpark code, and the summarized output is written to the S3 output path s3://<bucket-name>/serverlessout/.
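
The PySpark script used in this post is provided as a prebuilt artifact, but the aggregation it performs is conceptually similar to the following sketch; the column names (started_at, ended_at) and the paths are assumptions for illustration, not the exact contents of bikeaggregator.py:

import sys
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Input and output S3 paths are passed as job arguments
input_path, output_path = sys.argv[1], sys.argv[2]

spark = SparkSession.builder.appName("BikeAggregator").getOrCreate()

trips = spark.read.option("header", "true").csv(input_path)

# Derive trip duration in seconds from the start and end timestamps
trips = trips.withColumn(
    "duration_sec",
    F.unix_timestamp("ended_at") - F.unix_timestamp("started_at"),
)

# Minimum, maximum, and average trip duration for the month
summary = trips.agg(
    F.min("duration_sec").alias("min_duration"),
    F.max("duration_sec").alias("max_duration"),
    F.avg("duration_sec").alias("avg_duration"),
)

summary.write.mode("overwrite").csv(output_path, header=True)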

The solution is implemented as follows:

  • Creates an EMR Serverless application with Spark runtime. After the application is created, you can submit the data-processing jobs to that application. This API step waits for Application creation to complete.
  • Submits the PySpark job and waits for its completion with the StartJobRun (.sync) API. This allows you to submit a job to an Amazon EMR Serverless application and wait until the job completes.
  • After the PySpark job completes, the summarized output is available in the S3 output directory.
  • If the job encounters an error, the state machine workflow will indicate a failure. You can inspect the specific error within the state machine. For a more detailed analysis, you can also check the EMR job failure logs in the EMR studio console.

Prerequisites

Before you get started, make sure you have the following prerequisites:

  • An AWS account
  • An IAM user with administrator access
  • An S3 bucket

Solution Architecture

To automate the complete process, we use the following architecture, which integrates Step Functions for orchestration and Amazon EMR Serverless for data transformations. Summarized output is then written to Amazon S3 bucket.

The following diagram illustrates the architecture for this use case.

Deployment steps

Before beginning this tutorial, ensure that the role being used to deploy has all the relevant permissions to create the required resources as part of the solution. The roles with the appropriate permissions will be created through a CloudFormation template using the following steps.

Step 1: Create a Step Functions state machine

You can create a Step Functions state machine workflow in two ways: directly through code, or through the Step Functions Workflow Studio graphical interface. To create a state machine, follow the steps in either Option 1 or Option 2 below.

Option 1: Create the state machine through code directly

To create a Step Functions state machine along with the necessary IAM roles, complete the following steps:

  1. Launch the CloudFormation stack using this link. On the CloudFormation console, provide a stack name and accept the defaults to create the stack. Once the CloudFormation deployment completes, the following resources are created. In addition, the EMR service-linked role needed to access EMR Serverless is created automatically by this stack:
    • S3 bucket to upload the PySpark script and write output data from EMR Serverless job. We recommend enabling default encryption on your S3 bucket to encrypt new objects, as well as enabling access logging to log all requests made to the bucket. Following these recommendations will improve security and provide visibility into access of the bucket.
    • EMR Serverless Runtime role that provides granular permissions to specific resources that are required when EMR Serverless jobs run.
    • Step Functions Role to grant AWS Step Functions permissions to access the AWS resources that will be used by its state machines.
    • State Machine with EMR Serverless steps.

  2. To prepare the S3 bucket with the PySpark script, open AWS CloudShell from the toolbar on the top right corner of the AWS console and run the following AWS CLI command in CloudShell (make sure to replace <<ACCOUNT-ID>> with your AWS account ID):

aws s3 cp s3://aws-blogs-artifacts-public/artifacts/BDB-3594/bikeaggregator.py s3://serverless-<<ACCOUNT-ID>>-blog/scripts/

  3. To prepare the S3 bucket with the input data, run the following AWS CLI command in CloudShell (make sure to replace <<ACCOUNT-ID>> with your AWS account ID):

aws s3 cp s3://aws-blogs-artifacts-public/artifacts/BDB-3594/201306-citibike-tripdata.csv s3://serverless-<<ACCOUNT-ID>>-blog/data/ --copy-props none

Option 2: Create the Step Functions state machine through Workflow Studio

Prerequisites

Before creating the state machine through Workflow Studio, ensure that all the relevant roles and resources are created as part of the solution.

  1. To deploy the necessary IAM roles and S3 bucket into your AWS account, launch the CloudFormation stack using this link. Once the CloudFormation deployment completes, the following resources are created:
    • S3 bucket to upload the PySpark script and write output data. We recommend enabling default encryption on your S3 bucket to encrypt new objects, as well as enabling access logging to log all requests made to the bucket. Following these recommendations will improve security and provide visibility into access of the bucket.
    • EMR Serverless Runtime role that provides granular permissions to specific resources that are required when EMR Serverless jobs run.
    • Step Functions Role to grant AWS Step Functions permissions to access the AWS resources that will be used by its state machines.

  2. To prepare the S3 bucket with the PySpark script, open AWS CloudShell from the toolbar on the top right of the AWS console and run the following AWS CLI command in CloudShell (make sure to replace <<ACCOUNT-ID>> with your AWS account ID):

aws s3 cp s3://aws-blogs-artifacts-public/artifacts/BDB-3594/bikeaggregator.py s3://serverless-<<ACCOUNT-ID>>-blog/scripts/

  3. To prepare the S3 bucket with the input data, run the following AWS CLI command in CloudShell (make sure to replace <<ACCOUNT-ID>> with your AWS account ID):

aws s3 cp s3://aws-blogs-artifacts-public/artifacts/BDB-3594/201306-citibike-tripdata.csv s3://serverless-<<ACCOUNT-ID>>-blog/data/ --copy-props none

To create a Step Functions state machine, complete the following steps:

  1. On the Step Functions console, choose Create state machine.
  2. Keep the Blank template selected, and click Select.
  3. In the Actions menu on the left, Step Functions provides a list of AWS service APIs that you can drag and drop into your workflow graph in the design canvas. Type EMR Serverless in the search box and drag the Amazon EMR Serverless CreateApplication state onto the workflow graph.

  4. In the canvas, select the Amazon EMR Serverless CreateApplication state to configure its properties. The Inspector panel on the right shows configuration options. Provide the following configuration values:
    • Change the State name to Create EMR Serverless Application
    • Provide the following values to the API Parameters. This creates an EMR Serverless Application with Apache Spark based on Amazon EMR release 6.12.0 using default configuration settings.
      {
          "Name": "ServerlessBikeAggr",
          "ReleaseLabel": "emr-6.12.0",
          "Type": "SPARK"
      }

    • Click the Wait for task to complete – optional check box to wait for EMR Serverless Application creation state to complete before executing the next state.
    • Under Next state, select the Add new state option from the drop-down.
  5. Drag the EMR Serverless StartJobRun state from the left panel onto the next state in the workflow.
    • Rename State name to Submit PySpark Job
    • Provide the following values in the API parameters and click Wait for task to complete – optional (make sure to replace <<ACCOUNT-ID>> with your AWS Account ID).
{
"ApplicationId.$": "$.ApplicationId",
    "ExecutionRoleArn": "arn:aws:iam::<<ACCOUNT-ID>>:role/EMR-Serverless-Role-<<ACCOUNT-ID>>",
    "JobDriver": {
        "SparkSubmit": {
            "EntryPoint": "s3://serverless-<<ACCOUNT-ID>>-blog/scripts/bikeaggregator.py",
            "EntryPointArguments": [
                "s3://serverless-<<ACCOUNT-ID>>-blog/data/",
                "s3://serverless-<<ACCOUNT-ID>>-blog/serverlessout/"
            ],
            "SparkSubmitParameters": "--conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
        }
    }
}

  6. Select the Config tab for the state machine from the top and change the following configurations:
    • Change the State machine name, found under Details, to EMRServerless-BikeAggr.
    • In the Permissions section, select StateMachine-Role-<<ACCOUNT-ID>> from the dropdown for Execution role. (Make sure that you replace <<ACCOUNT-ID>> with your AWS Account ID).
  7. Continue to add steps for Check Job Success from the studio as shown in the following diagram.

  8. Click Create to create the Step Functions state machine for orchestrating the EMR Serverless jobs.

Step 2: Invoke the Step Functions state machine

Now that the state machine is created, we can invoke it by choosing the Start execution button on the Step Functions console.
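
If you prefer to start the workflow from code rather than the console, a minimal boto3 sketch such as the following also works; the state machine ARN shown is a placeholder:

import json
import boto3

sfn = boto3.client("stepfunctions")

# The state machine ARN is a placeholder -- copy yours from the Step Functions console
response = sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:111122223333:stateMachine:EMRServerless-BikeAggr",
    input=json.dumps({}),
)
print(response["executionArn"])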

When the state machine is invoked, it presents its run flow as shown in the following screenshot. Because we selected the Wait for task to complete configuration (.sync API) for this step, the next step does not start until the EMR Serverless application is created (blue represents the Amazon EMR Serverless application being created).

After successfully creating the EMR Serverless Application, we submit a PySpark Job to that Application.

When the EMR Serverless job completes, the Submit PySpark Job step changes to green. This is because we have selected the Wait for task to complete configuration (using the .sync API) for this step.

You can find the EMR Serverless application ID as well as the PySpark job run ID on the Output tab of the Submit PySpark Job step.

Step 3: Validation

To confirm the successful completion of the job, navigate to the EMR Serverless console and find the EMR Serverless application ID. Click the application ID to find the execution details for the PySpark job run submitted from Step Functions.

To verify the output of the job execution, you can check the S3 bucket where the output will be stored in a .csv file as shown in the following graphic.
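
If you would rather verify the output from a script than browse the console, a short boto3 sketch like the following lists the generated files; the bucket name is a placeholder for the bucket created by the CloudFormation stack:

import boto3

s3 = boto3.client("s3")

# Replace with the bucket created by the CloudFormation stack
bucket = "serverless-111122223333-blog"

# List the aggregated output written by the PySpark job
response = s3.list_objects_v2(Bucket=bucket, Prefix="serverlessout/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])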

Cleanup

Log in to the AWS Management Console and delete any S3 buckets created by this deployment to avoid unwanted charges to your AWS account. For example: s3://serverless-<<ACCOUNT-ID>>-blog/

Then, to clean up your environment, delete the CloudFormation stack that you created in the solution configuration steps.

Finally, delete the Step Functions state machine you created as part of this solution.

Conclusion

In this post, we explained how to launch an Amazon EMR Serverless Spark job with Step Functions using Workflow Studio to implement a simple ETL pipeline that creates aggregated output from the Citi Bike dataset and generates reports.

We hope this gives you a great starting point for using this solution with your datasets and applying more complex business rules to solve your transient cluster use cases.

Do you have follow-up questions or feedback? Leave a comment. We’d love to hear your thoughts and suggestions.


About the Authors

Naveen Balaraman is a Sr Cloud Application Architect at Amazon Web Services. He is passionate about Containers, Serverless, Architecting Microservices and helping customers leverage the power of AWS cloud.

Karthik Prabhakar is a Senior Big Data Solutions Architect for Amazon EMR at AWS. He is an experienced analytics engineer working with AWS customers to provide best practices and technical advice in order to assist their success in their data journey.

Parul Saxena is a Big Data Specialist Solutions Architect at Amazon Web Services, focused on Amazon EMR, Amazon Athena, AWS Glue and AWS Lake Formation, where she provides architectural guidance to customers for running complex big data workloads over AWS platform. In her spare time, she enjoys traveling and spending time with her family and friends.

Now available: Building a scalable vulnerability management program on AWS

Post Syndicated from Anna McAbee original https://aws.amazon.com/blogs/security/now-available-how-to-build-a-scalable-vulnerability-management-program-on-aws/

Vulnerability findings in a cloud environment can come from a variety of tools and scans depending on the underlying technology you’re using. Without processes in place to handle these findings, they can begin to mount, often leading to thousands to tens of thousands of findings in a short amount of time. We’re excited to announce the Building a scalable vulnerability management program on AWS guide, which includes how you can build a structured vulnerability management program, operationalize tooling, and scale your processes to handle a large number of findings from diverse sources.

Building a scalable vulnerability management program on AWS focuses on the fundamentals of building a cloud vulnerability management program, including traditional software and network vulnerabilities and cloud configuration risks. The guide covers how to build a successful and scalable vulnerability management program on AWS through preparation, enabling and configuring tools, triaging findings, and reporting.

Targeted outcomes

This guide can help you and your organization with the following:

  • Develop policies to streamline vulnerability management and maintain accountability.
  • Establish mechanisms to extend the responsibility of security to your application teams.
  • Configure relevant AWS services according to best practices for scalable vulnerability management.
  • Identify patterns for routing security findings to support a shared responsibility model.
  • Establish mechanisms to report on and iterate on your vulnerability management program.
  • Improve security finding visibility and help improve overall security posture.

Using the new guide

We encourage you to read the entire guide before taking action or building a list of changes to implement. After you read the guide, assess your current state compared to the action items and check off the items that you’ve already completed in the Next steps table. This will help you assess the current state of your AWS vulnerability management program. Then, plan short-term and long-term roadmaps based on your gaps, desired state, resources, and business needs. Building a cloud vulnerability management program often involves iteration, so you should prioritize key items and regularly revisit your backlog to keep up with technology changes and your business requirements.

Further information

For more information and to get started, see the Building a scalable vulnerability management program on AWS guide.

We greatly value feedback and contributions from our community. To share your thoughts and insights about the guide, your experience using it, and what you want to see in future versions, select Provide feedback at the bottom of any page in the guide and complete the form.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Author

Anna McAbee

Anna is a Security Specialist Solutions Architect focused on threat detection and incident response at AWS. Before AWS, she worked as an AWS customer in financial services on both the offensive and defensive sides of security. Outside of work, Anna enjoys cheering on the Florida Gators football team, wine tasting, and traveling the world.

Author

Megan O’Neil

Megan is a Principal Security Specialist Solutions Architect focused on Threat Detection and Incident Response. Megan and her team enable AWS customers to implement sophisticated, scalable, and secure solutions that solve their business challenges. Outside of work, Megan loves to explore Colorado, including mountain biking, skiing, and hiking.

Modernize a legacy real-time analytics application with Amazon Managed Service for Apache Flink

Post Syndicated from Bhupesh Sharma original https://aws.amazon.com/blogs/big-data/modernize-a-legacy-real-time-analytics-application-with-amazon-managed-service-for-apache-flink/

Organizations with legacy, on-premises, near-real-time analytics solutions typically rely on self-managed relational databases as their data store for analytics workloads. To reap the benefits of cloud computing, like increased agility and just-in-time provisioning of resources, organizations are migrating their legacy analytics applications to AWS. The lift and shift migration approach is limited in its ability to transform businesses because it relies on outdated, legacy technologies and architectures that limit flexibility and slow down productivity. In this post, we discuss ways to modernize your legacy, on-premises, real-time analytics architecture to build serverless data analytics solutions on AWS using Amazon Managed Service for Apache Flink.

Near-real-time streaming analytics captures the value of operational data and metrics to provide new insights to create business opportunities. In this post, we discuss challenges with relational databases when used for real-time analytics and ways to mitigate them by modernizing the architecture with serverless AWS solutions. We introduce you to Amazon Managed Service for Apache Flink Studio and get started querying streaming data interactively using Amazon Kinesis Data Streams.

Solution overview

In this post, we walk through a call center analytics solution that provides insights into the call center’s performance in near-real time through metrics that determine agent efficiency in handling calls in the queue. Key performance indicators (KPIs) of interest for a call center from a near-real-time platform could be calls waiting in the queue, highlighted in a performance dashboard within a few seconds of data ingestion from call center streams. These metrics help agents improve their call handle time and also reallocate agents across organizations to handle pending calls in the queue.

Traditionally, such a legacy call center analytics platform would be built on a relational database that stores data from streaming sources. Data transformations through stored procedures and use of materialized views to curate datasets and generate insights is a known pattern with relational databases. However, as data loses its relevance with time, transformations in a near-real-time analytics platform need only the latest data from the streams to generate insights. This may require frequent truncation in certain tables to retain only the latest stream of events. Also, the need to derive near-real-time insights within seconds requires frequent materialized view refreshes in this traditional relational database approach. Frequent materialized view refreshes on top of constantly changing base tables due to streamed data can lead to snapshot isolation errors. Also, a data model that allows table truncations at a regular frequency (for example, every 15 seconds) to store only relevant data in tables can cause locking and performance issues.

The following diagram provides the high-level architecture of a legacy call center analytics platform. In this traditional architecture, a relational database is used to store data from streaming data sources. Datasets used for generating insights are curated using materialized views inside the database and published for business intelligence (BI) reporting.

Modernizing this traditional database-driven architecture in the AWS Cloud allows you to use sophisticated streaming technologies like Amazon Managed Service for Apache Flink, which are built to transform and analyze streaming data in real time. With Amazon Managed Service for Apache Flink, you can gain actionable insights from streaming data with serverless, fully managed Apache Flink. You can use Amazon Managed Service for Apache Flink to quickly build end-to-end stream processing applications and process data continuously, getting insights in seconds or minutes. With Amazon Managed Service for Apache Flink, you can use Apache Flink code or Flink SQL to continuously generate time-series analytics over time windows and perform sophisticated joins across streams.

The following architecture diagram illustrates how a legacy call center analytics platform running on databases can be modernized to run on the AWS Cloud using Amazon Managed Service for Apache Flink. It shows a call center streaming data source that sends the latest call center feed every 15 seconds. The second streaming data source constitutes metadata information about the call center organization and agents that gets refreshed throughout the day. You can perform sophisticated joins over these streaming datasets and create views on top of them using Amazon Managed Service for Apache Flink to generate KPIs required for the business using Amazon OpenSearch Service. You can analyze streaming data interactively using managed Apache Zeppelin notebooks with Amazon Managed Service for Apache Flink Studio in near-real time. The near-real-time insights can then be visualized as a performance dashboard using OpenSearch Dashboards.

In this post, you perform the following high-level implementation steps:

  1. Ingest data from streaming data sources to Kinesis Data Streams.
  2. Use managed Apache Zeppelin notebooks with Amazon Managed Service for Apache Flink Studio to transform the stream data within seconds of data ingestion.
  3. Visualize KPIs of call center performance in near-real time through OpenSearch Dashboards.

Prerequisites

This post requires you to set up the Amazon Kinesis Data Generator (KDG) to send data to a Kinesis data stream using an AWS CloudFormation template. For the template and setup information, refer to Test Your Streaming Data Solution with the New Amazon Kinesis Data Generator.

We use two datasets in this post. The first dataset is fact data, which contains call center organization data. The KDG generates a fact data feed every 15 seconds that contains the following information:

  • AgentId – Agents work in a call center setting surrounded by other call center employees answering customers’ questions and referring them to the necessary resources to solve their problems.
  • OrgId – A call center contains different organizations and departments, such as Care Hub, IT Hub, Allied Hub, Prem Hub, Help Hub, and more.
  • QueueId – Call queues provide an effective way to route calls using simple or sophisticated patterns to ensure that all calls are getting into the correct hands quickly.
  • WorkMode – An agent work mode determines your current state and availability to receive incoming calls from the Automatic Call Distribution (ACD) and Direct Agent Call (DAC) queues. Call Center Elite does not route ACD and DAC calls to your phone when you are in an Aux mode or ACW mode.
  • WorkSkill – Working as a call center agent requires several soft skills to see the best results, like problem-solving, bilingualism, channel experience, aptitude with data, and more.
  • HandleTime – This customer service metric measures the length of a customer’s call.
  • ServiceLevel – The call center service level is defined as the percentage of calls answered within a predefined amount of time—the target time threshold. It can be measured over any period of time (such as 30 minutes, 1 hour, 1 day, or 1 week) and for each agent, team, department, or the company as a whole.
  • WorkStates – This specifies what state an agent is in. For example, an agent in an available state is available to handle calls from an ACD queue. An agent can have several states with respect to different ACD devices, or they can use a single state to describe their relationship to all ACD devices. Agent states are reported in agent-state events.
  • WaitingTime – This is the average time an inbound call spends waiting in the queue or waiting for a callback if that feature is active in your IVR system.
  • EventTime – This is the time when the call center stream is sent (via the KDG in this post).

The following fact payload is used in the KDG to generate sample fact data:

{
"AgentId" : {{random.number(
{
"min":2001,
"max":2005
}
)}},
"OrgID" : {{random.number(
{
"min":101,
"max":105
}
)}},
"QueueId" : {{random.number(
{
"min":1,
"max":5
}
)}},
"WorkMode" : "{{random.arrayElement(
["TACW","ACW","AUX"]
)}}",
"WorkSkill": "{{random.arrayElement(
["Problem Solving","Bilingualism","Channel experience","Aptitude with data"]
)}}",
"HandleTime": {{random.number(
{
"min":1,
"max":150
}
)}},
"ServiceLevel":"{{random.arrayElement(
["Sev1","Sev2","Sev3"]
)}}",
"WorkSates": "{{random.arrayElement(
["Unavailable","Available","On a call"]
)}}",
"WaitingTime": {{random.number(
{
"min":10,
"max":150
}
)}},
"EventTime":"{{date.utc("YYYY-MM-DDTHH:mm:ss")}}"
}

The following screenshot shows the output of the sample fact data in an Amazon Managed Service for Apache Flink notebook.

The second dataset is dimension data. This data contains metadata information like organization names for their respective organization IDs, agent names, and more. The frequency of the dimension dataset is twice a day, whereas the fact dataset gets loaded every 15 seconds. In this post, we use Amazon Simple Storage Service (Amazon S3) as a data storage layer to store metadata information (Amazon DynamoDB can be used to store metadata information as well). We use AWS Lambda to load metadata from Amazon S3 to another Kinesis data stream that stores metadata information. The following JSON file stored in Amazon S3 has metadata mappings to be loaded into the Kinesis data stream:

[{"OrgID": 101,"OrgName" : "Care Hub","Region" : "APAC"},
{"OrgID" : 102,"OrgName" : "Prem Hub","Region" : "AMER"},
{"OrgID" : 103,"OrgName" : "IT Hub","Region" : "EMEA"},
{"OrgID" : 104,"OrgName" : "Help Hub","Region" : "EMEA"},
{"OrgID" : 105,"OrgName" : "Allied Hub","Region" : "LATAM"}]

Ingest data from streaming data sources to Kinesis Data Streams

To start ingesting your data, complete the following steps:

  1. Create two Kinesis data streams for the fact and dimension datasets, as shown in the following screenshot. For instructions, refer to Creating a Stream via the AWS Management Console.

  2. Create a Lambda function on the Lambda console to load metadata files from Amazon S3 to Kinesis Data Streams. Use the following code:
    import boto3
    import json

    # Create Amazon S3 and Kinesis clients
    s3_client = boto3.client("s3")
    S3_BUCKET = '<S3 Bucket Name>'
    kinesis_client = boto3.client("kinesis")
    stream_name = '<Kinesis Stream Name>'

    def lambda_handler(event, context):

      # Read the metadata file from Amazon S3
      object_key = "<S3 File Name.json>"
      file_content = s3_client.get_object(
          Bucket=S3_BUCKET, Key=object_key)["Body"].read()

      # Decode the S3 object and parse it into a list of metadata records
      decoded_data = file_content.decode("utf-8").replace("'", '"')
      records = json.loads(decoded_data)

      # Upload each metadata record to the Kinesis data stream,
      # using the record's OrgID as the partition key
      for record in records:
        kinesis_client.put_record(
            StreamName=stream_name,
            Data=json.dumps(record),
            PartitionKey=str(record["OrgID"]))

Use managed Apache Zeppelin notebooks with Amazon Managed Service for Apache Flink Studio to transform the streaming data

The next step is to create tables in Amazon Managed Service for Apache Flink Studio for further transformations (joins, aggregations, and so on). To set up and query Kinesis Data Streams using Amazon Managed Service for Apache Flink Studio, refer to Query your data streams interactively using Amazon Managed Service for Apache Flink Studio and Python and create an Amazon Managed Service for Apache Flink notebook. Then complete the following steps:

  1. In the Amazon Managed Service for Apache Flink Studio notebook, create a fact table from the facts data stream you created earlier, using the following query.

The event time attribute is defined using a WATERMARK statement in the CREATE table DDL. A WATERMARK statement defines a watermark generation expression on an existing event time field, which marks the event time field as the event time attribute.

The event time refers to the processing of streaming data based on timestamps that are attached to each row. The timestamps can encode when an event happened. Processing time (PROCTime) refers to the machine’s system time that is running the respective operation.

%flink.ssql
CREATE TABLE <Fact Table Name> (
AgentId INT,
OrgID INT,
QueueId BIGINT,
WorkMode VARCHAR,
WorkSkill VARCHAR,
HandleTime INT,
ServiceLevel VARCHAR,
WorkSates VARCHAR,
WaitingTime INT,
EventTime TIMESTAMP(3),
WATERMARK FOR EventTime AS EventTime - INTERVAL '4' SECOND
)
WITH (
'connector' = 'kinesis',
'stream' = '<fact stream name>',
'aws.region' = '<AWS region ex. us-east-1>',
'scan.stream.initpos' = 'LATEST',
'format' = 'json',
'json.timestamp-format.standard' = 'ISO-8601'
);

  2. Create a dimension table in the Amazon Managed Service for Apache Flink Studio notebook that uses the metadata Kinesis data stream:
    %flink.ssql
    CREATE TABLE <Metadata Table Name> (
    AgentId INT,
    AgentName VARCHAR,
    OrgID INT,
    OrgName VARCHAR,
    update_time as CURRENT_TIMESTAMP,
    WATERMARK FOR update_time AS update_time
    )
    WITH (
    'connector' = 'kinesis',
    'stream' = '<Metadata Stream Name>',
    'aws.region' = '<AWS region ex. us-east-1> ',
    'scan.stream.initpos' = 'LATEST',
    'format' = 'json',
    'json.timestamp-format.standard' = 'ISO-8601'
    );

  3. Create a versioned view to extract the latest version of metadata table values to be joined with the facts table:
    %flink.ssql(type=update)
    CREATE VIEW versioned_metadata AS 
    SELECT OrgID,OrgName
      FROM (
          SELECT *,
          ROW_NUMBER() OVER (PARTITION BY OrgID
             ORDER BY update_time DESC) AS rownum 
          FROM <Metadata Table>)
    WHERE rownum = 1;

  4. Join the facts and versioned metadata tables on OrgID and create a view that provides the total calls in the queue for each organization over a span of 5 seconds, for further reporting. Additionally, create a tumble window of 5 seconds to receive the final output every 5 seconds. See the following code:
    %flink.ssql(type=update)
    
    CREATE VIEW joined_view AS
    SELECT streamtable.window_start, streamtable.window_end,metadata.OrgName,streamtable.CallsInQueue
    FROM
    (SELECT window_start, window_end, OrgID, count(QueueId) as CallsInQueue
      FROM TABLE(
        TUMBLE( TABLE <Fact Table Name>, DESCRIPTOR(EventTime), INTERVAL '5' SECOND))
      GROUP BY window_start, window_end, OrgID) as streamtable
    JOIN
        versioned_metadata metadata
        ON metadata.OrgID = streamtable.OrgID

  5. Now you can run the following query from the view you created and see the results in the notebook:
%flink.ssql(type=update)

SELECT jv.window_end, jv.CallsInQueue, jv.window_start, jv.OrgName
FROM joined_view jv WHERE jv.OrgName = 'Prem Hub';

Visualize KPIs of call center performance in near-real time through OpenSearch Dashboards

You can publish the metrics generated within Amazon Managed Service for Apache Flink Studio to OpenSearch Service and visualize metrics in near-real time by creating a call center performance dashboard, as shown in the following example. Refer to Stream the data and validate output to configure OpenSearch Dashboards with Amazon Managed Service for Apache Flink. After you configure the connector, you can run the following command from the notebook to create an index in an OpenSearch Service cluster.

%flink.ssql(type=update)

drop table if exists active_call_queue;
CREATE TABLE active_call_queue (
window_start TIMESTAMP,
window_end TIMESTAMP,
OrgID int,
OrgName varchar,
CallsInQueue BIGINT,
Agent_cnt bigint,
max_handle_time bigint,
min_handle_time bigint,
max_wait_time bigint,
min_wait_time bigint
) WITH (
'connector' = 'elasticsearch-7',
'hosts' = '<Amazon OpenSearch host name>',
'index' = 'active_call_queue',
'username' = '<username>',
'password' = '<password>'
);

The active_call_queue index is created in OpenSearch Service. The following screenshot shows an index pattern from OpenSearch Dashboards.

Now you can create visualizations in OpenSearch Dashboards. The following screenshot shows an example.

Conclusion

In this post, we discussed ways to modernize a legacy, on-premises, real-time analytics architecture and build a serverless data analytics solution on AWS using Amazon Managed Service for Apache Flink. We also discussed challenges with relational databases when used for real-time analytics and ways to mitigate them by modernizing the architecture with serverless AWS solutions.

If you have any questions or suggestions, please leave us a comment.


About the Authors

Bhupesh Sharma is a Senior Data Engineer with AWS. His role is helping customers architect highly-available, high-performance, and cost-effective data analytics solutions to empower customers with data-driven decision-making. In his free time, he enjoys playing musical instruments, road biking, and swimming.

Devika Singh is a Senior Data Engineer at Amazon, with deep understanding of AWS services, architecture, and cloud-based best practices. With her expertise in data analytics and AWS cloud migrations, she provides technical guidance to the community, on architecting and building a cost-effective, secure and scalable solution, that is driven by business needs. Beyond her professional pursuits, she is passionate about classical music and travel.

Using Generative AI, Amazon Bedrock and Amazon CodeGuru to Improve Code Quality and Security

Post Syndicated from Marcilio Mendonca original https://aws.amazon.com/blogs/devops/using-generative-ai-amazon-bedrock-and-amazon-codeguru-to-improve-code-quality-and-security/

Automated code analysis plays a key role in improving code quality and compliance. Amazon CodeGuru Reviewer provides automated recommendations that can assist developers in identifying defects and deviation from coding best practices. For instance, CodeGuru Security automatically flags potential security vulnerabilities such as SQL injection, hardcoded AWS credentials and cross-site request forgery, to name a few. After becoming aware of these findings, developers can take decisive action to remediate their code.

On the other hand, determining what the best course of action is to address a particular automated recommendation might not always be obvious. For instance, an apprentice developer may not fully grasp what a SQL injection attack means or what makes the code at hand particularly vulnerable. In another situation, the developer reviewing a CodeGuru recommendation might not be the same developer who wrote the initial code. In these cases, the developer will first need to get familiarized with the code and the recommendation in order to take proper corrective action.

By using Generative AI, developers can leverage pre-trained foundation models to gain insights on their code’s structure, the CodeGuru Reviewer recommendation and the potential corrective actions. For example, Generative AI models can generate text content, e.g., to explain a technical concept such as SQL injection attacks or the correct use of a given library. Once the recommendation is well understood, the Generative AI model can be used to refactor the original code so that it complies with the recommendation. The possibilities opened up by Generative AI are numerous when it comes to improving code quality and security.

In this post, we will show how you can use CodeGuru Reviewer and Bedrock to improve the quality and security of your code. While CodeGuru Reviewer can provide automated code analysis and recommendations, Bedrock offers a low-friction environment that enables you to gain insights on the CodeGuru recommendations and to find creative ways to remediate your code.

Solution Overview

The diagram below depicts our approach and the AWS services involved. It works as follows:

1. The developer pushes code to an AWS CodeCommit repository.
2. The repository is associated with CodeGuru Reviewer, so an automated code review is initiated.
3. Upon completion, the CodeGuru Reviewer console displays a list of recommendations for the code base, if applicable.
4. Once aware of the recommendation and the affected code, the developer navigates to the Bedrock console, chooses a foundation model and builds a prompt (we will give examples of prompts in the next section).
5. Bedrock generates content as a response to the prompt, including code generation.
6. The developer might optionally refine the prompt, for example, to gain further insights on the CodeGuru Reviewer recommendation or to request for alternatives to remediate the code.
7. The model can respond with generated code that addresses the issue which can then be pushed back into the repository.

CodeCommit, CodeGuru and Bedrock used together

Note that we use CodeCommit in our walkthrough but readers can use any Git sources supported by CodeGuru Reviewer.

Using Generative AI to Improve Code Quality and Security

Next, we’re going to walk you through a scenario where a developer needs to improve the quality of her code after CodeGuru Reviewer has provided recommendations. But before getting there, let’s choose a code repository and set the Bedrock inference parameters.

A good reference of source repository for exploring CodeGuru Reviewer recommendations is the Amazon CodeGuru Reviewer Python Detector repository. The repository contains a comprehensive list of compliant and non-compliant code which fits well in the context of our discussion.

In terms of the Bedrock model, we use Anthropic Claude V1 (v1.3) in our analysis, which is specialized in content generation, including text and code. We set the required model parameters as follows: temperature=0.5, top_p=0.9, top_k=500, max_tokens=2048. We set the temperature and top_p parameters so as to give the model a bit more flexibility to generate responses for the same question. Please check the inference parameter definitions in Bedrock’s user guide for further details on these parameters. Given the randomness level specified by our inference parameters, readers experimenting with the prompts provided in this post might observe slightly different answers than the ones presented.
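
We use the Bedrock text playground in the console throughout this walkthrough, but the same inference parameters can be supplied programmatically. The following is a minimal boto3 sketch, assuming the legacy Claude text-completion request format (where the maximum-tokens setting is named max_tokens_to_sample); the prompt text is only a placeholder:

import json
import boto3

bedrock = boto3.client("bedrock-runtime")

# Claude v1 text-completion request body with the inference parameters used in this post
body = {
    "prompt": "\n\nHuman: Explain what a SQL injection attack is.\n\nAssistant:",
    "temperature": 0.5,
    "top_p": 0.9,
    "top_k": 500,
    "max_tokens_to_sample": 2048,
}

response = bedrock.invoke_model(
    modelId="anthropic.claude-v1",
    contentType="application/json",
    accept="application/json",
    body=json.dumps(body),
)

print(json.loads(response["body"].read())["completion"])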

Requirements

  • An AWS account with access to CodeCommit, CodeGuru and Bedrock
  • Bedrock access enabled in the account. On-demand access should be fine (check pricing here).
  • Download and install the AWS CLI and Git (to push code to CodeCommit)

Walkthrough

Follow the steps below to run CodeGuru Reviewer analysis on a repository and to build and run Bedrock prompts.

  • Clone the repository from GitHub to your local workstation
git clone https://github.com/aws-samples/amazon-codeguru-reviewer-python-detectors.git
  • Create a CodeCommit repository and add a new Git remote
aws codecommit create-repository --repository-name amazon-codeguru-reviewer-python-detectors

cd amazon-codeguru-reviewer-python-detectors/

git remote add codecommit https://git-codecommit.us-east-1.amazonaws.com/v1/repos/amazon-codeguru-reviewer-python-detectors
  • Associate CodeGuru Reviewer with the repository to enable repository analysis
aws codeguru-reviewer associate-repository --repository 'CodeCommit={Name=amazon-codeguru-reviewer-python-detectors}'

Save the association ARN value returned after the command is executed (e.g., arn:aws:codeguru-reviewer:xx-xxxx-x:111111111111:association:e85aa20c-41d76-03b-f788-cefd0d2a3590).

  • Push code to the CodeCommit repository using the codecommit git remote
git push codecommit main:main
  • Trigger CodeGuru Reviewer to run a repository analysis on the repository’s main branch. Use the repository association ARN you noted in a previous step here.
aws codeguru-reviewer create-code-review \
 --name codereview001 \
 --type '{"RepositoryAnalysis": {"RepositoryHead": {"BranchName": "main"}}}' \
 --repository-association-arn arn:aws:codeguru-reviewer:xx-xxxx-x:111111111111:association:e85aa20c-41d76-03b-f788-cefd0d2a3590

Navigate to the CodeGuru Reviewer Console to see the various recommendations provided (you might have to wait a few minutes for the code analysis to run).
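
If you prefer to check the analysis from a script rather than the console, a short boto3 sketch like the following polls the review status and prints the findings; the code review ARN placeholder is the ARN returned by the create-code-review command:

import boto3

codeguru = boto3.client("codeguru-reviewer")

# Placeholder: use the CodeReviewArn returned by the create-code-review command
code_review_arn = "<code-review-arn>"

review = codeguru.describe_code_review(CodeReviewArn=code_review_arn)["CodeReview"]
print("Code review state:", review["State"])  # Pending, Completed, Failed, or Deleting

if review["State"] == "Completed":
    summaries = codeguru.list_recommendations(CodeReviewArn=code_review_arn)["RecommendationSummaries"]
    for rec in summaries:
        print(rec["FilePath"], rec["StartLine"], rec["Description"][:80])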

Amazon CodeGuru Reviewer

  • On the CodeGuru Reviewer console (see screenshot above), we select the first recommendation on file hashlib_contructor.py, line 12, and take note of the recommendation content: The constructors for the hashlib module are faster than new(). We recommend using hashlib.sha256() instead.
  • Now let’s extract the affected code. Click on the file name link (hashlib_contructor.py in the figure above) to open the corresponding code in the CodeCommit console.
AWS CodeCommit Repository

  • The blue arrow in the CodeCommit console above indicates the non-compliant code, highlighting the specific line (line 12). We select the wrapping Python function from lines 5 through 15 to build our prompt. You may want to experiment with reducing the scope to a single line or a given block of lines and check whether it yields better responses.
Amazon Bedrock Playground Console

  • We then navigate to the Bedrock console (see screenshot above).
    • Search for keyword Bedrock in the AWS console
    • Select the Bedrock service to navigate to the service console
    • Choose Playgrounds, then choose Text
    • Choose model Anthropic Claude V1 (1.3). If you don’t see this model available, please make sure to enable model access.
  • Set the Inference configuration as shown in the screenshot below including temperature, Top P and the other parameters. Please check the inference parameter definitions on Bedrock’s user guide for further details on these parameters.
  • Build a Bedrock prompt using three elements, as illustrated in the screenshot below:
    • The source code copied from CodeCommit
    • The CodeGuru Reviewer recommendation
    • A request to refactor the code to address the code analysis finding
A Prompt in the Amazon Bedrock Playground Console

  • Press the Run button. Notice that Bedrock will automatically add the words Human (at the top) and Assistant (at the bottom) to the prompt.  Wait a few seconds and a response is generated (in green). The response includes the refactored code and an explanation on how the code was fixed (see screenshot below).
A Prompt Response (or completion) in the Amazon Bedrock Playground Console

Note that the original code was refactored to use hashlib.sha256() instead of calling new on the constructor: hashlib.new('sha256', …). Note that the prompt also asks for an explanation of how the refactored code fixes the issue, so the response includes such details. If we were interested in the refactored code only, we could change the prompt and ask that it return only the refactored code.
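
As a quick sanity check outside of Bedrock, you can confirm in a Python shell that the two constructor forms produce identical digests for the same input (a standalone illustration, not code from the sample repository):

import hashlib

data = b"example input"

# hashlib.new() with a name string and the named constructor produce identical digests;
# the named constructor is simply the faster, recommended form.
assert hashlib.new("sha256", data).hexdigest() == hashlib.sha256(data).hexdigest()
print(hashlib.sha256(data).hexdigest())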

Gaining Insights and Improving Code Quality

So far, our walkthrough showed how to navigate through the CodeCommit, CodeGuru Reviewer and Bedrock consoles to gather relevant content that can be used to build and run a Bedrock prompt. We are now going to explore further prompts to gain insights on the code and the recommendations in order to build more specific prompt to refactor the code. We omit screenshots for brevity. Readers can follow the same process outlined in the previous session.

Learning about the code

Let’s now examine another CodeGuru Reviewer recommendation. Specifically, we want to find the SQL Injection vulnerability raised for file, line 12, as shown below.

Sample non-compliant code

CodeGuru Reviewer provided the following recommendation:

CodeGuru Reviewer recommendation

For the sake of the walkthrough, let’s assume the developer reviewing the code above was not the original developer who built the code. We start by asking Bedrock to explain the code, line by line. Note that we have not included the CodeGuru Reviewer recommendation in the prompt this time as our initial goal is to understand the code.

Prompt:

Source Code:
def execute_query_noncompliant(request):
    import sqlite3
    name = request.GET.get("name")
    query = "SELECT * FROM Users WHERE name = " + name + ";" 
    with sqlite3.connect("example.db") as connection:
        cursor = connection.cursor()
        # Noncompliant: user input is used without sanitization.
        cursor.execute(query)
        connection.commit()
        connection.close()
Request:
Explain the code logic above line by line.

Response:

Sure, here is a line by line explanation of the code:
def execute_query_noncompliant(request):
This defines a function called execute_query_noncompliant that takes a request as an argument.
import sqlite3
This imports the sqlite3 library to connect to a SQLite database.
name = request.GET.get("name")
This gets a "name" query parameter from the request GET parameters.
query = "SELECT * FROM Users WHERE name = " + name + ";"
This constructs an SQL query string by concatenating the name variable into the query. 
So if name was "John", the query would be:
SELECT * FROM Users WHERE name = John;
with sqlite3.connect("example.db") as connection:
This opens a connection to the "example.db" SQLite database.
   cursor = connection.cursor()
This gets a cursor to execute queries on the connection.
    # Noncompliant: user input is used without sanitization.
    cursor.execute(query)
This executes the query on the database. The comment notes that this is noncompliant 
because it uses unsanitized user input in the query.
    connection.commit()
    connection.close()

Bedrock provides a line-by-line explanation of the code, including the library being used (sqlite3) and a comment about the code using unsanitized user input. The explanation is detailed, helping the developer understand the code logic and structure.

The developer now has a good understanding of the code. Let’s now gain insights on why and how the code is vulnerable to SQL Injection.

Prompt:

Source Code:
def execute_query_noncompliant(request):
    import sqlite3
    name = request.GET.get("name")
    query = "SELECT * FROM Users WHERE name = " + name + ";" 
    with sqlite3.connect("example.db") as connection:
        cursor = connection.cursor()
        # Noncompliant: user input is used without sanitization.
        cursor.execute(query)
        connection.commit()
        connection.close()
Request:
Why is the code above vulnerable to SQL injection?

Response:

The code is vulnerable to SQL injection because it directly concatenates 
user input (the name parameter) into an SQL query string. This allows a 
malicious user to inject SQL code into the query.
For example, if the name parameter is passed as:

'; DROP TABLE Users; -- 
Then the resulting SQL query would be:

SELECT * FROM Users WHERE name = ''; DROP TABLE Users; -- ;
This would drop the Users table, causing damage to the database.
To fix this, the user input should be sanitized before being added to the 
SQL query string. This can be done by:

Escaping single quotes in the input:
name = request.GET.get("name").replace("'", "''")
query = f"SELECT * FROM Users WHERE name = '{name}';"

Using parameterized queries:
query = "SELECT * FROM Users WHERE name = ?"
cursor.execute(query, (name,))

Again, the response is detailed. It explains that the user input is being concatenated directly into a query string. It gives an example of a SQL injection attack that could cause the Users table to be dropped. It also provides two options to fix the code including escaping single quotes and using parameterized queries.

We continue to dive deep into SQL injection. Let’s build a prompt to gather further examples of possible SQL injection attacks.

Prompt:

Source Code:
def execute_query_noncompliant(request):
    import sqlite3
    name = request.GET.get("name")
    query = "SELECT * FROM Users WHERE name = " + name + ";" 
    with sqlite3.connect("example.db") as connection:
        cursor = connection.cursor()
        # Noncompliant: user input is used without sanitization.
        cursor.execute(query)
        connection.commit()
        connection.close()
Request:
Provide examples of SQL injection for the code above

Response:

Here are some examples of SQL injection for that code:

If the name parameter is passed as:
'; DROP TABLE Users; -- 

Then the executed query would be:
SELECT * FROM Users WHERE name = ''; DROP TABLE Users; -- ;
This would drop the Users table.

If the name parameter is passed as:
OR 1=1; -- 

Then the executed query would be:
SELECT * FROM Users WHERE name = OR 1=1; -- ;
This would select all rows from the Users table.

If the name parameter is passed as:
'; UPDATE Users SET password = 'hacked' WHERE 1=1; -- 

Then the executed query would be:
SELECT * FROM Users WHERE name = ''; UPDATE Users SET password = 'hacked' WHERE 1=1; -- ;

This would update all passwords in the Users table to 'hacked'.
So in summary, without sanitizing the input, this code is vulnerable to SQL injection attacks 
that can delete data, access unauthorized data, modify data, etc. The input should be sanitized 
by escaping special characters before using in the SQL query.

Bedrock responds by including two new forms of SQL injection attacks, one that lists all records from the Users table and another one that updates all passwords in the table.

At this point, the developer has learned a lot about the code, SQL injection attacks and the CodeGuru Reviewer recommendation. It’s now time to build a prompt to fix the code. We leverage what we learned in previous prompts including the term “parameterized queries”, as suggested by CodeGuru Reviewer, to build a specific request in the prompt. Note that we have also included the CodeGuru Reviewer recommendation in the prompt this time.

Prompt:

Source Code:
def execute_query_noncompliant(request):
    import sqlite3
    name = request.GET.get("name")
    query = "SELECT * FROM Users WHERE name = " + name + ";" 
    with sqlite3.connect("example.db") as connection:
        cursor = connection.cursor()
        # Noncompliant: user input is used without sanitization.
        cursor.execute(query)
        connection.commit()
        connection.close()

Code Analysis Finding: 
It seems that you are constructing an SQL query using formatted strings. This could 
make your code vulnerable to SQL injection attacks, which can lead to loss of critical 
data. We recommend that you use parameterized queries for constructing SQL queries.  
[Learn more](https://cwe.mitre.org/data/definitions/89.html)
Request:
Fix the code above using parameterized query.

Response:

Here is the fixed code using parameterized query:

def execute_query_compliant(request):
   import sqlite3
   name = request.GET.get("name")
   query = "SELECT * FROM Users WHERE name = ?" 
   with sqlite3.connect("example.db") as connection:
      cursor = connection.cursor()
      # Compliant: Parameterized query prevents SQL injection
      cursor.execute(query, (name,)) 
      connection.commit()
      connection.close()

Instead of concatenating the user input directly into the SQL string, we use a ? 
placeholder and pass the user input as a tuple argument to cursor.execute(). This 
prevents SQL injection since the input is treated as a literal value rather than 
part of the SQL syntax.

The response includes a refactored code that uses a parameterized query to prevent SQL injection attacks (cursor.execute(query, (name,)).


We’re excited about the improvements in the code and now decide to go beyond fixing the SQL injection vulnerability. The next prompt asks Bedrock to make two specific improvements to the refactored code:

  • Add python type hints
  • Add exception handling

Prompt:

Source Code:
def execute_query_compliant(request):
   import sqlite3
   name = request.GET.get("name")
   query = "SELECT * FROM Users WHERE name = ?" 
   with sqlite3.connect("example.db") as connection:
      cursor = connection.cursor()
      # Compliant: Parameterized query prevents SQL injection
      cursor.execute(query, (name,)) 
      connection.commit()
      connection.close()

Request: 
Add exception handling and type hints to the code above.

Response:

Here is the code with exception handling and type hints:

def execute_query_compliant(request: Request) -> None: 
   """Executes a compliant SQL query"""
   import sqlite3
   name = request.GET.get("name")
   query = "SELECT * FROM Users WHERE name = ?" 
   try:
      with sqlite3.connect("example.db") as connection: 
          cursor: sqlite3.Cursor = connection.cursor()
          # Compliant: Parameterized query prevents SQL injection
          cursor.execute(query, (name,)) 
          connection.commit()
   except sqlite3.Error as e: 
      print(f"SQLite error: {e}")
   finally: 
      connection.close()

I added:

Type hints:
request: Request indicating the request parameter is of type Request

cursor: sqlite3.Cursor indicating cursor is of type sqlite3.Cursor
Exception handling:
A try/except block catching sqlite3.Error exceptions. 
A finally block ensuring the connection is always closed. 
Let me know if you would like me to explain anything in the updated code!

Note that the generated code now includes a try/except/finally block to handle exceptions and type hints for the function parameter and return value.

However, some variables such as name and query are missing type hints. Here, we’d like to call out the importance of human supervision of automatically generated responses. The developer tweaks the previous prompt to ask that “all variables” have type hints added to them.

Prompt:

Source Code:
def execute_query_compliant(request):
   import sqlite3
   name = request.GET.get("name")
   query = "SELECT * FROM Users WHERE name = ?" 
   with sqlite3.connect("example.db") as connection:
      cursor = connection.cursor()
      # Compliant: Parameterized query prevents SQL injection
      cursor.execute(query, (name,)) 
      connection.commit()
      connection.close()

Request: 
Add exception handling and type hints to all variables in the code above.

Response:

Here is the code with exception handling and type hints:

def execute_query_compliant(request: Request) -> None: 
   """Executes a compliant SQL query"""
   import sqlite3
   name: str = request.GET.get("name")
   query: str = "SELECT * FROM Users WHERE name = ?" 
   try:
      with sqlite3.connect("example.db") as connection: 
         cursor: sqlite3.Cursor = connection.cursor()
         # Compliant: Parameterized query prevents SQL injection
         cursor.execute(query, (name,)) 
         connection.commit()
   except sqlite3.Error as e: 
        print(f"SQLite error: {e}")
   finally: 
        connection.close()

The new response refactors the code, making sure all variables now have type hints. This shows how small changes to a prompt can produce much better results. In our example, adding “all variables” to the prompt caused the response to indeed add type hints to all variables in the code provided.
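Even this final output benefits from human review: the Request type is referenced without an import, and the finally block would fail with an unbound variable error if sqlite3.connect itself raised before connection was assigned. The following is a minimal, manually reviewed variant; it is our own illustration rather than Bedrock output, and it assumes the same Django-style request object and example.db schema used in the examples above.

import sqlite3
from typing import Any, List, Optional, Tuple


def execute_query_reviewed(request: Any) -> Optional[List[Tuple]]:
   """Runs the parameterized query and returns the matching rows, or None on error."""
   name: str = request.GET.get("name", "")
   query: str = "SELECT * FROM Users WHERE name = ?"
   connection: Optional[sqlite3.Connection] = None
   try:
      connection = sqlite3.connect("example.db")
      cursor: sqlite3.Cursor = connection.cursor()
      # Parameterized query: the user input is bound as a literal value
      cursor.execute(query, (name,))
      rows: List[Tuple] = cursor.fetchall()
      # Commit kept for parity with the generated code; a plain SELECT does not require it
      connection.commit()
      return rows
   except sqlite3.Error as error:
      print(f"SQLite error: {error}")
      return None
   finally:
      # Only close a connection that was actually opened
      if connection is not None:
         connection.close()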

Here is a summary of the activities performed via Bedrock prompting:

  • Gain insights on the code and the CodeGuru recommendation
    • Explain the code logic above line by line.
    • Why is the code above vulnerable to SQL injection?
    • Provide examples of SQL injection for the code above
  • Refactor and Improve the Code
    • Fix the code above using parameterized query
    • Add exception handling and type hints to the code above
    • Add exception handling and type hints to all variables in the code above.

The main takeaway is that by using a static analysis and security testing tool such as CodeGuru Reviewer in combination with a Generative AI service such as Bedrock, developers can significantly improve their code toward best practices and enhanced security. In addition, more specific prompts normally yield better results, and that’s where CodeGuru Reviewer can be really helpful, because it gives developers hints and keywords that can be used to build powerful prompts.

Cleaning Up

Don’t forget to delete the CodeCommit repository created if you no longer need it.

aws codecommit delete-repository --repository-name amazon-codeguru-reviewer-python-detectors

Conclusion and Call to Action

In this blog, we discussed how CodeGuru Reviewer and Bedrock can be used in combination to improve code quality and security. While CodeGuru Reviewer provides a rich set of recommendations through automated code reviews, Bedrock gives developers the ability to gain deeper insights on the code and the recommendations as well as to refactor the original code to meet compliance and best practices.

We encourage readers to explore new Bedrock prompts beyond the ones introduced in this post and share their feedback with us.

Here are some ideas:

For a sample Python repository, we recommend using the Amazon CodeGuru Reviewer Python Detectors repository on GitHub, which is publicly accessible to readers.

For Java developers, there’s an equivalent CodeGuru Reviewer Java Detectors repository available.

Note: At the time of writing this post, Bedrock’s Anthropic Claude 2.0 model was not yet available, so we invite readers to also experiment with the prompts provided here using that model.

Special thanks to my colleagues Raghvender Arni and Mahesh Yadav for support and review of this post.

Marcilio Mendonca

Marcilio Mendonca is a Sr. Solutions Developer in the Prototyping And Customer Engineering (PACE) team at Amazon Web Services. He is passionate about helping customers rethink and reinvent their business through the art of prototyping, primarily in the realm of modern application development, Serverless and AI/ML. Prior to joining AWS, Marcilio was a Software Development Engineer with Amazon. He also holds a PhD in Computer Science. You can find Marcilio on LinkedIn at https://www.linkedin.com/in/marcilio/. Let’s connect!

Blue/Green deployments using AWS CDK Pipelines and AWS CodeDeploy

Post Syndicated from Luiz Decaro original https://aws.amazon.com/blogs/devops/blue-green-deployments-using-aws-cdk-pipelines-and-aws-codedeploy/

Customers often ask for help with implementing Blue/Green deployments to Amazon Elastic Container Service (Amazon ECS) using AWS CodeDeploy. Their use cases usually involve cross-Region and cross-account deployment scenarios. These requirements are challenging enough on their own, but in addition to those, there are specific design decisions that need to be considered when using CodeDeploy. These include how to configure CodeDeploy, when and how to create CodeDeploy resources (such as Application and Deployment Group), and how to write code that can be used to deploy to any combination of account and Region.

Today, I will discuss those design decisions in detail and how to use CDK Pipelines to implement a self-mutating pipeline that deploys services to Amazon ECS in cross-account and cross-Region scenarios. At the end of this blog post, I also introduce a demo application, available in Java, that follows best practices for developing and deploying cloud infrastructure using AWS Cloud Development Kit (AWS CDK).

The Pipeline

CDK Pipelines is an opinionated construct library used for building pipelines with different deployment engines. It abstracts implementation details that developers or infrastructure engineers need to solve when implementing a cross-Region or cross-account pipeline. For example, in cross-Region scenarios, AWS CloudFormation needs artifacts to be replicated to the target Region. For that reason, AWS Key Management Service (AWS KMS) keys, an Amazon Simple Storage Service (Amazon S3) bucket, and policies need to be created for the secondary Region. This enables artifacts to be moved from one Region to another. In cross-account scenarios, CodeDeploy requires a cross-account role with access to the KMS key used to encrypt configuration files. This is the sort of detail that our customers want to avoid dealing with manually.

AWS CodeDeploy is a deployment service that automates application deployment across different scenarios. It deploys to Amazon EC2 instances, on-premises instances, serverless Lambda functions, or Amazon ECS services. It integrates with AWS Identity and Access Management (AWS IAM) to control who can deploy or re-deploy old versions of an application. In the Blue/Green deployment type, it is possible to automate the rollback of a deployment using Amazon CloudWatch Alarms.

CDK Pipelines was designed to automate AWS CloudFormation deployments. Using AWS CDK, these CloudFormation deployments may include deploying application software to instances or containers. However, some customers prefer using CodeDeploy to deploy application software. In this blog post, CDK Pipelines will deploy using CodeDeploy instead of CloudFormation.

A pipeline built with CDK Pipelines that deploys to Amazon ECS using AWS CodeDeploy. It contains at least 5 stages: Source, Build, UpdatePipeline, Assets, and at least one Deployment stage.

Design Considerations

In this post, I’m considering the use of CDK Pipelines to implement different use cases for deploying a service to any combination of accounts (single-account & cross-account) and regions (single-Region & cross-Region) using CodeDeploy. More specifically, there are four problems that need to be solved, which the following sections address:

CodeDeploy Configuration

The most popular options for implementing a Blue/Green deployment type using CodeDeploy are using CloudFormation Hooks or using a CodeDeploy construct. I decided to operate CodeDeploy using its configuration files. This is a flexible design that doesn’t rely on using custom resources, which is another technique customers have used to solve this problem. On each run, a pipeline pushes a container to a repository on Amazon Elastic Container Registry (ECR) and creates a tag. CodeDeploy needs that information to deploy the container.

I recommend creating a pipeline action to scan the AWS CDK cloud assembly and retrieve the repository and tag information. The same action can create the CodeDeploy configuration files. Three configuration files are required to configure CodeDeploy: appspec.yaml, taskdef.json and imageDetail.json. This pipeline action should be executed before the CodeDeploy deployment action. I recommend creating template files for appspec.yaml and taskdef.json. The following script can be used to implement the pipeline action:

#!/bin/sh
##
#
# Action Configure AWS CodeDeploy
# It customizes the files template-appspec.yaml and template-taskdef.json to the environment
#
# Account = The target Account Id
# AppName = Name of the application
# StageName = Name of the stage
# Region = Name of the region (us-east-1, us-east-2)
# PipelineId = Id of the pipeline
# ServiceName = Name of the service. It will be used to define the role and the task definition name
#
# Primary output directory is codedeploy/. All 3 files created (appspec.yaml, imageDetail.json and
# taskdef.json) will be located inside the codedeploy/ directory
#
##
Account=$1
Region=$2
AppName=$3
StageName=$4
PipelineId=$5
ServiceName=$6
repo_name=$(cat assembly*$PipelineId-$StageName/*.assets.json | jq -r '.dockerImages[] | .destinations[] | .repositoryName' | head -1) 
tag_name=$(cat assembly*$PipelineId-$StageName/*.assets.json | jq -r '.dockerImages | to_entries[0].key')  
echo ${repo_name} 
echo ${tag_name} 
printf '{"ImageURI":"%s"}' "$Account.dkr.ecr.$Region.amazonaws.com/${repo_name}:${tag_name}" > codedeploy/imageDetail.json                     
sed 's#APPLICATION#'$AppName'#g' codedeploy/template-appspec.yaml > codedeploy/appspec.yaml 
sed 's#APPLICATION#'$AppName'#g' codedeploy/template-taskdef.json | sed 's#TASK_EXEC_ROLE#arn:aws:iam::'$Account':role/'$ServiceName'#g' | sed 's#fargate-task-definition#'$ServiceName'#g' > codedeploy/taskdef.json 
cat codedeploy/appspec.yaml
cat codedeploy/taskdef.json
cat codedeploy/imageDetail.json

Using a Toolchain

A good strategy is to encapsulate the pipeline inside a Toolchain to abstract how to deploy to different accounts and regions. This helps decouple clients from details such as how the pipeline is created, how CodeDeploy is configured, and how cross-account and cross-Region deployments are implemented. To create the pipeline, deploy a Toolchain stack. Out of the box, it allows different environments to be added as needed. Depending on the requirements, the pipeline may be customized to reflect the different stages or waves that different components might require. For more information, please refer to our best practices on how to automate safe, hands-off deployments and its reference implementation.

In detail, the Toolchain stack follows the builder pattern used throughout the CDK for Java. This is a convenience that allows complex objects to be created using a single statement:

 Toolchain.Builder.create(app, Constants.APP_NAME+"Toolchain")
        .stackProperties(StackProps.builder()
                .env(Environment.builder()
                        .account(Demo.TOOLCHAIN_ACCOUNT)
                        .region(Demo.TOOLCHAIN_REGION)
                        .build())
                .build())
        .setGitRepo(Demo.CODECOMMIT_REPO)
        .setGitBranch(Demo.CODECOMMIT_BRANCH)
        .addStage(
                "UAT",
                EcsDeploymentConfig.CANARY_10_PERCENT_5_MINUTES,
                Environment.builder()
                        .account(Demo.SERVICE_ACCOUNT)
                        .region(Demo.SERVICE_REGION)
                        .build())                                                                                                             
        .build();

In the statement above, the continuous deployment pipeline is created in the TOOLCHAIN_ACCOUNT and TOOLCHAIN_REGION. It implements a stage that builds the source code and creates the Java archive (JAR) using Apache Maven.  The pipeline then creates a Docker image containing the JAR file.

The UAT stage will deploy the service to the SERVICE_ACCOUNT and SERVICE_REGION using the deployment configuration CANARY_10_PERCENT_5_MINUTES. This means 10 percent of the traffic is shifted in the first increment and the remaining 90 percent is deployed 5 minutes later.

To create additional deployment stages, you need a stage name, a CodeDeploy deployment configuration and an environment where it should deploy the service. As mentioned, the pipeline is, by default, a self-mutating pipeline. For example, to add a Prod stage, update the code that creates the Toolchain object and submit this change to the code repository. The pipeline will run and update itself adding a Prod stage after the UAT stage. Next, I show in detail the statement used to add a new Prod stage. The new stage deploys to the same account and Region as in the UAT environment:

... 
        .addStage(
                "Prod",
                EcsDeploymentConfig.CANARY_10_PERCENT_5_MINUTES,
                Environment.builder()
                        .account(Demo.SERVICE_ACCOUNT)
                        .region(Demo.SERVICE_REGION)
                        .build())                                                                                                                                      
        .build();

In the statement above, the Prod stage will deploy new versions of the service using the CodeDeploy deployment configuration CANARY_10_PERCENT_5_MINUTES. This means that 10 percent of traffic is shifted in the first increment, and the remaining 90 percent is shifted to the new version of the application 5 minutes later. Please refer to the Organizing Your AWS Environment Using Multiple Accounts whitepaper for best practices on how to isolate and manage your business applications.

Some customers might find this approach interesting and decide to provide this as an abstraction to their application development teams. In this case, I advise creating a construct that builds such a pipeline. Using a construct would allow for further customization. Examples are stages that promote quality assurance or deploy the service in a disaster recovery scenario.

The implementation creates a stack for the toolchain and another stack for each deployment stage. As an example, consider a toolchain created with a single deployment stage named UAT. After running successfully, the DemoToolchain and DemoService-UAT stacks should be created as in the next image:

Two stacks are needed to create a Pipeline that deploys to a single environment. One stack deploys the Toolchain with the Pipeline and another stack deploys the Service compute infrastructure and CodeDeploy Application and DeploymentGroup. In this example, for an application named Demo that deploys to an environment named UAT, the stacks deployed are: DemoToolchain and DemoService-UAT

CodeDeploy Application and Deployment Group

CodeDeploy configuration requires an application and a deployment group. Depending on the use case, you need to create these in the same or in a different account from the toolchain (pipeline). The pipeline includes the CodeDeploy deployment action that performs the blue/green deployment. My recommendation is to create the CodeDeploy application and deployment group as part of the Service stack. This approach allows you to align the lifecycle of the CodeDeploy application and deployment group with that of the related Service stack instance.

CodePipeline allows you to create a CodeDeploy deployment action that references a CodeDeploy application and deployment group that don’t exist yet. This allows us to implement the following approach:

  • Toolchain stack deploys the pipeline with CodeDeploy deployment action referencing a non-existing CodeDeploy application and deployment group
  • When the pipeline executes, it first deploys the Service stack that creates the related CodeDeploy application and deployment group
  • The next pipeline action executes the CodeDeploy deployment action. When the pipeline executes the CodeDeploy deployment action, the related CodeDeploy application and deployment group will already exist.

Below is the pipeline code that references the (initially non-existing) CodeDeploy application and deployment group.

private IEcsDeploymentGroup referenceCodeDeployDeploymentGroup(
        final Environment env, 
        final String serviceName, 
        final IEcsDeploymentConfig ecsDeploymentConfig, 
        final String stageName) {

    IEcsApplication codeDeployApp = EcsApplication.fromEcsApplicationArn(
            this,
            Constants.APP_NAME + "EcsCodeDeployApp-"+stageName,
            Arn.format(ArnComponents.builder()
                    .arnFormat(ArnFormat.COLON_RESOURCE_NAME)
                    .partition("aws")
                    .region(env.getRegion())
                    .service("codedeploy")
                    .account(env.getAccount())
                    .resource("application")
                    .resourceName(serviceName)
                    .build()));

    IEcsDeploymentGroup deploymentGroup = EcsDeploymentGroup.fromEcsDeploymentGroupAttributes(
            this,
            Constants.APP_NAME + "-EcsCodeDeployDG-"+stageName,
            EcsDeploymentGroupAttributes.builder()
                    .deploymentGroupName(serviceName)
                    .application(codeDeployApp)
                    .deploymentConfig(ecsDeploymentConfig)
                    .build());

    return deploymentGroup;
}

To make this work, you should use the same application name and deployment group name values when creating the CodeDeploy deployment action in the pipeline and when creating the CodeDeploy application and deployment group in the Service stack (where the Amazon ECS infrastructure is deployed). This approach is necessary to avoid a circular dependency error when trying to create the CodeDeploy application and deployment group inside the Service stack and reference these objects to configure the CodeDeploy deployment action inside the pipeline. Below is the code that uses Service stack construct ID to name the CodeDeploy application and deployment group. I set the Service stack construct ID to the same name I used when creating the CodeDeploy deployment action in the pipeline.

   // configure AWS CodeDeploy Application and DeploymentGroup
   EcsApplication app = EcsApplication.Builder.create(this, "BlueGreenApplication")
           .applicationName(id)
           .build();

   EcsDeploymentGroup.Builder.create(this, "BlueGreenDeploymentGroup")
           .deploymentGroupName(id)
           .application(app)
           .service(albService.getService())
           .role(createCodeDeployExecutionRole(id))
           .blueGreenDeploymentConfig(EcsBlueGreenDeploymentConfig.builder()
                   .blueTargetGroup(albService.getTargetGroup())
                   .greenTargetGroup(tgGreen)
                   .listener(albService.getListener())
                   .testListener(listenerGreen)
                   .terminationWaitTime(Duration.minutes(15))
                   .build())
           .deploymentConfig(deploymentConfig)
           .build();

CDK Pipelines roles and permissions

CDK Pipelines creates roles and permissions the pipeline uses to execute deployments in different scenarios of regions and accounts. When using CodeDeploy in cross-account scenarios, CDK Pipelines deploys a cross-account support stack that creates a pipeline action role for the CodeDeploy action. This cross-account support stack is defined in a JSON file that needs to be published to the AWS CDK assets bucket in the target account. If the pipeline has the self-mutation feature on (default), the UpdatePipeline stage will do a cdk deploy to deploy changes to the pipeline. In cross-account scenarios, this deployment also involves deploying/updating the cross-account support stack. For this, the SelfMutate action in the UpdatePipeline stage needs to assume the CDK file-publishing and deploy roles in the remote account.

The IAM role associated with the AWS CodeBuild project that runs the UpdatePipeline stage does not have these permissions by default. CDK Pipelines cannot grant these permissions automatically, because the information about the permissions that the cross-account stack needs is only available after the AWS CDK app finishes synthesizing. At that point, the permissions that the pipeline has are already locked in. Hence, for cross-account scenarios, the toolchain should extend the permissions of the pipeline’s UpdatePipeline stage to include the file-publishing and deploy roles.

In cross-account environments it is possible to manually add these permissions to the UpdatePipeline stage. To accomplish that, the Toolchain stack may be used to hide this sort of implementation detail. In the end, a method like the one below can be used to add these missing permissions. For each mapping of stage and environment in the pipeline, it validates whether the target account is different from the account where the pipeline is deployed. When that criterion is met, it grants the UpdatePipeline stage permission to assume the CDK bootstrap roles (tagged with the key aws-cdk:bootstrap-role) in the target account (with the tag value file-publishing or deploy). The example below shows how to add permissions to the UpdatePipeline stage:

private void grantUpdatePipelineCrossAccoutPermissions(Map<String, Environment> stageNameEnvironment) {

    if (!stageNameEnvironment.isEmpty()) {

        this.pipeline.buildPipeline();
        for (String stage : stageNameEnvironment.keySet()) {

            HashMap<String, String[]> condition = new HashMap<>();
            condition.put(
                    "iam:ResourceTag/aws-cdk:bootstrap-role",
                    new String[] {"file-publishing", "deploy"});
            pipeline.getSelfMutationProject()
                    .getRole()
                    .addToPrincipalPolicy(PolicyStatement.Builder.create()
                            .actions(Arrays.asList("sts:AssumeRole"))
                            .effect(Effect.ALLOW)
                            .resources(Arrays.asList("arn:*:iam::"
                                    + stageNameEnvironment.get(stage).getAccount() + ":role/*"))
                            .conditions(new HashMap<String, Object>() {{
                                    put("ForAnyValue:StringEquals", condition);
                            }})
                            .build());
        }
    }
}

The Deployment Stage

Let’s consider a pipeline that has a single deployment stage, UAT. The UAT stage deploys a DemoService. For that, it requires four actions: DemoService-UAT.Prepare, DemoService-UAT.Deploy, ConfigureBlueGreenDeploy, and Deploy.

When using CodeDeploy, the deployment stage is expected to have four actions: two actions to create a CloudFormation change set and deploy the ECS or compute infrastructure, one action to configure CodeDeploy, and a final action that deploys the application using CodeDeploy. In the diagram, these are (in order): DemoService-UAT.Prepare, DemoService-UAT.Deploy, ConfigureBlueGreenDeploy, and Deploy.

The DemoService-UAT.Deploy action will create the ECS resources and the CodeDeploy application and deployment group. The ConfigureBlueGreenDeploy action will read the AWS CDK cloud assembly. It uses the configuration files to identify the Amazon Elastic Container Registry (Amazon ECR) repository and the container image tag pushed. The pipeline will send this information to the Deploy action. The Deploy action starts the deployment using CodeDeploy.
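For readers who want to see what that final step looks like outside of the pipeline, the sketch below shows how a CodeDeploy ECS blue/green deployment can be started programmatically with the AWS SDK for Python (Boto3). It is an illustration only, not code from the demo application; the application name, deployment group name, and appspec.yaml path are assumptions that mirror the configuration files generated earlier.

import boto3

# Assumed values that mirror the earlier configuration step
APPLICATION_NAME = "DemoService-UAT"
DEPLOYMENT_GROUP_NAME = "DemoService-UAT"


def start_blue_green_deployment() -> str:
    """Starts a CodeDeploy ECS blue/green deployment from a rendered appspec.yaml."""
    with open("codedeploy/appspec.yaml") as appspec_file:
        appspec_content = appspec_file.read()

    codedeploy = boto3.client("codedeploy")
    response = codedeploy.create_deployment(
        applicationName=APPLICATION_NAME,
        deploymentGroupName=DEPLOYMENT_GROUP_NAME,
        revision={
            "revisionType": "AppSpecContent",
            "appSpecContent": {"content": appspec_content},
        },
    )
    return response["deploymentId"]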

Solution Overview

As a convenience, I created an application, written in Java, that solves all these challenges and can be used as an example. The application deployment follows the same 5 steps for all deployment scenarios of account and Region, and this includes the scenarios represented in the following design:

A pipeline created by a Toolchain should be able to deploy to any combination of accounts and regions. This includes four scenarios: single-account and single-Region, single-account and cross-Region, cross-account and single-Region and cross-account and cross-Region

Conclusion

In this post, I identified, explained and solved challenges associated with the creation of a pipeline that deploys a service to Amazon ECS using CodeDeploy in different combinations of accounts and regions. I also introduced a demo application that implements these recommendations. The sample code can be extended to implement more elaborate scenarios. These scenarios might include automated testing, automated deployment rollbacks, or disaster recovery. I wish you success in your transformative journey.

Luiz Decaro

Luiz is a Principal Solutions Architect at Amazon Web Services (AWS). He focuses on helping customers from the Financial Services Industry succeed in the cloud. Luiz holds a master’s in software engineering and he triggered his first continuous deployment pipeline in 2005.

Use SAML with Amazon Cognito to support a multi-tenant application with a single user pool

Post Syndicated from Neela Kulkarni original https://aws.amazon.com/blogs/security/use-saml-with-amazon-cognito-to-support-a-multi-tenant-application-with-a-single-user-pool/

Amazon Cognito is a customer identity and access management solution that scales to millions of users. With Cognito, you have four ways to secure multi-tenant applications: user pools, application clients, groups, or custom attributes. In an earlier blog post titled Role-based access control using Amazon Cognito and an external identity provider, you learned how to configure Cognito authentication and authorization with a single tenant. In this post, you will learn to configure Cognito with a single user pool for multiple tenants to securely access a business-to-business application by using SAML custom attributes. With custom-attribute–based multi-tenancy, you can store tenant identification data like tenantName as a custom attribute in a user’s profile and pass it to your application. You can then handle multi-tenancy logic in your application and backend services. With this approach, you can use a unified sign-up and sign-in experience for your users. To identify the user’s tenant, your application can use the tenantName custom attribute.

One Cognito user pool for multiple customers

Customers like the simplicity of using a single Cognito user pool for their multi-customer application. With this approach, your customers will use the same URL to access the application. You will set up each new customer by configuring SAML 2.0 integration with the customer’s external identity provider (IdP). Your customers can control access to your application by using an external identity store, such as Google Workspace, Okta, or Active Directory Federation Service (AD FS), in which they can create, manage, and revoke access for their users.

After SAML integration is configured, Cognito returns a JSON web token (JWT) to the frontend during the user authentication process. This JWT contains attributes your application can use for authorization and access control. The token contains claims about the identity of the authenticated user, such as name and email. You can use this identity information inside your application. You can also configure Cognito to add custom attributes to the JWT, such as tenantName.

In this post, we demonstrate the approach of keeping a mapping between a user’s email domain and tenant name in an Amazon DynamoDB table. The DynamoDB table will have an emailDomain field as a key and a corresponding tenantName field.

Cognito architecture

To illustrate how this works, we’ll start with a demo application that was introduced in the earlier blog post. The demo application is implemented by using Amazon Cognito, AWS Amplify, Amazon API Gateway, AWS Lambda, Amazon DynamoDB, Amazon Simple Storage Service (Amazon S3), and Amazon CloudFront to achieve a serverless architecture. This architecture is shown in Figure 1.

Figure 1: Demo application architecture

The workflow that happens when you access the web application for the first time using your browser is as follows (the numbered steps correspond to the numbered labels in the diagram):

  1. The client-side/frontend of the application prompts you to enter the email that you want to use to sign in to the application.
  2. The application invokes the Tenant Match API action through API Gateway, which, in turn, calls the Lambda function that takes the email address as an input and queries it against the DynamoDB table with the email domain. Figure 2 shows the data stored in DynamoDB, which includes the tenant name and IdP ID. You can add additional flexibility to this solution by adding web client IDs or custom redirect URLs. For the purpose of this example, we are using the same redirect URL for all tenants (the client application).
    Figure 2: DynamoDB tenant table

  3. If a matching record is found, the Lambda function returns the record to the AWS Amplify frontend application.
  4. The client application uses the IdP ID from the response and passes it to Cognito for federated login. Cognito then reroutes the login request to the corresponding IdP. The AWS Amplify frontend application then redirects the browser to the IdP.
  5. At the IdP sign-in page, you sign in with a valid user account (for example, [email protected] or [email protected]). After you sign in successfully, a SAML response is sent back from the IdP to Cognito.

    You can review the SAML content by using the instructions in How to view a SAML response in your browser for troubleshooting, as shown in Figure 3.

    Figure 3: SAML content

  6. Cognito handles the SAML response and maps the SAML attributes to a just-in-time user profile. The SAML groups attribute is mapped to a custom user pool attribute named custom:groups.
  7. To identify the tenant, additional attributes are populated in the JWT. After successful authentication, a PreTokenGeneration Lambda function is invoked, which reads the mapped custom:groups attribute value from SAML, parses it, and converts it to an array. After that, the function parses the email address and captures the domain name. It then queries the DynamoDB table for the tenantName by using the email domain name. Finally, the function sets the custom:domainName and custom:tenantName attributes in the JWT, as shown following.
    "email": "[email protected]" ( Standard existing profile attribute )
    New attributes:
    "cognito:groups": [.                           
    "pet-app-users",
    "pet-app-admin"
    ],
    "custom:tenantName": "AnyCompany"
    "custom:domainName": "anycompany.com"

    This attribute conversion is optional and demonstrates how you can use a PreTokenGeneration Lambda invocation to customize your JWT token claims, mapping the IdP groups to the attributes your application recognizes. You can also use this invocation to make additional authorization decisions. For example, if a user is a member of multiple groups, you may choose to map only one of them. A minimal sketch of such a trigger function is shown after this list.

  8. Amazon Cognito returns the JWT tokens to the AWS Amplify frontend application. The Amplify client library stores the tokens and handles refreshes. This token is used to make calls to protected APIs in Amazon API Gateway.
  9. API Gateway uses a Cognito user pools authorizer to validate the JWT’s signature and expiration. If this is successful, API Gateway passes the JWT to the application’s Lambda function (also referred to as the backend).
  10. The backend application code reads the cognito:groups claim from the JWT and decides if the action is allowed. If the user is a member of the right group, then the action is allowed; otherwise the action is denied.
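The deployed demo implements this trigger in index.js; to make the logic in step 7 concrete, here is a minimal Python sketch of an equivalent pre token generation trigger. The TenantTable name, its emailDomain and tenantName fields, and the custom:groups attribute come from this post, while the exact string format of custom:groups and the parsing shown here are assumptions.

import boto3

dynamodb = boto3.resource("dynamodb")
tenant_table = dynamodb.Table("TenantTable")  # table name used in this post


def lambda_handler(event, context):
    """Pre token generation trigger: maps SAML groups and adds tenant claims."""
    attributes = event["request"]["userAttributes"]
    email = attributes.get("email", "")
    domain = email.split("@")[-1].lower()

    # Assumed format: custom:groups arrives as a string such as "[pet-app-users, pet-app-admin]"
    raw_groups = attributes.get("custom:groups", "")
    groups = [g.strip() for g in raw_groups.strip("[]").split(",") if g.strip()]

    # Look up the tenant name by email domain
    item = tenant_table.get_item(Key={"emailDomain": domain}).get("Item", {})
    tenant_name = item.get("tenantName", "")

    event["response"]["claimsOverrideDetails"] = {
        "claimsToAddOrOverride": {
            "custom:tenantName": tenant_name,
            "custom:domainName": domain,
        },
        "groupOverrideDetails": {"groupsToOverride": groups},
    }
    return event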

Implement the solution

You can implement this example application by using an AWS CloudFormation template to provision your cloud application and AWS resources.

To deploy the demo application described in this post, you need the following prerequisites:

  1. An AWS account.
  2. Familiarity with navigating the AWS Management Console or AWS CLI.
  3. Familiarity with deploying CloudFormation templates.

To deploy the template

  • Choose the following Launch Stack button to launch a CloudFormation stack in your account.

    Note: The stack will launch in the N. Virginia (us-east-1) Region. To deploy this solution into other AWS Regions, download the solution’s CloudFormation template from GitHub, modify it, and deploy it to the selected Region.

The stack creates a Cognito user pool called ExternalIdPDemoPoolXXXX in the AWS Region that you have specified. The CloudFormation Outputs field contains a list of values that you will need for further configuration.

IdP configuration

The next step is to configure your IdP. Each IdP has its own procedure for configuration, but there are some common steps you need to follow.

To configure your IdP

  1. Provide the IdP with the values for the following two properties:
    • Single sign on URL / Assertion Consumer Service URL / ACS URL (for this example, https://<CognitoDomainURL>/saml2/idpresponse)
    • Audience URI / SP Entity ID / Entity ID: (For this example, urn:amazon:cognito:sp:<yourUserPoolID>)
  2. Configure the field mapping for the SAML response in the IdP. Map the first name, last name, email, and groups (as a multi-value attribute) into SAML response attributes with the names firstName, lastName, email, and groups, respectively.
    • Recommended: Filter the mapped groups to only those that are relevant to the application (for example, by a prefix filter). There is a 2,048-character limit on the custom attribute, so filtering helps avoid exceeding the character limit, and also helps avoid passing irrelevant information to the application.
  3. In each IdP, create two demo groups called pet-app-users and pet-app-admins, and create two demo users, for example, [email protected] and [email protected], and then assign one to each group, respectively.

To illustrate, we set up three different IdPs to represent three different tenants. Follow the configuration instructions for each IdP that you choose to integrate.

You will need the metadata URL or file from each IdP, because you will use this to configure your user pool integration. For more information, see Integrating third-party SAML identity providers with Amazon Cognito user pools.

Cognito configuration

After your IdPs are configured and your CloudFormation stack is deployed, you can configure Cognito.

To configure Cognito

  1. Use your browser to navigate to the Cognito console, and for User pool name, select the Cognito user pool.
    Figure 4: Select the Cognito user pool

  2. On the Sign-in experience screen, on the Federated identity provider sign-in tab, choose Add identity provider.
  3. Choose SAML for the sign-in option, and then enter the values for your IdP. You can either upload the metadata XML file or provide the metadata endpoint URL. Add mapping for the attributes as shown in Figure 5.
    Figure 5: Attribute mappings for the IdP

    Upon completion you will see the new IdP displayed as shown in Figure 6.

    Figure 6: List of federated IdPs

  4. On the App integration tab, select the app client that was created by the CloudFormation template.
    Figure 7: Select the app client

  5. Under Hosted UI, choose Edit. Under Identity providers, select the Identity Providers that you want to set up for federated login, and save the change.
    Figure 8: Select identity providers

API gateway

The example application uses a serverless backend. There are two API operations defined in this example, as shown in Figure 9. One operation gets tenant details and the other is the /pets API operation, which fetches information on pets based on user identity. The TenantMatch API operation will be run when you sign in with your email address. The operation passes your email address to the backend Lambda function.

Figure 9: Example APIs

Lambda functions

You will see three Lambda functions deployed in the example application, as shown in Figure 10.

Figure 10: Lambda functions

The first one is GetTenantInfo, which is used for the TenantMatch API operation. It reads the data from the TenantTable based on the email domain and passes the record back to the application. The second function is PreTokenGeneration, which reads the mapped custom:groups attribute value, parses it, converts it to an array, and then stores it in the cognito:groups claim. This function is invoked by the Cognito user pool after sign-in is successful. To customize the mapping, you can edit the Lambda function’s code in the index.js file and redeploy. The third Lambda function is added to support the Pets API operation.

DynamoDB tables

You will see three DynamoDB tables deployed in the example application, as shown in Figure 11.

Figure 11: DynamoDB tables

The TenantTable table holds the tenant details, where you must add the mapping between the customer domain and the IdP ID set up in Cognito. This approach can be expanded to add more flexibility in case you want to add custom redirect URLs or Cognito app IDs for each tenant. You must create entries to correspond to the IdPs you have configured, as shown in Figure 12.

Figure 12: Tenant IdP mappings table

In addition to TenantTable, there is the ExternalIdPDemo-ItemsTable table, which holds the data related to the Pets application, based on user identity. There is also ExternalIdPDemo-UsersTable, which holds user details like the username, last forced sign-out time, and TTL required for the application to manage the user session.

You can now sign in to the example application through each IdP by navigating to the application URL found in the CloudFormation Outputs section, as shown in Figure 13.

Figure 13: Cognito sign-in screen

You will be redirected to the IdP, as shown in Figure 14.

Figure 14: Google Workspace sign-in screen

The AWS Amplify frontend application parses the JWT to identify the tenant name and provide authorization based on group membership, as shown in Figure 15.

Figure 15: Application home screen upon successful sign-in

If a different user logs in with a different role, the AWS Amplify frontend application provides authorization based on specific content of the JWT.

Conclusion

You can integrate your application with your customer’s IdP of choice for authentication and authorization and map information from the IdP to the application. By using Amazon Cognito, you can normalize the structure of the JWT token that is used for this process, so that you can add multiple IdPs, each for a different tenant, through a single Cognito user pool. You can do all this without changing application code. The native integration of Amazon API Gateway with the Cognito user pools authorizer streamlines your validation of the JWT integrity, and after the JWT has been validated, you can use it to make authorization decisions in your application’s backend. By following the example in this post, you can focus on what differentiates your application, and let AWS do the undifferentiated heavy lifting of identity management for your customer-facing applications.

For the code examples described in this post, see the amazon-cognito-example-for-multi-tenant code repository on GitHub. To learn more about using Cognito with external IdPs, see the Amazon Cognito documentation. You can also learn to build software as a service (SaaS) application architectures on AWS. If you have any questions about Cognito or any other AWS services, you may post them to AWS re:Post.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Security, Identity, & Compliance re:Post or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Ray Zaman

A Principal Solutions Architect with AWS, Ray has over 30 years of experience helping customers in finance, healthcare, insurance, manufacturing, media, petrochemical, pharmaceutical, public utility, retail, semiconductor, telecommunications, and waste management industries build technology solutions.

Neela Kulkarni

Neela is a Solutions Architect with Amazon Web Services. She primarily serves independent software vendors in the Northeast US, providing architectural guidance and best practice recommendations for new and existing workloads. Outside of work, she enjoys traveling, swimming, and spending time with her family.

Yuri Duchovny

Yuri is a New York–based Solutions Architect specializing in cloud security, identity, and compliance. He supports cloud transformations at large enterprises, helping them make optimal technology and organizational decisions. Prior to his AWS role, Yuri’s areas of focus included application and networking security, DoS, and fraud protection. Outside of work, he enjoys skiing, sailing, and traveling the world.

Abdul Qadir

Abdul is an AWS Solutions Architect based in New Jersey. He works with independent software vendors in the Northeast US and helps customers build well-architected solutions on the AWS Cloud platform.

How AWS protects customers from DDoS events

Post Syndicated from Tom Scholl original https://aws.amazon.com/blogs/security/how-aws-protects-customers-from-ddos-events/

At Amazon Web Services (AWS), security is our top priority. Security is deeply embedded into our culture, processes, and systems; it permeates everything we do. What does this mean for you? We believe customers can benefit from learning more about what AWS is doing to prevent and mitigate customer-impacting security events.

Since late August 2023, AWS has detected and been protecting customer applications from a new type of distributed denial of service (DDoS) event. DDoS events attempt to disrupt the availability of a targeted system, such as a website or application, reducing the performance for legitimate users. Examples of DDoS events include HTTP request floods, reflection/amplification attacks, and packet floods. The DDoS events AWS detected were a type of HTTP/2 request flood, which occurs when a high volume of illegitimate web requests overwhelms a web server’s ability to respond to legitimate client requests.

Between August 28 and August 29, 2023, proactive monitoring by AWS detected an unusual spike in HTTP/2 requests to Amazon CloudFront, peaking at over 155 million requests per second (RPS). Within minutes, AWS determined the nature of this unusual activity and found that CloudFront had automatically mitigated a new type of HTTP request flood DDoS event, now called an HTTP/2 rapid reset attack. Over those two days, AWS observed and mitigated over a dozen HTTP/2 rapid reset events, and through the month of September, continued to see this new type of HTTP/2 request flood. AWS customers who had built DDoS-resilient architectures with services like Amazon CloudFront and AWS Shield were able to protect their applications’ availability.

Figure 1: Global HTTP requests per second, September 13 – 16

Overview of HTTP/2 rapid reset attacks

HTTP/2 allows for multiple distinct logical connections to be multiplexed over a single HTTP session. This is a change from HTTP 1.x, in which each HTTP session was logically distinct. HTTP/2 rapid reset attacks consist of multiple HTTP/2 connections with requests and resets in rapid succession. For example, a series of requests for multiple streams will be transmitted followed up by a reset for each of those requests. The targeted system will parse and act upon each request, generating logs for a request that is then reset, or cancelled, by a client. The system performs work generating those logs even though it doesn’t have to send any data back to a client. A bad actor can abuse this process by issuing a massive volume of HTTP/2 requests, which can overwhelm the targeted system, such as a website or application.

Keep in mind that HTTP/2 rapid reset attacks are just a new type of HTTP request flood. To defend against these sorts of DDoS attacks, you can implement an architecture that helps you specifically detect unwanted requests as well as scale to absorb and block those malicious HTTP requests.

Building DDoS resilient architectures

As an AWS customer, you benefit from both the security built into the global cloud infrastructure of AWS as well as our commitment to continuously improve the security, efficiency, and resiliency of AWS services. For prescriptive guidance on how to improve DDoS resiliency, AWS has built tools such as the AWS Best Practices for DDoS Resiliency. It describes a DDoS-resilient reference architecture as a guide to help you protect your application’s availability. While several built-in forms of DDoS mitigation are included automatically with AWS services, your DDoS resilience can be improved by using an AWS architecture with specific services and by implementing additional best practices for each part of the network flow between users and your application.

For example, you can use AWS services that operate from edge locations, such as Amazon CloudFront, AWS Shield, Amazon Route 53, and Route 53 Application Recovery Controller to build comprehensive availability protection against known infrastructure layer attacks. These services can improve the DDoS resilience of your application when serving any type of application traffic from edge locations distributed around the world. Your application can be on-premises or in AWS when you use these AWS services to help you prevent unnecessary requests reaching your origin servers. As a best practice, you can run your applications on AWS to get the additional benefit of reducing the exposure of your application endpoints to DDoS attacks and to protect your application’s availability and optimize the performance of your application for legitimate users. You can use Amazon CloudFront (and its HTTP caching capability), AWS WAF, and Shield Advanced automatic application layer protection to help prevent unnecessary requests reaching your origin during application layer DDoS attacks.
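To make the request-blocking idea concrete, the following is a minimal sketch (not an AWS-provided sample) of creating an AWS WAF web ACL with a rate-based rule using the AWS SDK for Python (Boto3). The web ACL name, rate limit, and metric names are assumptions you would tune for your own traffic profile.

import boto3

# CLOUDFRONT-scoped web ACLs must be created in the us-east-1 Region
wafv2 = boto3.client("wafv2", region_name="us-east-1")

response = wafv2.create_web_acl(
    Name="ddos-resilience-demo",  # assumed name
    Scope="CLOUDFRONT",
    DefaultAction={"Allow": {}},
    Description="Blocks IPs that exceed a request-rate threshold",
    Rules=[
        {
            "Name": "rate-limit-per-ip",
            "Priority": 0,
            "Statement": {
                "RateBasedStatement": {
                    "Limit": 2000,  # requests per 5-minute window, per source IP
                    "AggregateKeyType": "IP",
                }
            },
            "Action": {"Block": {}},
            "VisibilityConfig": {
                "SampledRequestsEnabled": True,
                "CloudWatchMetricsEnabled": True,
                "MetricName": "RateLimitPerIp",
            },
        }
    ],
    VisibilityConfig={
        "SampledRequestsEnabled": True,
        "CloudWatchMetricsEnabled": True,
        "MetricName": "DdosResilienceDemo",
    },
)
print(response["Summary"]["ARN"])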

Putting our knowledge to work for AWS customers

AWS remains vigilant, working to help prevent security issues from causing disruption to your business. We believe it’s important to share not only how our services are designed, but also how our engineers take deep, proactive ownership of every aspect of our services. As we work to defend our infrastructure and your data, we look for ways to help protect you automatically. Whenever possible, AWS Security and its systems disrupt threats where that action will be most impactful; often, this work happens largely behind the scenes. We work to mitigate threats by combining our global-scale threat intelligence and engineering expertise to help make our services more resilient against malicious activities. We’re constantly looking around corners to improve the efficiency and security of services including the protocols we use in our services, such as Amazon CloudFront, as well as AWS security tools like AWS WAF, AWS Shield, and Amazon Route 53 Resolver DNS Firewall.

In addition, our work extends security protections and improvements far beyond the bounds of AWS itself. AWS regularly works with the wider community, such as computer emergency response teams (CERT), internet service providers (ISP), domain registrars, or government agencies, so that they can help disrupt an identified threat. We also work closely with the security community, other cloud providers, content delivery networks (CDNs), and collaborating businesses around the world to isolate and take down threat actors. For example, in the first quarter of 2023, we stopped over 1.3 million botnet-driven DDoS attacks, and we traced back and worked with external parties to dismantle the sources of 230 thousand L7/HTTP DDoS attacks. The effectiveness of our mitigation strategies relies heavily on our ability to quickly capture, analyze, and act on threat intelligence. By taking these steps, AWS is going beyond just typical DDoS defense, and moving our protection beyond our borders. To learn more behind this effort, please read How AWS threat intelligence deters threat actors.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Mark Ryland

Mark is the director of the Office of the CISO for AWS. He has over 30 years of experience in the technology industry, and has served in leadership roles in cybersecurity, software engineering, distributed systems, technology standardization, and public policy. Previously, he served as the Director of Solution Architecture and Professional Services for the AWS World Public Sector team.

Tom Scholl

Tom is Vice President and Distinguished Engineer at AWS.

Delegating permission set management and account assignment in AWS IAM Identity Center

Post Syndicated from Jake Barker original https://aws.amazon.com/blogs/security/delegating-permission-set-management-and-account-assignment-in-aws-iam-identity-center/

In this blog post, we look at how you can use AWS IAM Identity Center (successor to AWS Single Sign-On) to delegate the management of permission sets and account assignments. Delegating the day-to-day administration of user identities and entitlements allows teams to move faster and reduces the burden on your central identity administrators.

IAM Identity Center helps you securely create or connect your workforce identities and manage their access centrally across AWS accounts and applications. Identity Center requires accounts to be managed by AWS Organizations. Administration of Identity Center can be delegated to a member account (an account other than the management account). We recommend that you delegate Identity Center administration to limit who has access to the management account and use the management account only for tasks that require the management account.

Delegated administration is different from the delegation of permission sets and account assignments, which this blog covers. For more information on delegated administration, see Getting started with AWS IAM Identity Center delegated administration. The patterns in this blog post work whether Identity Center is delegated to a member account or remains in the management account.

Permission sets are used to define the level of access that users or groups have to an AWS account. Permission sets can contain AWS managed policies, customer managed policies, inline policies, and permissions boundaries.

Solution overview

As your organization grows, you might want to start delegating permissions management and account assignment to give your teams more autonomy and reduce the burden on your identity team. Alternatively, you might have different business units or customers, operating out of their own organizational units (OUs), that want more control over their own identity management.

In this scenario, an example organization has three developer teams: Red, Blue, and Yellow. Each of the teams operate out of its own OU. IAM Identity Center has been delegated from the management account to the Identity Account. Figure 1 shows the structure of the example organization.

Figure 1: The structure of the organization in the example scenario

The organization in this scenario has an existing collection of permission sets. They want to delegate the management of permission sets and account assignments away from their central identity management team.

  • The Red team wants to be able to assign the existing permission sets to accounts in its OU. This is an accounts-based model.
  • The Blue team wants to edit and use a single permission set and then assign that set to the team’s single account. This is a permission-based model.
  • The Yellow team wants to create, edit, and use a permission set tagged with Team: Yellow and then assign that set to all of the accounts in its OU. This is a tag-based model.

We’ll look at the permission sets needed for these three use cases.

Note: If you’re using the AWS Management Console, additional permissions are required.

Use case 1: Accounts-based model

In this use case, the Red team is given permission to assign existing permission sets to the three accounts in its OU. This will also include permissions to remove account assignments.

Using this model, an organization can create generic permission sets that can be assigned to its AWS accounts. It helps reduce complexity for the delegated administrators and verifies that they are using permission sets that follow the organization’s best practices. These permission sets restrict access based on services and features within those services, rather than specific resources.

{
  "Version": "2012-10-17",
  "Statement": [
    {
        "Effect": "Allow",
        "Action": [
            "sso:CreateAccountAssignment",
            "sso:DeleteAccountAssignment",
            "sso:ProvisionPermissionSet"
        ],
        "Resource": [
            "arn:aws:sso:::instance/ssoins-<sso-ins-id>",
            "arn:aws:sso:::permissionSet/ssoins-<sso-ins-id>/*",
            "arn:aws:sso:::account/112233445566",
            "arn:aws:sso:::account/223344556677",
            "arn:aws:sso:::account/334455667788"

        ]
    }
  ]
}

In the preceding policy, the principal can assign existing permission sets to the three AWS accounts with the IDs 112233445566, 223344556677 and 334455667788. This includes administration permission sets, so carefully consider which accounts you allow the permission sets to be assigned to.
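With the preceding policy in place, a delegated administrator on the Red team could assign an existing permission set to one of those accounts. The following is a hypothetical sketch using the AWS SDK for Python (Boto3); the permission set ARN and the Identity Center group ID are placeholders rather than values from this post.

import boto3

sso_admin = boto3.client("sso-admin")

# Placeholder values: replace with your instance ARN, permission set ARN,
# target account ID, and the Identity Center group ID to assign
response = sso_admin.create_account_assignment(
    InstanceArn="arn:aws:sso:::instance/ssoins-<sso-ins-id>",
    PermissionSetArn="arn:aws:sso:::permissionSet/ssoins-<sso-ins-id>/ps-<ps-id>",
    TargetId="112233445566",
    TargetType="AWS_ACCOUNT",
    PrincipalType="GROUP",
    PrincipalId="<identity-store-group-id>",
)
print(response["AccountAssignmentCreationStatus"]["Status"])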

The arn:aws:sso:::instance/ssoins-<sso-ins-id> is the IAM Identity Center instance ID ARN. It can be found using either the AWS Command Line Interface (AWS CLI) v2 with the list-instances API or the AWS Management Console.

Use the AWS CLI

Use the AWS Command Line Interface (AWS CLI) to run the following command:

aws sso-admin list-instances

You can also use AWS CloudShell to run the command.

Use the AWS Management Console

Use the Management Console to navigate to the IAM Identity Center in your AWS Region and then select Choose your identity source on the dashboard.

Figure 2: The IAM Identity Center instance ID ARN in the console

Use case 2: Permission-based model

For this example, the Blue team is given permission to edit one or more specific permission sets and then assign those permission sets to a single account. The following permissions allow the team to use managed and inline policies.

This model allows the delegated administrator to use fine-grained permissions on a specific AWS account. It’s useful when the team wants total control over the permissions in its AWS account, including the ability to create additional roles with administrative permissions. In these cases, the permissions are often better managed by the team that operates the account because it has a better understanding of the services and workloads.

Granting complete control over permissions can lead to unintended or undesired outcomes. Permission sets are still subject to IAM evaluation and authorization, which means that service control policies (SCPs) can be used to deny specific actions.

{
  "Version": "2012-10-17",
  "Statement": [
    {
        "Effect": "Allow",
        "Action": [
            "sso:AttachCustomerManagedPolicyReferenceToPermissionSet",
            "sso:AttachManagedPolicyToPermissionSet",
            "sso:CreateAccountAssignment",
            "sso:DeleteAccountAssignment",
            "sso:DeleteInlinePolicyFromPermissionSet",
            "sso:DetachCustomerManagedPolicyReferenceFromPermissionSet",
            "sso:DetachManagedPolicyFromPermissionSet",
            "sso:ProvisionPermissionSet",
            "sso:PutInlinePolicyToPermissionSet",
            "sso:UpdatePermissionSet"
        ],
        "Resource": [
            "arn:aws:sso:::instance/ssoins-<sso-ins-id>",
            "arn:aws:sso:::permissionSet/ssoins-<sso-ins-id>/ps-1122334455667788",
            "arn:aws:sso:::account/445566778899"
        ]
    }
  ]
}

Here, the principal can edit the permission set arn:aws:sso:::permissionSet/ssoins-<sso-ins-id>/ps-1122334455667788 and assign it to the AWS account 445566778899. The editing rights include customer managed policies, AWS managed policies, and inline policies.

If you want to use the preceding policy, replace the missing and example resource values with your own IAM Identity Center instance ID and account numbers.

In the preceding policy, the arn:aws:sso:::permissionSet/ssoins-<sso-ins-id>/ps-1122334455667788 is the permission set ARN. You can find this ARN through the console, or by using the AWS CLI command to list all of the permission sets:

aws sso-admin list-permission-sets --instance-arn <instance arn from above>

This permission set can also be applied to multiple accounts—similar to the first use case—by adding additional account IDs to the list of resources. Likewise, additional permission sets can be added so that the user can edit multiple permission sets and assign them to a set of accounts.
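
As a sketch of how the Blue team might exercise these permissions, the following boto3 example replaces the permission set's inline policy and then re-provisions it to the assigned account. The inline policy content is a placeholder; substitute your own instance ID, permission set ARN, and policy.

import json
import boto3

sso_admin = boto3.client('sso-admin')
instance_arn = 'arn:aws:sso:::instance/ssoins-<sso-ins-id>'
permission_set_arn = 'arn:aws:sso:::permissionSet/ssoins-<sso-ins-id>/ps-1122334455667788'

# Placeholder inline policy for illustration only.
inline_policy = {
    'Version': '2012-10-17',
    'Statement': [{'Effect': 'Allow', 'Action': 's3:ListAllMyBuckets', 'Resource': '*'}]
}

sso_admin.put_inline_policy_to_permission_set(
    InstanceArn=instance_arn,
    PermissionSetArn=permission_set_arn,
    InlinePolicy=json.dumps(inline_policy),
)

# Push the updated permission set to the account it is assigned to.
sso_admin.provision_permission_set(
    InstanceArn=instance_arn,
    PermissionSetArn=permission_set_arn,
    TargetType='AWS_ACCOUNT',
    TargetId='445566778899',
)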

Use case 3: Tag-based model

For this example, the Yellow team is given permission to create, edit, and use permission sets tagged with Team: Yellow. Then they can assign those tagged permission sets to all of their accounts.

This example can be used by an organization to allow a team to freely create and edit permission sets and then assign them to the team’s accounts. It uses tagging as a mechanism to control which permission sets can be created and edited. Permission sets without the correct tag cannot be altered.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sso:CreatePermissionSet",
                "sso:DescribePermissionSet",
                "sso:UpdatePermissionSet",
                "sso:DeletePermissionSet",
                "sso:DescribePermissionSetProvisioningStatus",
                "sso:DescribeAccountAssignmentCreationStatus",
                "sso:DescribeAccountAssignmentDeletionStatus",
                "sso:TagResource"
            ],
            "Resource": [
                "arn:aws:sso:::instance/ssoins-<sso-ins-id>"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "sso:DescribePermissionSet",
                "sso:UpdatePermissionSet",
                "sso:DeletePermissionSet"
                	
            ],
            "Resource": "arn:aws:sso:::permissionSet/ssoins-<sso-ins-id>/*",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/Team": "Yellow"
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": [
                "sso:CreatePermissionSet"
            ],
            "Resource": "arn:aws:sso:::permissionSet/ssoins-<sso-ins-id>/*",
            "Condition": {
                "StringEquals": {
                    "aws:RequestTag/Team": "Yellow"
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": "sso:TagResource",
            "Resource": "arn:aws:sso:::permissionSet/ssoins-<sso-ins-id>/*",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/Team": "Yellow",
                    "aws:RequestTag/Team": "Yellow"
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": [
                "sso:CreateAccountAssignment",
                "sso:DeleteAccountAssignment",
                "sso:ProvisionPermissionSet"
            ],
            "Resource": [
                "arn:aws:sso:::instance/ssoins-<sso-ins-id>",
                "arn:aws:sso:::account/556677889900",
                "arn:aws:sso:::account/667788990011",
                "arn:aws:sso:::account/778899001122"
            ]
        },
        {
            "Sid": "InlinePolicy",
            "Effect": "Allow",
            "Action": [
                "sso:GetInlinePolicyForPermissionSet",
                "sso:PutInlinePolicyToPermissionSet",
                "sso:DeleteInlinePolicyFromPermissionSet"
            ],
            "Resource": [
                "arn:aws:sso:::instance/ssoins-<sso-ins-id>"
            ]
        },
        {
            "Sid": "InlinePolicyABAC",
            "Effect": "Allow",
            "Action": [
                "sso:GetInlinePolicyForPermissionSet",
                "sso:PutInlinePolicyToPermissionSet",
                "sso:DeleteInlinePolicyFromPermissionSet"
            ],
            "Resource": "arn:aws:sso:::permissionSet/ssoins-<sso-ins-id>/*",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/Team": "Yellow"
                }
            }
        }
    ]
}

In the preceding policy, the principal is allowed to create new permission sets only with the tag Team: Yellow, and assign only permission sets tagged with Team: Yellow to the AWS accounts with ID 556677889900, 667788990011, and 778899001122.

The principal can only edit the inline policies of the permission sets tagged with Team: Yellow and cannot change the tags of the permission sets that are already tagged for another team.
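
The following minimal boto3 sketch shows what the Yellow team's workflow could look like under this policy: creating a permission set that carries the required Team: Yellow tag and assigning it to one of the team's accounts. The permission set name and the group ID are hypothetical.

import boto3

sso_admin = boto3.client('sso-admin')
instance_arn = 'arn:aws:sso:::instance/ssoins-<sso-ins-id>'

# The Team: Yellow tag satisfies the policy's aws:RequestTag condition.
permission_set = sso_admin.create_permission_set(
    InstanceArn=instance_arn,
    Name='YellowTeamReadOnly',  # hypothetical name
    Description='Permission set managed by the Yellow team',
    Tags=[{'Key': 'Team', 'Value': 'Yellow'}],
)['PermissionSet']

# Assign the new permission set to one of the allowed accounts.
sso_admin.create_account_assignment(
    InstanceArn=instance_arn,
    PermissionSetArn=permission_set['PermissionSetArn'],
    TargetType='AWS_ACCOUNT',
    TargetId='556677889900',
    PrincipalType='GROUP',
    PrincipalId='<identity-store-group-id>',  # placeholder
)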

If you want to use this policy, replace the missing and example resource values with your own IAM Identity Center instance ID, tags, and account numbers.

Note: The policy above assumes that there are no additional statements applying to the principal. If you require additional allow statements, verify that the resulting policy doesn’t create a risk of privilege escalation. You can review Controlling access to AWS resources using tags for additional information.

This policy only allows the delegation of permission sets using inline policies. Customer managed policies are IAM policies that are deployed to and are unique to each AWS account. When you create a permission set with a customer managed policy, you must create an IAM policy with the same name and path in each AWS account where IAM Identity Center assigns the permission set. If the IAM policy doesn’t exist, Identity Center won’t make the account assignment. For more information on how to use customer managed policies with Identity Center, see How to use customer managed policies in AWS IAM Identity Center for advanced use cases.
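
Before Identity Center can assign such a permission set, the referenced policy must already exist in each target account. The following is a minimal boto3 sketch of creating it there; the policy name, path, and document are assumptions and must match the customer managed policy reference in the permission set.

import json
import boto3

# Use credentials for the member account that will receive the assignment.
iam = boto3.client('iam')

policy_document = {
    'Version': '2012-10-17',
    'Statement': [{'Effect': 'Allow', 'Action': 's3:ListAllMyBuckets', 'Resource': '*'}]
}

iam.create_policy(
    PolicyName='YellowTeamBaseline',  # hypothetical name, must match the reference
    Path='/teams/yellow/',            # hypothetical path, must match the reference
    PolicyDocument=json.dumps(policy_document),
)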

You can extend the policy to allow the delegation of customer managed policies with these two statements:

{
    "Sid": "CustomerManagedPolicy",
    "Effect": "Allow",
    "Action": [
        "sso:ListCustomerManagedPolicyReferencesInPermissionSet",
        "sso:AttachCustomerManagedPolicyReferenceToPermissionSet",
        "sso:DetachCustomerManagedPolicyReferenceFromPermissionSet"
    ],
    "Resource": [
        "arn:aws:sso:::instance/ssoins-<sso-ins-id>"
    ]
},
{
    "Sid": "CustomerManagedABAC",
    "Effect": "Allow",
    "Action": [
        "sso:ListCustomerManagedPolicyReferencesInPermissionSet",
        "sso:AttachCustomerManagedPolicyReferenceToPermissionSet",
        "sso:DetachCustomerManagedPolicyReferenceFromPermissionSet"
    ],
    "Resource": "arn:aws:sso:::permissionSet/ssoins-<sso-ins-id>/*",
    "Condition": {
        "StringEquals": {
            "aws:ResourceTag/Team": "Yellow"
        }
    }
}

Note: Both statements are required, as only the resource type PermissionSet supports the condition key aws:ResourceTag/${TagKey}, and the actions listed require access to both the Instance and PermissionSet resource type. See Actions, resources, and condition keys for AWS IAM Identity Center for more information.

Best practices

Here are some best practices to consider when delegating management of permission sets and account assignments:

  • Assign permissions to edit specific permission sets. Allowing roles to edit every permission set could allow that role to edit their own permission set.
  • Only allow administrators to manage groups. Users with rights to edit group membership could add themselves to any group, including a group reserved for organization administrators.

If you’re using IAM Identity Center in a delegated account, you should also be aware of the best practices for delegated administration.

Summary

Organizations can empower teams by delegating the management of permission sets and account assignments in IAM Identity Center. Delegating these actions can allow teams to move faster and reduce the burden on the central identity management team.

The scenario and examples share delegation concepts that can be combined and scaled up within your organization. If you have feedback about this blog post, submit comments in the Comments section. If you have questions, start a new thread on AWS re:Post with the Identity Center tag.

Want more AWS Security news? Follow us on Twitter.

Jake Barker

Jake is a Senior Security Consultant with AWS Professional Services. He loves making security accessible to customers and eating great pizza.

Roberto Migli

Roberto is a Principal Solutions Architect at AWS. Roberto supports global financial services customers, focusing on security and identity and access management. In his free time he enjoys building electronics gadgets and spending time with his family.

Using AWS AppSync and AWS Lake Formation to access a secure data lake through a GraphQL API

Post Syndicated from Rana Dutt original https://aws.amazon.com/blogs/big-data/using-aws-appsync-and-aws-lake-formation-to-access-a-secure-data-lake-through-a-graphql-api/

Data lakes have been gaining popularity for storing vast amounts of data from diverse sources in a scalable and cost-effective way. As the number of data consumers grows, data lake administrators often need to implement fine-grained access controls for different user profiles. They might need to restrict access to certain tables or columns depending on the type of user making the request. Also, businesses sometimes want to make data available to external applications but aren’t sure how to do so securely. To address these challenges, organizations can turn to GraphQL and AWS Lake Formation.

GraphQL provides a powerful, secure, and flexible way to query and retrieve data. AWS AppSync is a service for creating GraphQL APIs that can query multiple databases, microservices, and APIs from one unified GraphQL endpoint.

Data lake administrators can use Lake Formation to govern access to data lakes. Lake Formation offers fine-grained access controls for managing user and group permissions at the table, column, and cell level, helping ensure data security and compliance. Additionally, Lake Formation integrates with other AWS services, such as Amazon Athena, making it ideal for querying data lakes through APIs.

In this post, we demonstrate how to build an application that can extract data from a data lake through a GraphQL API and deliver the results to different types of users based on their specific data access privileges. The example application described in this post was built by AWS Partner NETSOL Technologies.

Solution overview

Our solution uses Amazon Simple Storage Service (Amazon S3) to store the data, AWS Glue Data Catalog to house the schema of the data, and Lake Formation to provide governance over the AWS Glue Data Catalog objects by implementing role-based access. We also use Amazon EventBridge to capture events in our data lake and launch downstream processes. The solution architecture is shown in the following diagram.

Figure 1 – Solution architecture

The following is a step-by-step description of the solution:

  1. The data lake is created in an S3 bucket registered with Lake Formation. Whenever new data arrives, an EventBridge rule is invoked.
  2. The EventBridge rule runs an AWS Lambda function to start an AWS Glue crawler to discover new data and update any schema changes so that the latest data can be queried.
    Note: AWS Glue crawlers can also be launched directly from Amazon S3 events, as described in this blog post.
  3. AWS Amplify allows users to sign in using Amazon Cognito as an identity provider. Cognito authenticates the user’s credentials and returns access tokens.
  4. Authenticated users invoke an AWS AppSync GraphQL API through Amplify, fetching data from the data lake. A Lambda function is run to handle the request.
  5. The Lambda function retrieves the user details from Cognito and assumes the AWS Identity and Access Management (IAM) role associated with the requesting user’s Cognito user group.
  6. The Lambda function then runs an Athena query against the data lake tables and returns the results to AWS AppSync, which then returns the results to the user.

Prerequisites

To deploy this solution, you must first clone the solution repository to your local machine:

git clone git@github.com:aws-samples/aws-appsync-with-lake-formation.git
cd aws-appsync-with-lake-formation

Prepare Lake Formation permissions

Sign in to the Lake Formation console and add yourself as an administrator. If you're signing in to Lake Formation for the first time, you can do this by selecting Add myself on the Welcome to Lake Formation screen and choosing Get started as shown in Figure 2.

Figure 2 – Add yourself as the Lake Formation administrator

Otherwise, you can choose Administrative roles and tasks in the left navigation bar and choose Manage Administrators to add yourself. You should see your IAM username under Data lake administrators with Full access when done.

Select Data catalog settings in the left navigation bar and make sure the two IAM access control boxes are not selected, as shown in Figure 3. You want Lake Formation, not IAM, to control access to new databases.

Figure 3 – Lake Formation data catalog settings

Deploy the solution

To create the solution in your AWS environment, launch the AWS CloudFormation stack by choosing Launch CloudFormation Stack.

The following resources will be launched through the CloudFormation template:

  • Amazon VPC and networking components (subnets, security groups, and NAT gateway)
  • IAM roles
  • Lake Formation encapsulating S3 bucket, AWS Glue crawler, and AWS Glue database
  • Lambda functions
  • Cognito user pool
  • AWS AppSync GraphQL API
  • EventBridge rules

After the required resources have been deployed from the CloudFormation stack, you must create two Lambda functions and upload the dataset to Amazon S3. Lake Formation will govern the data lake that is stored in the S3 bucket.

Create the Lambda functions

Whenever a new file is placed in the designated S3 bucket, an EventBridge rule is invoked, which launches a Lambda function to initiate the AWS Glue crawler. The crawler updates the AWS Glue Data Catalog to reflect any changes to the schema.
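
For orientation, a crawler-starting Lambda handler along these lines might look like the following sketch. The actual function code is in the repository's lambdas/crawler-lambda folder; the CRAWLER_NAME environment variable used here is an assumption.

import os
import boto3

glue_client = boto3.client('glue')

def lambda_handler(event, context):
    crawler_name = os.environ['CRAWLER_NAME']  # assumed environment variable
    try:
        # Start the crawler so the Data Catalog picks up the new objects.
        glue_client.start_crawler(Name=crawler_name)
        return {'status': f'Started crawler {crawler_name}'}
    except glue_client.exceptions.CrawlerRunningException:
        # The crawler is already processing an earlier batch of files.
        return {'status': f'Crawler {crawler_name} is already running'}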

When the application makes a query for data through the GraphQL API, a request handler Lambda function is invoked to process the query and return the results.

To create these two Lambda functions, proceed as follows.

  1. Sign in to the Lambda console.
  2. Select the crawler Lambda function named dl-dev-crawlerLambdaFunction.
  3. Find the crawler Lambda function file in your lambdas/crawler-lambda folder in the git repo that you cloned to your local machine.
  4. Copy and paste the code in that file to the Code section of the dl-dev-crawlerLambdaFunction in your Lambda console. Then choose Deploy to deploy the function.

Figure 4 – Copy and paste code into the Lambda function

  5. Repeat steps 2 through 4 for the request handler function named dl-dev-requestHandlerLambdaFunction using the code in lambdas/request-handler-lambda.

Create a layer for the request handler Lambda

You now must upload some additional library code needed by the request handler Lambda function.

  1. Select Layers in the left menu and choose Create layer.
  2. Enter a name such as appsync-lambda-layer.
  3. Download this package layer ZIP file to your local machine.
  4. Upload the ZIP file using the Upload button on the Create layer page.
  5. Choose Python 3.7 as the runtime for the layer.
  6. Choose Create.
  7. Select Functions on the left menu and select the dl-dev-requestHandler Lambda function.
  8. Scroll down to the Layers section and choose Add a layer.
  9. Select the Custom layers option and then select the layer you created above.
  10. Choose Add.

Upload the data to Amazon S3

Navigate to the root directory of the cloned git repository and run the following commands to upload the sample dataset. Replace the bucket_name placeholder with the name of the S3 bucket provisioned by the CloudFormation template. You can get the bucket name from the CloudFormation console by going to the Outputs tab and finding the value for the key datalakes3bucketName, as shown in the following image.

Figure 5 – S3 bucket name shown in CloudFormation Outputs tab

Enter the following commands in your project folder in your local machine to upload the dataset to the S3 bucket.

cd dataset
aws s3 cp . s3://bucket_name/ --recursive

Now let’s take a look at the deployed artifacts.

Data lake

The S3 bucket holds sample data for two entities: companies and their respective owners. The bucket is registered with Lake Formation, as shown in Figure 6. This enables Lake Formation to create and manage data catalogs and manage permissions on the data.

Figure 6 – Lake Formation console showing data lake location

Figure 6 – Lake Formation console showing data lake location

A database is created to hold the schema of the data present in Amazon S3. An AWS Glue crawler is used to update the schema for any changes in the S3 bucket. The crawler is granted permission to CREATE, ALTER, and DROP tables in the database using Lake Formation.

Apply data lake access controls

Two IAM roles are created, dl-us-east-1-developer and dl-us-east-1-business-analyst, each assigned to a different Cognito user group. Each role is assigned different authorizations through Lake Formation. The Developer role gains access to every column in the data lake, while the Business Analyst role is only granted access to the non-personally identifiable information (PII) columns.
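
For reference, the kind of Lake Formation grant behind the Business Analyst role can be expressed with boto3 as shown in the following sketch: SELECT on the table with the PII columns excluded. The database, table, column, and account values are placeholders rather than the exact names created by the CloudFormation template.

import boto3

lakeformation = boto3.client('lakeformation')

lakeformation.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': 'arn:aws:iam::111122223333:role/dl-us-east-1-business-analyst'},
    Resource={
        'TableWithColumns': {
            'DatabaseName': 'dl_dev_database',   # placeholder database name
            'Name': 'companies',                 # placeholder table name
            'ColumnWildcard': {'ExcludedColumnNames': ['first_name', 'last_name']}  # assumed PII columns
        }
    },
    Permissions=['SELECT'],
)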

Figure 7 – Lake Formation console data lake permissions assigned to group roles

GraphQL schema

The GraphQL API is viewable from the AWS AppSync console. The Companies type includes several attributes describing the owners of the companies.

Figure 8 – Schema for GraphQL API

The data source for the GraphQL API is a Lambda function, which handles the requests.

Figure 9 – AWS AppSync data source mapped to Lambda function

Handling the GraphQL API requests

The GraphQL API request handler Lambda function retrieves the Cognito user pool ID from the environment variables. Using the boto3 library, you create a Cognito client and use the get_group method to obtain the IAM role associated with the Cognito user group.
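
The snippets that follow assume client setup along these lines; the environment variable name is an assumption, not necessarily the repository's exact code.

import os
import boto3

cognito_user_pool_id = os.environ['COGNITO_USER_POOL_ID']  # assumed variable name
cognito_idp_client = boto3.client('cognito-idp')
sts_client = boto3.client('sts')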

You use a helper function in the Lambda function to obtain the role.

def get_cognito_group_role(group_name):
    response = cognito_idp_client.get_group(
        GroupName=group_name,
        UserPoolId=cognito_user_pool_id
    )
    print(response)
    role_arn = response.get('Group').get('RoleArn')
    return role_arn

Using the AWS Security Token Service (AWS STS) through a boto3 client, you can assume the IAM role and obtain the temporary credentials you need to run the Athena query.

def get_temp_creds(role_arn):
    response = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName='stsAssumeRoleAthenaQuery',
    )
    return (response['Credentials']['AccessKeyId'],
            response['Credentials']['SecretAccessKey'],
            response['Credentials']['SessionToken'])

We pass the temporary credentials as parameters when creating our Boto3 Amazon Athena client.

athena_client = boto3.client('athena', aws_access_key_id=access_key, aws_secret_access_key=secret_key, aws_session_token=session_token)

The client and query are passed into our Athena query helper function, which runs the query and returns a query ID. With the query ID, we can read the results from Amazon S3 and bundle them as a list of Python dictionaries to be returned in the response.
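
One possible shape of that helper is sketched below: it starts the query, polls until Athena finishes, and returns the query ID together with the output location. The exact implementation in the repository may differ.

import time

def run_athena_query(athena_client, query, database, output_location):
    query_id = athena_client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={'Database': database},
        ResultConfiguration={'OutputLocation': output_location},
    )['QueryExecutionId']
    # Poll until the query reaches a terminal state.
    while True:
        state = athena_client.get_query_execution(QueryExecutionId=query_id)['QueryExecution']['Status']['State']
        if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
            break
        time.sleep(1)
    if state != 'SUCCEEDED':
        raise RuntimeError(f'Athena query ended in state {state}')
    return query_id, output_location

The query's output location is then read by the get_query_result helper that follows.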

def get_query_result(s3_client, output_location):
    bucket, object_key_path = get_bucket_and_path(output_location)
    response = s3_client.get_object(Bucket=bucket, Key=object_key_path)
    status = response.get("ResponseMetadata", {}).get("HTTPStatusCode")
    result = []
    if status == 200:
        print(f"Successful S3 get_object response. Status - {status}")
        df = pandas.read_csv(response.get("Body"))
        df = df.fillna('')
        result = df.to_dict('records')
        print(result)
    else:
        print(f"Unsuccessful S3 get_object response. Status - {status}")
    return result

Enabling client-side access to the data lake

On the client side, AWS Amplify is configured with an Amazon Cognito user pool for authentication. We’ll navigate to the Amazon Cognito console to view the user pool and groups that were created.

Figure 10 – Amazon Cognito user pools

For our sample application we have two groups in our user pool:

  • dl-dev-businessAnalystUserGroup – Business analysts with limited permissions.
  • dl-dev-developerUserGroup – Developers with full permissions.

If you explore these groups, you'll see an IAM role associated with each. This is the IAM role that is assigned to the user when they authenticate. The request handler Lambda function assumes this role when querying the data lake with Athena.

If you view the permissions for this IAM role, you’ll notice that it doesn’t include access controls below the table level. You need the additional layer of governance provided by Lake Formation to add fine-grained access control.

After the user is verified and authenticated by Cognito, Amplify uses access tokens to invoke the AWS AppSync GraphQL API and fetch the data. Based on the user’s group, a Lambda function assumes the corresponding Cognito user group role. Using the assumed role, an Athena query is run and the result returned to the user.

Create test users

Create two users, one for dev and one for business analyst, and add them to user groups.

  1. Navigate to Amazon Cognito and select the user pool dl-dev-cognitoUserPool that was created.
  2. Choose Create user and provide the details to create a new business analyst user. The username can be biz-analyst. Leave the email address blank, and enter a password.
  3. Select the Users tab and select the user you just created.
  4. Add this user to the business analyst group by choosing the Add user to group button.
  5. Follow the same steps to create another user with the username developer and add the user to the developers group.

Test the solution

To test your solution, launch the React application on your local machine.

  1. In the cloned project directory, navigate to the react-app directory.
  2. Install the project dependencies.
npm install
  3. Install the Amplify CLI:
npm install -g @aws-amplify/cli
  4. Create a new file called .env by running the following commands. Then use a text editor to update the environment variable values in the file.
echo export REACT_APP_APPSYNC_URL=Your AppSync endpoint URL > .env
echo export REACT_APP_CLIENT_ID=Your Cognito app client ID >> .env
echo export REACT_APP_USER_POOL_ID=Your Cognito user pool ID >> .env

Use the Outputs tab of your CloudFormation stack in the console to get the required values for the following keys:

  • REACT_APP_APPSYNC_URL – appsyncApiEndpoint
  • REACT_APP_CLIENT_ID – cognitoUserPoolClientId
  • REACT_APP_USER_POOL_ID – cognitoUserPoolId
  5. Add the preceding variables to your environment.
source .env
  6. Generate the code needed to interact with the API using Amplify CodeGen. In the Outputs tab of your CloudFormation console, find your AWS AppSync API ID next to the appsyncApiId key.
amplify add codegen --apiId <appsyncApiId>

Accept all the default options for the above command by pressing Enter at each prompt.

  7. Start the application.
npm start

You can confirm that the application is running by visiting http://localhost:3000 and signing in as the developer user you created earlier.

Now that you have the application running, let’s take a look at how each role is served from the companies endpoint.

First, sign in as the developer user, which has access to all the fields, and make the API request to the companies endpoint. Note which fields you have access to.

Figure 11 – The results for the developer role

Now, sign in as the business analyst user and make the request to the same endpoint and compare the included fields.

Figure 12 – The results for the business analyst role

The First Name and Last Name columns of the companies list are excluded in the business analyst view even though you made the request to the same endpoint. This demonstrates the power of using one unified GraphQL endpoint together with multiple Cognito user group IAM roles mapped to Lake Formation permissions to manage role-based access to your data.

Cleaning up

After you’re done testing the solution, clean up the following resources to avoid incurring future charges:

  1. Empty the S3 buckets created by the CloudFormation template.
  2. Delete the CloudFormation stack to remove the S3 buckets and other resources.

Conclusion

In this post, we showed you how to securely serve data in a data lake to authenticated users of a React application based on their role-based access privileges. To accomplish this, you used GraphQL APIs in AWS AppSync, fine-grained access controls from Lake Formation, and Cognito for authenticating users by group and mapping them to IAM roles. You also used Athena to query the data.

For related reading on this topic, see Visualizing big data with AWS AppSync, Amazon Athena, and AWS Amplify and Design a data mesh architecture using AWS Lake Formation and AWS Glue.

Will you implement this approach for serving data from your data lake? Let us know in the comments!


About the Authors

Rana Dutt is a Principal Solutions Architect at Amazon Web Services. He has a background in architecting scalable software platforms for financial services, healthcare, and telecom companies, and is passionate about helping customers build on AWS.

Ranjith Rayaprolu is a Senior Solutions Architect at AWS working with customers in the Pacific Northwest. He helps customers design and operate Well-Architected solutions in AWS that address their business problems and accelerate the adoption of AWS services. He focuses on AWS security and networking technologies to develop solutions in the cloud across different industry verticals. Ranjith lives in the Seattle area and loves outdoor activities.

Justin Leto is a Sr. Solutions Architect at Amazon Web Services with specialization in databases, big data analytics, and machine learning. His passion is helping customers achieve better cloud adoption. In his spare time, he enjoys offshore sailing and playing jazz piano. He lives in New York City with his wife and baby daughter.

Simplify data transfer: Google BigQuery to Amazon S3 using Amazon AppFlow

Post Syndicated from Kartikay Khator original https://aws.amazon.com/blogs/big-data/simplify-data-transfer-google-bigquery-to-amazon-s3-using-amazon-appflow/

In today’s data-driven world, the ability to effortlessly move and analyze data across diverse platforms is essential. Amazon AppFlow, a fully managed data integration service, has been at the forefront of streamlining data transfer between AWS services, software as a service (SaaS) applications, and now Google BigQuery. In this blog post, you explore the new Google BigQuery connector in Amazon AppFlow and discover how it simplifies the process of transferring data from Google’s data warehouse to Amazon Simple Storage Service (Amazon S3), providing significant benefits for data professionals and organizations, including the democratization of multi-cloud data access.

Overview of Amazon AppFlow

Amazon AppFlow is a fully managed integration service that you can use to securely transfer data between SaaS applications such as Google BigQuery, Salesforce, SAP, Hubspot, and ServiceNow, and AWS services such as Amazon S3 and Amazon Redshift, in just a few clicks. With Amazon AppFlow, you can run data flows at nearly any scale at the frequency you choose—on a schedule, in response to a business event, or on demand. You can configure data transformation capabilities such as filtering and validation to generate rich, ready-to-use data as part of the flow itself, without additional steps. Amazon AppFlow automatically encrypts data in motion, and allows you to restrict data from flowing over the public internet for SaaS applications that are integrated with AWS PrivateLink, reducing exposure to security threats.

Introducing the Google BigQuery connector

The new Google BigQuery connector in Amazon AppFlow unveils possibilities for organizations seeking to use the analytical capability of Google’s data warehouse, and to effortlessly integrate, analyze, store, or further process data from BigQuery, transforming it into actionable insights.

Architecture

Let’s review the architecture to transfer data from Google BigQuery to Amazon S3 using Amazon AppFlow.

  1. Select a data source: In Amazon AppFlow, select Google BigQuery as your data source. Specify the tables or datasets you want to extract data from.
  2. Field mapping and transformation: Configure the data transfer using the intuitive visual interface of Amazon AppFlow. You can map data fields and apply transformations as needed to align the data with your requirements.
  3. Transfer frequency: Decide how frequently you want to transfer data—such as daily, weekly, or monthly—supporting flexibility and automation.
  4. Destination: Specify an S3 bucket as the destination for your data. Amazon AppFlow will efficiently move the data, making it accessible in your Amazon S3 storage.
  5. Consumption: Use Amazon Athena to analyze the data in Amazon S3.

Prerequisites

The dataset used in this solution is generated by Synthea, a synthetic patient population simulator and open-source project under the Apache License 2.0. Load this data into Google BigQuery or use your existing dataset.

Connect Amazon AppFlow to your Google BigQuery account

For this post, you use a Google account, OAuth client with appropriate permissions, and Google BigQuery data. To enable Google BigQuery access from Amazon AppFlow, you must set up a new OAuth client in advance. For instructions, see Google BigQuery connector for Amazon AppFlow.

Set up Amazon S3

Every object in Amazon S3 is stored in a bucket. Before you can store data in Amazon S3, you must create an S3 bucket to store the results.

Create a new S3 bucket for Amazon AppFlow results

To create an S3 bucket, complete the following steps:

  1. On the AWS Management Console for Amazon S3, choose Create bucket.
  2. Enter a globally unique name for your bucket; for example, appflow-bq-sample.
  3. Choose Create bucket.

Create a new S3 bucket for Amazon Athena results

To create an S3 bucket, complete the following steps:

  1. On the AWS Management Console for Amazon S3, choose Create bucket.
  2. Enter a globally unique name for your bucket; for example, athena-results.
  3. Choose Create bucket.

User role (IAM role) for AWS Glue Data Catalog

To catalog the data that you transfer with your flow, you must have the appropriate user role in AWS Identity and Access Management (IAM). You provide this role to Amazon AppFlow to grant the permissions it needs to create an AWS Glue Data Catalog, tables, databases, and partitions.

For an example IAM policy that has the required permissions, see Identity-based policy examples for Amazon AppFlow.

Walkthrough of the design

Now, let's walk through a practical use case to see how the Amazon AppFlow Google BigQuery to Amazon S3 connector works. For the use case, you will use Amazon AppFlow to archive historical data from Google BigQuery to Amazon S3 for long-term storage and analysis.

Set up Amazon AppFlow

Create a new Amazon AppFlow flow to transfer data from Google BigQuery to Amazon S3.

  1. On the Amazon AppFlow console, choose Create flow.
  2. Enter a name for your flow; for example, my-bq-flow.
  3. Add necessary Tags; for example, for Key enter env and for Value enter dev.

  4. Choose Next.
  5. For Source name, choose Google BigQuery.
  6. Choose Create new connection.
  7. Enter your OAuth Client ID and Client Secret, then name your connection; for example, bq-connection.

  8. In the pop-up window, choose to allow amazon.com access to the Google BigQuery API.

  9. For Choose Google BigQuery object, choose Table.
  10. For Choose Google BigQuery subobject, choose BigQueryProjectName.
  11. For Choose Google BigQuery subobject, choose DatabaseName.
  12. For Choose Google BigQuery subobject, choose TableName.
  13. For Destination name, choose Amazon S3.
  14. For Bucket details, choose the Amazon S3 bucket you created for storing Amazon AppFlow results in the prerequisites.
  15. Enter raw as a prefix.

  16. Next, provide AWS Glue Data Catalog settings to create a table for further analysis.
    1. Select the User role (IAM role) created in the prerequisites.
    2. Create a new database; for example, healthcare.
    3. Provide a table prefix setting; for example, bq.

  17. Select Run on demand.

  18. Choose Next.
  19. Select Manually map fields.
  20. Select the following six fields for Source field name from the table Allergies:
    1. Start
    2. Patient
    3. Code
    4. Description
    5. Type
    6. Category
  21. Choose Map fields directly.

  22. Choose Next.
  23. In the Add filters section, choose Next.
  24. Choose Create flow.

Run the flow

After creating your new flow, you can run it on demand.

  1. On the Amazon AppFlow console, choose my-bq-flow.
  2. Choose Run flow.

For this walkthrough, you run the job on demand for ease of understanding. In practice, you can choose a scheduled job and periodically extract only newly added data.

Query through Amazon Athena

When you select the optional AWS Glue Data Catalog settings, Data Catalog creates the catalog for the data, allowing Amazon Athena to perform queries.

If you’re prompted to configure a query results location, navigate to the Settings tab and choose Manage. Under Manage settings, choose the Athena results bucket created in prerequisites and choose Save.

  1. On the Amazon Athena console, select the Data Source as AWSDataCatalog.
  2. Next, select Database as healthcare.
  3. Now you can select the table created by the AWS Glue crawler and preview it.

  4. You can also run a custom query to find the top 10 allergies, as shown in the following query.

Note: In the following query, replace the table name (in this case, bq_appflow_mybqflow_1693588670_latest) with the name of the table generated in your AWS account.

SELECT type,
category,
"description",
count(*) as number_of_cases
FROM "healthcare"."bq_appflow_mybqflow_1693588670_latest"
GROUP BY type,
category,
"description"
ORDER BY number_of_cases DESC
LIMIT 10;

  5. Choose Run query.

This result shows the top 10 allergies by number of cases.
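
If you prefer to run the same query programmatically instead of in the Athena console, the following minimal boto3 sketch shows one way to do it. The table name and the results bucket are placeholders; replace them with the values from your account.

import time
import boto3

athena = boto3.client('athena')

QUERY = '''
SELECT type, category, "description", count(*) AS number_of_cases
FROM "healthcare"."bq_appflow_mybqflow_1693588670_latest"
GROUP BY type, category, "description"
ORDER BY number_of_cases DESC
LIMIT 10
'''

query_id = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={'Database': 'healthcare'},
    ResultConfiguration={'OutputLocation': 's3://athena-results/'},  # placeholder bucket
)['QueryExecutionId']

# Wait for the query to reach a terminal state before fetching results.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)['QueryExecution']['Status']['State']
    if state not in ('QUEUED', 'RUNNING'):
        break
    time.sleep(1)

if state == 'SUCCEEDED':
    rows = athena.get_query_results(QueryExecutionId=query_id)['ResultSet']['Rows']
    for row in rows[1:]:  # skip the header row
        print([col.get('VarCharValue', '') for col in row['Data']])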

Clean up

To avoid incurring charges, clean up the resources in your AWS account by completing the following steps:

  1. On the Amazon AppFlow console, choose Flows in the navigation pane.
  2. From the list of flows, select the flow my-bq-flow, and delete it.
  3. Enter delete to delete the flow.
  4. Choose Connections in the navigation pane.
  5. Choose Google BigQuery from the list of connectors, select bq-connection, and delete it.
  6. Enter delete to delete the connector.
  7. On the IAM console, choose Roles in the navigation pane, then select the role you created for the AWS Glue crawler and delete it.
  8. On the Amazon Athena console:
    1. Delete the tables created under the database healthcare using AWS Glue crawler.
    2. Drop the database healthcare.
  9. On the Amazon S3 console, search for the Amazon AppFlow results bucket you created, choose Empty to delete the objects, then delete the bucket.
  10. On the Amazon S3 console, search for the Amazon Athena results bucket you created, choose Empty to delete the objects, then delete the bucket.
  11. Clean up resources in your Google account by deleting the project that contains the Google BigQuery resources. Follow the documentation to clean up the Google resources.

Conclusion

The Google BigQuery connector in Amazon AppFlow streamlines the process of transferring data from Google’s data warehouse to Amazon S3. This integration simplifies analytics and machine learning, archiving, and long-term storage, providing significant benefits for data professionals and organizations seeking to harness the analytical capabilities of both platforms.

With Amazon AppFlow, the complexities of data integration are eliminated, enabling you to focus on deriving actionable insights from your data. Whether you’re archiving historical data, performing complex analytics, or preparing data for machine learning, this connector simplifies the process, making it accessible to a broader range of data professionals.

If you're interested in seeing how data transfer from Google BigQuery to Amazon S3 works with Amazon AppFlow, take a look at the step-by-step video tutorial. In this tutorial, we walk through the entire process, from setting up the connection to running the data transfer flow. For more information on Amazon AppFlow, visit Amazon AppFlow.


About the authors

Kartikay Khator is a Solutions Architect on the Global Life Sciences team at Amazon Web Services. He is passionate about helping customers on their cloud journey with a focus on AWS analytics services. He is an avid runner and enjoys hiking.

Kamen Sharlandjiev is a Sr. Big Data and ETL Solutions Architect and Amazon AppFlow expert. He's on a mission to make life easier for customers who are facing complex data integration challenges. His secret weapon? Fully managed, low-code AWS services that can get the job done with minimal effort and no coding.