Tag Archives: Intermediate (200)

Amazon Managed Service for Apache Flink now supports Apache Flink version 1.18

Post Syndicated from Lorenzo Nicora original https://aws.amazon.com/blogs/big-data/amazon-managed-service-for-apache-flink-now-supports-apache-flink-version-1-18/

Apache Flink is an open source distributed processing engine, offering powerful programming interfaces for both stream and batch processing, with first-class support for stateful processing and event time semantics. Apache Flink supports multiple programming languages (Java, Python, Scala, and SQL) and multiple APIs with different levels of abstraction, which can be used interchangeably in the same application.

Amazon Managed Service for Apache Flink, which offers a fully managed, serverless experience in running Apache Flink applications, now supports Apache Flink 1.18.1, the latest version of Apache Flink at the time of writing.

In this post, we discuss some of the interesting new features and capabilities of Apache Flink, introduced with the most recent major releases, 1.16, 1.17, and 1.18, and now supported in Managed Service for Apache Flink.

New connectors

Before we dive into the new functionalities of Apache Flink available with version 1.18.1, let’s explore the new capabilities that come from the availability of many new open source connectors.

OpenSearch

A dedicated OpenSearch connector is now available to be included in your projects, enabling an Apache Flink application to write data directly into OpenSearch, without relying on Elasticsearch compatibility mode. This connector is compatible with both provisioned Amazon OpenSearch Service domains and Amazon OpenSearch Serverless.

This new connector supports the SQL and Table APIs, working with both Java and Python, and the DataStream API, for Java only. Out of the box, it provides at-least-once guarantees, synchronizing writes with Flink checkpointing. You can achieve exactly-once semantics using deterministic IDs and the upsert method.

By default, the connector uses OpenSearch version 1.x client libraries. You can switch to version 2.x by adding the correct dependencies.
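
The following is a minimal PyFlink sketch of what an OpenSearch sink table could look like with the SQL and Table APIs. The endpoint, index, and columns are placeholders, and the connector options shown ('opensearch', 'hosts', 'index') are assumptions to verify against the connector documentation for your version; the connector JAR must also be packaged with the application.

from pyflink.table import EnvironmentSettings, TableEnvironment

# Create a streaming Table environment
table_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Hypothetical sink table: the primary key gives deterministic document IDs,
# enabling the upsert-based exactly-once behavior described above
table_env.execute_sql("""
    CREATE TABLE orders_sink (
        order_id STRING,
        total    DOUBLE,
        PRIMARY KEY (order_id) NOT ENFORCED
    ) WITH (
        'connector' = 'opensearch',
        'hosts' = 'https://my-domain.eu-west-1.es.amazonaws.com:443',
        'index' = 'orders'
    )
""")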

Amazon DynamoDB

Apache Flink developers can now use a dedicated connector to write data into Amazon DynamoDB. This connector is based on the Apache Flink AsyncSink, developed by AWS and now an integral part of the Apache Flink project, to simplify the implementation of efficient sink connectors, using non-blocking write requests and adaptive batching.

This connector also supports the SQL and Table APIs, for Java and Python, and the DataStream API, for Java only. By default, the sink writes in batches to optimize throughput. A notable feature of the SQL version is support for the PARTITIONED BY clause. By specifying one or more keys, you can achieve some client-side deduplication, only sending the latest record per key with each batch write. You can achieve the equivalent with the DataStream API by specifying a list of partition keys to overwrite within each batch.
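
As an illustration, the following hedged PyFlink sketch declares a DynamoDB sink table with the SQL and Table APIs, including the PARTITIONED BY clause described above. The table name, Region, and columns are placeholders, and the option names ('dynamodb', 'table-name', 'aws.region') should be checked against the connector documentation.

from pyflink.table import EnvironmentSettings, TableEnvironment

table_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Hypothetical sink table: PARTITIONED BY (device_id) deduplicates on the
# client side, keeping only the latest record per key in each batch write
table_env.execute_sql("""
    CREATE TABLE sensor_readings_sink (
        device_id  STRING,
        reading    DOUBLE,
        event_time TIMESTAMP(3)
    ) PARTITIONED BY (device_id)
    WITH (
        'connector' = 'dynamodb',
        'table-name' = 'SensorReadings',
        'aws.region' = 'eu-west-1'
    )
""")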

This connector only works as a sink. You cannot use it for reading from DynamoDB. To look up data in DynamoDB, you still need to implement a lookup using the Flink Async I/O API or, for SQL, by implementing a custom user-defined function (UDF).

MongoDB

Another interesting connector is for MongoDB. In this case, both source and sink are available, for both the SQL and Table APIs and DataStream API. The new connector is now officially part of the Apache Flink project and supported by the community. This new connector replaces the old one provided by MongoDB directly, which only supports older Flink Sink and Source APIs.

As for other data store connectors, the source can either be used as a bounded source, in batch mode, or for lookups. The sink works both in batch mode and streaming, supporting both upsert and append mode.

Among the many notable features of this connector, one that’s worth mentioning is the ability to enable caching when using the source for lookups. Out of the box, the sink supports at-least-once guarantees. When a primary key is defined, the sink also supports exactly-once semantics via idempotent upserts.

New connector versioning

Not a new feature, but an important factor to consider when updating an older Apache Flink application, is the new connector versioning. Starting from Apache Flink version 1.17, most connectors have been externalized from the main Apache Flink distribution and follow independent versioning.

To include the right dependency, you need to specify the artifact version with the form: <connector-version>-<flink-version>

For example, the latest Kafka connector, which also works with Amazon Managed Streaming for Apache Kafka (Amazon MSK), is version 3.1.0 at the time of writing. If you are using Apache Flink 1.18, the dependency to use will be the following:

<dependency> 
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka</artifactId> 
    <version>3.1.0-1.18</version>
</dependency>

For Amazon Kinesis, the new connector version is 4.2.0. The dependency for Apache Flink 1.18 will be the following:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kinesis</artifactId> 
    <version>4.2.0-1.18</version>
</dependency>

In the following sections, we discuss more of the powerful new features now available in Apache Flink 1.18 and supported in Amazon Managed Service for Apache Flink.

SQL

In Apache Flink SQL, users can provide hints to join queries to suggest a strategy to the optimizer and influence the query plan. In particular, in streaming applications, lookup joins are used to enrich a table, representing streaming data, with data that is queried from an external system, typically a database. Since version 1.16, several improvements have been introduced for lookup joins, allowing you to adjust the behavior of the join and improve performance:

  • Lookup cache is a powerful feature, allowing you to cache in-memory the most frequently used records, reducing the pressure on the database. Previously, lookup cache was specific to some connectors. Since Apache Flink 1.16, this option has become available to all connectors internally supporting lookup (FLIP-221). As of this writing, JDBC, Hive, and HBase connectors support lookup cache. Lookup cache has three available modes: FULL, for a small dataset that can be held entirely in memory, PARTIAL, for a large dataset, only caching the most recent records, or NONE, to completely disable cache. For PARTIAL cache, you can also configure the number of rows to buffer and the time-to-live.
  • Async lookup is another feature that can greatly improve performance. Async lookup provides in Apache Flink SQL a functionality similar to Async I/O available in the DataStream API. It allows Apache Flink to emit new requests to the database without blocking the processing thread until responses to previous lookups have been received. Similarly to Async I/O, you can configure async lookup to enforce ordering or allow unordered results, or adjust the buffer capacity and the timeout.
  • You can also configure a lookup retry strategy in combination with PARTIAL or NONE lookup cache, to configure the behavior in case of a failed lookup in the external database.

All these behaviors can be controlled using a LOOKUP hint, like in the following example, where we show a lookup join using async lookup:

SELECT 
    /*+ LOOKUP('table'='Customers', 'async'='true', 'output-mode'='allow_unordered') */ 
    O.order_id, O.total, C.address
FROM Orders AS O 
JOIN Customers FOR SYSTEM_TIME AS OF O.proc_time AS C 
  ON O.customer_id = C.customer_id

PyFlink

In this section, we discuss new improvements and support in PyFlink.

Python 3.10 support

The newest Apache Flink versions introduced several improvements for PyFlink users. First and foremost, Python 3.10 is now supported, and Python 3.6 support has been completely removed (FLINK-29421). Managed Service for Apache Flink currently uses the Python 3.10 runtime to run PyFlink applications.

Getting closer to feature parity

From the perspective of the programming API, PyFlink is getting closer to Java with every version. The DataStream API now supports features like side outputs and broadcast state, and gaps in the windowing API have been closed. PyFlink also now supports new connectors like Amazon Kinesis Data Streams directly from the DataStream API.

Thread mode improvements

PyFlink is very efficient. The overhead of running Flink API operators in PyFlink is minimal compared to Java or Scala, because the runtime actually runs the operator implementation in the JVM directly, regardless of the language of your application. But when you have a user-defined function, things are slightly different. A line of Python code as simple as lambda x: x + 1, or as complex as a Pandas function, must run in a Python runtime.

By default, Apache Flink runs a Python runtime on each Task Manager, external to the JVM. Each record is serialized, handed to the Python runtime via inter-process communication, deserialized, and processed in the Python runtime. The result is then serialized and handed back to the JVM, where it’s deserialized. This is the PyFlink PROCESS mode. It’s very stable but it introduces an overhead, and in some cases, it may become a performance bottleneck.

Since version 1.15, Apache Flink also supports THREAD mode for PyFlink. In this mode, Python user-defined functions are run within the JVM itself, removing the serialization/deserialization and inter-process communication overhead. THREAD mode has some limitations; for example, it cannot be used for Pandas functions or UDAFs (user-defined aggregate functions, which combine many input records into one output record), but it can substantially improve the performance of a PyFlink application.

With version 1.16, the support of THREAD mode has been substantially extended, also covering the Python DataStream API.

THREAD mode is supported by Managed Service for Apache Flink, and can be enabled directly from your PyFlink application.
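
The following is a minimal sketch of how enabling THREAD mode can look in a PyFlink Table API application, using the python.execution-mode option. Treat it as an illustration and check the PyFlink configuration documentation for the runtime version you target.

from pyflink.table import EnvironmentSettings, TableEnvironment

table_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Switch Python UDF execution from the default PROCESS mode to THREAD mode;
# unsupported cases (for example, Pandas UDFs) fall back to PROCESS mode
table_env.get_config().set("python.execution-mode", "thread")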

Apple Silicon support

If you use an Apple Silicon-based machine to develop PyFlink applications for PyFlink 1.15, you have probably encountered some of the known Python dependency issues on Apple Silicon. These issues have finally been resolved (FLINK-25188). Before version 1.16, if you wanted to develop a PyFlink application on a machine with an M1, M2, or M3 chipset, you had to use some workarounds, because it was impossible to install PyFlink 1.15 or earlier directly on the machine. These limitations did not affect PyFlink applications running on Managed Service for Apache Flink.

Unaligned checkpoint improvements

Apache Flink 1.15 already supported unaligned checkpoints and buffer debloating. These features can be used, particularly in combination, to improve checkpoint performance, making checkpointing duration more predictable, especially in the presence of backpressure. For more information about these features, see Optimize checkpointing in your Amazon Managed Service for Apache Flink applications with buffer debloating and unaligned checkpoints.

With versions 1.16 and 1.17, several changes have been introduced to improve stability and performance.

Handling data skew

Apache Flink uses watermarks to support event-time semantics. Watermarks are special records, normally injected in the flow from the source operator, that mark the progress of event time for operators like event time windowing aggregations. A common technique is delaying watermarks from the latest observed event time, to allow events to be out of order, at least to some degree.

However, the use of watermarks comes with a challenge. When the application has multiple sources, for example it receives events from multiple partitions of a Kafka topic, watermarks are generated independently for each partition. Internally, each operator always waits for the same watermark on all input partitions, practically aligning it on the slowest partition. The drawback is that if one of the partitions is not receiving data, watermarks don’t progress, increasing the end-to-end latency. For this reason, an optional idleness timeout has been introduced in many streaming sources. After the configured timeout, watermark generation ignores any partition not receiving any record, and watermarks can progress.

You can also face a similar but opposite challenge if one source is receiving events much faster than the others. Watermarks are aligned to the slowest partition, meaning that any windowing aggregation will wait for the watermark. Records from the fast source have to wait, being buffered. This may result in buffering an excessive volume of data, and an uncontrollable growth of operator state.

To address the issue of faster sources, starting with Apache Flink 1.17, you can enable watermark alignment of source splits (FLINK-28853). This mechanism, disabled by default, makes sure that no partition progresses its watermark too fast compared to the other partitions. You enable alignment for each source separately, binding multiple sources, like multiple input topics, together by assigning the same alignment group ID, and configuring the duration of the maximal drift from the current minimal watermark. If one specific partition or source subtask is receiving events too fast, the source operator pauses consuming it until the drift is reduced below the configured threshold.

The following code snippet shows how you can set up watermark alignment of source splits on a Kafka source emitting bounded-out-of-orderness watermarks:

KafkaSource<Event> kafkaSource = ...
DataStream<Event> stream = env.fromSource(
    kafkaSource,
    WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofSeconds(20))
        .withWatermarkAlignment("alignment-group-1", Duration.ofSeconds(20), Duration.ofSeconds(1)),
    "Kafka source");

This feature is only available with FLIP-217 compatible sources, supporting watermark alignment of source splits. As of this writing, among major streaming source connectors, only the Kafka source supports this feature.

Direct support for Protobuf format

The SQL and Table APIs now directly support Protobuf format. To use this format, you need to generate the Protobuf Java classes from the .proto schema definition files and include them as dependencies in your application.

The Protobuf format only works with the SQL and Table APIs, and only to read or write Protobuf-serialized data from a source or to a sink. Currently, Flink doesn’t support using Protobuf to serialize state directly, nor does it support schema evolution for Protobuf, as it does for Avro, for example. You still need to register a custom serializer, with some overhead for your application.
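
As an illustration, the following hedged PyFlink sketch declares a Kafka table that reads Protobuf-serialized records. The topic, bootstrap servers, fields, and generated message class are placeholders, and the format options ('protobuf', 'protobuf.message-class-name') should be verified against the format documentation for your Flink version; the generated Protobuf Java classes must be on the application classpath.

from pyflink.table import EnvironmentSettings, TableEnvironment

table_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Hypothetical source table reading Protobuf records from a Kafka topic
table_env.execute_sql("""
    CREATE TABLE orders_proto (
        order_id STRING,
        total    DOUBLE
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders',
        'properties.bootstrap.servers' = 'my-msk-bootstrap:9098',
        'format' = 'protobuf',
        'protobuf.message-class-name' = 'com.example.protobuf.Order'
    )
""")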

Keeping Apache Flink open source

Apache Flink internally relies on Akka for sending data between subtasks. In 2022, Lightbend, the company behind Akka, announced a license change for future Akka versions, from Apache 2.0 to a more restrictive license, and that Akka 2.6, the version used by Apache Flink, would not receive any further security update or fix.

Although Akka has been historically very stable and doesn’t require frequent updates, this license change represented a risk for the Apache Flink project. The decision of the Apache Flink community was to replace Akka with a fork of version 2.6, called Apache Pekko (FLINK-32468). This fork will retain the Apache 2.0 license and receive any required updates from the community. In the meantime, the Apache Flink community will consider whether to remove the dependency on Akka or Pekko completely.

State compression

Apache Flink offers optional compression (off by default) for all checkpoints and savepoints. The Apache Flink community identified a bug in Flink 1.18.1 where the operator state couldn’t be properly restored when snapshot compression is enabled. This could result in either data loss or the inability to restore from a checkpoint. To resolve this, Managed Service for Apache Flink has backported the fix that will be included in future versions of Apache Flink.

In-place version upgrades with Managed Service for Apache Flink

If you are currently running an application on Managed Service for Apache Flink using Apache Flink 1.15 or older, you can now upgrade it in-place to 1.18 without losing the state, using the AWS Command Line Interface (AWS CLI), AWS CloudFormation or AWS Cloud Development Kit (AWS CDK), or any tool that uses the AWS API.

The UpdateApplication API action now supports updating the Apache Flink runtime version of an existing Managed Service for Apache Flink application. You can use UpdateApplication directly on a running application.

Before proceeding with the in-place update, you need to verify and update the dependencies included in your application, making sure they are compatible with the new Apache Flink version. In particular, you need to update any Apache Flink library, connectors, and possibly Scala version.

Also, we recommend testing the updated application before proceeding with the update. We recommend testing locally and in a non-production environment, using the target Apache Flink runtime version, to ensure no regressions were introduced.

And finally, if your application is stateful, we recommend taking a snapshot of the running application state. This will enable you to roll back to the previous application version.

When you’re ready, you can now use the UpdateApplication API action or update-application AWS CLI command to update the runtime version of the application and point it to the new application artifact, JAR, or zip file, with the updated dependencies.
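
The following is a hedged boto3 sketch of what such an update could look like. The application name, artifact location, and runtime value are placeholders, and you should verify the parameters against the current Managed Service for Apache Flink API reference before running it.

import boto3

# Managed Service for Apache Flink uses the kinesisanalyticsv2 API namespace
kda = boto3.client("kinesisanalyticsv2")

# Look up the current application version, required for the update call
app = kda.describe_application(ApplicationName="my-flink-app")
current_version = app["ApplicationDetail"]["ApplicationVersionId"]

# Update the runtime version and point to the rebuilt artifact with updated dependencies
kda.update_application(
    ApplicationName="my-flink-app",
    CurrentApplicationVersionId=current_version,
    RuntimeEnvironmentUpdate="FLINK-1_18",
    ApplicationConfigurationUpdate={
        "ApplicationCodeConfigurationUpdate": {
            "CodeContentTypeUpdate": "ZIPFILE",
            "CodeContentUpdate": {
                "S3ContentLocationUpdate": {
                    "BucketARNUpdate": "arn:aws:s3:::my-artifacts-bucket",
                    "FileKeyUpdate": "my-flink-app-1.18.jar",
                }
            },
        }
    },
)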

For more detailed information about the process and the API, refer to In-place version upgrade for Apache Flink. The documentation includes step-by-step instructions and a video to guide you through the upgrade process.

Conclusions

In this post, we examined some of the new features of Apache Flink supported in Amazon Managed Service for Apache Flink. This list is not comprehensive. Apache Flink also introduced some very promising features, like operator-level TTL for the SQL and Table API (FLIP-292) and time travel (FLIP-308), but these are not yet supported by the API and not really accessible to users yet. For this reason, we decided not to cover them in this post.

With the support of Apache Flink 1.18, Managed Service for Apache Flink now supports the latest released Apache Flink version. We have seen some of the interesting new features and new connectors available with Apache Flink 1.18 and how Managed Service for Apache Flink helps you upgrade an existing application in place.

You can find more details about recent releases in the Apache Flink blog and release notes.

If you are new to Apache Flink, we recommend our guide to choosing the right API and language and following the getting started guide to start using Managed Service for Apache Flink.


About the Authors

Lorenzo Nicora works as Senior Streaming Solution Architect at AWS, helping customers across EMEA. He has been building cloud-native, data-intensive systems for over 25 years, working in the finance industry both through consultancies and for FinTech product companies. He has leveraged open-source technologies extensively and contributed to several projects, including Apache Flink.

Francisco Morillo is a Streaming Solutions Architect at AWS. Francisco works with AWS customers, helping them design real-time analytics architectures using AWS services, supporting Amazon MSK and Amazon Managed Service for Apache Flink.

Measure performance of AWS Glue Data Quality for ETL pipelines

Post Syndicated from Ruben Afonso original https://aws.amazon.com/blogs/big-data/measure-performance-of-aws-glue-data-quality-for-etl-pipelines/

In recent years, data lakes have become a mainstream architecture, and data quality validation is a critical factor to improve the reusability and consistency of the data. AWS Glue Data Quality reduces the effort required to validate data from days to hours, and provides computing recommendations, statistics, and insights about the resources required to run data validation.

AWS Glue Data Quality is built on Deequ, an open source tool developed and used at Amazon to calculate data quality metrics and verify data quality constraints and changes in the data distribution, so you can focus on describing how data should look instead of implementing algorithms.

In this post, we provide benchmark results of running increasingly complex data quality rulesets over a predefined test dataset. As part of the results, we show how AWS Glue Data Quality provides information about the runtime of extract, transform, and load (ETL) jobs, the resources measured in terms of data processing units (DPUs), and how you can track the cost of running AWS Glue Data Quality for ETL pipelines by defining custom cost reporting in AWS Cost Explorer.

This post is Part 6 of a six-part series of posts to explain how AWS Glue Data Quality works.

Check out the other posts in the series.

Solution overview

We start by defining our test dataset in order to explore how AWS Glue Data Quality automatically scales depending on input datasets.

Dataset details

The test dataset contains 104 columns and 1 million rows stored in Parquet format. You can download the dataset or recreate it locally using the Python script provided in the repository. If you opt to run the generator script, you need to install the Pandas and Mimesis packages in your Python environment:

pip install pandas mimesis

The dataset schema is a combination of numerical, categorical, and string variables in order to have enough attributes to use a combination of built-in AWS Glue Data Quality rule types. The schema replicates some of the most common attributes found in financial market data such as instrument ticker, traded volumes, and pricing forecasts.
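
As a simplified stand-in for the repository’s generator script (which uses Pandas and Mimesis), the following sketch produces a Parquet file with a similar mix of numerical, categorical, and string columns. The column names and value ranges are illustrative only and cover just a handful of the 104 columns in the actual test dataset.

# Requires pandas, numpy, and pyarrow (for Parquet output)
import numpy as np
import pandas as pd

rows = 1_000_000
rng = np.random.default_rng(42)

df = pd.DataFrame({
    "ticker": rng.choice(["AMZN", "AAPL", "MSFT", "GOOG"], size=rows),   # categorical
    "exchange": rng.choice(["NASDAQ", "NYSE", "LSE"], size=rows),        # categorical
    "traded_volume": rng.integers(0, 1_000_000, size=rows),              # numerical
    "price_forecast": rng.normal(100, 15, size=rows).round(2),           # numerical
    "analyst_note": ["note-" + str(i % 1000) for i in range(rows)],      # string
})

df.to_parquet("test_dataset.parquet", index=False)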

Data quality rulesets

We categorize some of the built-in AWS Glue Data Quality rule types to define the benchmark structure. The categories consider whether the rules perform column checks that don’t require row-level inspection (simple rules), row-by-row analysis (medium rules), or data type checks, potentially comparing row values against other data sources (complex rules). The following table summarizes these rules.

Simple Rules Medium Rules Complex Rules
ColumnCount DistinctValuesCount ColumnValues
ColumnDataType IsComplete Completeness
ColumnExist Sum ReferentialIntegrity
ColumnNamesMatchPattern StandardDeviation ColumnCorrelation
RowCount Mean RowCountMatch
ColumnLength

We define eight different AWS Glue ETL jobs where we run the data quality rulesets. Each job has a different number of data quality rules associated to it. Each job also has an associated user-defined cost allocation tag that we use to create a data quality cost report in AWS Cost Explorer later on.

We provide the plain text definition for each ruleset in the following table.

Job name Simple Rules Medium Rules Complex Rules Number of Rules Tag Definition
ruleset-0 0 0 0 0 dqjob:rs0
ruleset-1 0 0 1 1 dqjob:rs1 Link
ruleset-5 3 1 1 5 dqjob:rs5 Link
ruleset-10 6 2 2 10 dqjob:rs10 Link
ruleset-50 30 10 10 50 dqjob:rs50 Link
ruleset-100 50 30 20 100 dqjob:rs100 Link
ruleset-200 100 60 40 200 dqjob:rs200 Link
ruleset-400 200 120 80 400 dqjob:rs400 Link
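
As an illustration of what such a definition looks like, the following is a hypothetical five-rule DQDL ruleset combining simple, medium, and complex rule types from the table above, shown as a Python string the way you would embed it in a Glue job script. The column names refer to the test dataset and are illustrative; the exact rulesets used in the benchmark are the ones linked in the table.

# Hypothetical DQDL definition; not one of the benchmark rulesets
ruleset_example = """
Rules = [
    ColumnCount = 104,
    RowCount > 0,
    IsComplete "ticker",
    Completeness "exchange" > 0.95,
    ColumnValues "traded_volume" >= 0
]
"""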

Create the AWS Glue ETL jobs containing the data quality rulesets

We upload the test dataset to Amazon Simple Storage Service (Amazon S3), along with two additional CSV files (isocodes.csv and exchanges.csv) that we’ll use to evaluate referential integrity rules in AWS Glue Data Quality after they have been added to the AWS Glue Data Catalog. Complete the following steps:

  1. On the Amazon S3 console, create a new S3 bucket in your account and upload the test dataset.
  2. Create a folder in the S3 bucket called isocodes and upload the isocodes.csv file.
  3. Create another folder in the S3 bucket called exchange and upload the exchanges.csv file.
  4. On the AWS Glue console, run two AWS Glue crawlers, one for each folder, to register the CSV content in the AWS Glue Data Catalog (data_quality_catalog). For instructions, refer to Adding an AWS Glue Crawler.

The AWS Glue crawlers generate two tables (exchanges and isocodes) as part of the AWS Glue Data Catalog.

AWS Glue Data Catalog

Now we will create the AWS Identity and Access Management (IAM) role that will be assumed by the ETL jobs at runtime:

  1. On the IAM console, create a new IAM role called AWSGlueDataQualityPerformanceRole.
  2. For Trusted entity type, select AWS service.
  3. For Service or use case, choose Glue.
  4. Choose Next.

AWS IAM trust entity selection

  5. For Permission policies, enter AWSGlueServiceRole.
  6. Choose Next.
    AWS IAM add permissions policies
  7. Create and attach a new inline policy (AWSGlueDataQualityBucketPolicy) with the following content. Replace the placeholder with the S3 bucket name you created earlier:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": "s3:GetObject",
          "Resource": [
            "arn:aws:s3:::<your_Amazon_S3_bucket_name>/*"
          ]
        }
      ]
    }

Next, we create one of the AWS Glue ETL jobs, ruleset-5.

  1. On the AWS Glue console, under ETL jobs in the navigation pane, choose Visual ETL.
  2. In the Create job section, choose Visual ETL.
    Overview of available jobs in AWS Glue Studio
  3. In the Visual Editor, add a Data Source – S3 Bucket source node:
    1. For S3 URL, enter the S3 folder containing the test dataset.
    2. For Data format, choose Parquet.

    Overview of Amazon S3 data source in AWS Glue Studio

  4. Create a new action node, Transform: Evaluate Data Quality:
  5. For Node parents, choose the node you created.
  6. Add the ruleset-5 definition under Ruleset editor.
    Data quality rules for ruleset-5
  7. Scroll to the end and under Performance Configuration, enable Cache Data.

Enable Cache data option

  8. Under Job details, for IAM Role, choose AWSGlueDataQualityPerformanceRole.
    Select previously created AWS IAM role
  9. In the Tags section, define dqjob tag as rs5.

This tag will be different for each of the data quality ETL jobs; we use them in AWS Cost Explorer to review the cost of the ETL jobs.

Define dqjob tag for ruleset-5 job

  10. Choose Save.
  11. Repeat these steps with the rest of the rulesets to define all the ETL jobs.

Overview of jobs defined in AWS Glue Studio

Run the AWS Glue ETL jobs

Complete the following steps to run the ETL jobs:

  1. On the AWS Glue console, choose Visual ETL under ETL jobs in the navigation pane.
  2. Select the ETL job and choose Run job.
  3. Repeat for all the ETL jobs.

Select one AWS Glue job and choose Run Job on the top right

When the ETL jobs are complete, the Job run monitoring page will display the job details. As shown in the following screenshot, a DPU hours column is provided for each ETL job.

Overview of AWS Glue jobs monitoring

Review performance

The following table summarizes the duration, DPU hours, and estimated costs from running the eight different data quality rulesets over the same test dataset. Note that all rulesets have been run with the entire test dataset described earlier (104 columns, 1 million rows).

ETL Job Name Number of Rules Tag Duration (sec) # of DPU hours # of DPUs Cost ($)
ruleset-400 400 dqjob:rs400 445.7 1.24 10 $0.54
ruleset-200 200 dqjob:rs200 235.7 0.65 10 $0.29
ruleset-100 100 dqjob:rs100 186.5 0.52 10 $0.23
ruleset-50 50 dqjob:rs50 155.2 0.43 10 $0.19
ruleset-10 10 dqjob:rs10 152.2 0.42 10 $0.18
ruleset-5 5 dqjob:rs5 150.3 0.42 10 $0.18
ruleset-1 1 dqjob:rs1 150.1 0.42 10 $0.18
ruleset-0 0 dqjob:rs0 53.2 0.15 10 $0.06

The cost of evaluating an empty ruleset is close to zero, but it has been included because it can be used as a quick test to validate the IAM roles associated to the AWS Glue Data Quality jobs and read permissions to the test dataset in Amazon S3. The cost of data quality jobs only starts to increase after evaluating rulesets with more than 100 rules, remaining constant below that number.

We can observe that the cost of running data quality for the largest ruleset in the benchmark (400 rules) is still slightly above $0.50.

Data quality cost analysis in AWS Cost Explorer

In order to see the data quality ETL job tags in AWS Cost Explorer, you need to activate the user-defined cost allocation tags first.

After you create and apply user-defined tags to your resources, it can take up to 24 hours for the tag keys to appear on your cost allocation tags page for activation. It can then take up to 24 hours for the tag keys to activate.
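
If you prefer to script this step instead of using the Billing console, the following hedged boto3 sketch activates the dqjob tag through the Cost Explorer UpdateCostAllocationTagsStatus API. It assumes the tag key has already appeared in your account and that the calling principal has the required billing permissions.

import boto3

ce = boto3.client("ce")  # AWS Cost Explorer API

# Mark the user-defined dqjob tag key as an active cost allocation tag
ce.update_cost_allocation_tags_status(
    CostAllocationTagsStatus=[
        {"TagKey": "dqjob", "Status": "Active"},
    ]
)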

  1. On the AWS Cost Explorer console, choose Cost Explorer Saved Reports in the navigation pane.
  2. Choose Create new report.
    Create new AWS Cost Explorer report
  3. Select Cost and usage as the report type.
  4. Choose Create Report.
    Confirm creation of a new AWS Cost Explorer report
  5. For Date Range, enter a date range.
  6. For Granularity, choose Daily.
  7. For Dimension, choose Tag, then choose the dqjob tag.
    Report parameter selection in AWS Cost Explorer
  8. Under Applied filters, choose the dqjob tag and the eight tags used in the data quality rulesets (rs0, rs1, rs5, rs10, rs50, rs100, rs200, and rs400).
    Select the eight tags used to tag the data quality AWS Glue jobs
  9. Choose Apply.

The Cost and usage report will refresh, with the X-axis showing the data quality ruleset tags as categories, and display the total monthly cost of the latest data quality ETL job runs, aggregated by ETL job.

The AWS Cost Explorer report shows the costs associated to executing the data quality AWS Glue Studio jobs

Clean up

To clean up the infrastructure and avoid additional charges, complete the following steps:

  1. Empty the S3 bucket initially created to store the test dataset.
  2. Delete the ETL jobs you created in AWS Glue.
  3. Delete the AWSGlueDataQualityPerformanceRole IAM role.
  4. Delete the custom report created in AWS Cost Explorer.

Conclusion

AWS Glue Data Quality provides an efficient way to incorporate data quality validation as part of ETL pipelines and scales automatically to accommodate increasing volumes of data. The built-in data quality rule types offer a wide range of options to customize the data quality checks and focus on how your data should look instead of implementing undifferentiated logic.

In this benchmark analysis, we showed how common-size AWS Glue Data Quality rulesets have little or no overhead, whereas in complex cases, the cost increases linearly. We also reviewed how you can tag AWS Glue Data Quality jobs to make cost information available in AWS Cost Explorer for quick reporting.

AWS Glue Data Quality is generally available in all AWS Regions where AWS Glue is available. Learn more about AWS Glue Data Quality and AWS Glue Data Catalog in Getting started with AWS Glue Data Quality from the AWS Glue Data Catalog.


About the Authors


Ruben Afonso is a Global Financial Services Solutions Architect with AWS. He enjoys working on analytics and AI/ML challenges, with a passion for automation and optimization. When not at work, he enjoys finding hidden spots off the beaten path around Barcelona.


Kalyan Kumar Neelampudi (KK) is a Specialist Partner Solutions Architect (Data Analytics & Generative AI) at AWS. He acts as a technical advisor and collaborates with various AWS partners to design, implement, and build practices around data analytics and AI/ML workloads. Outside of work, he’s a badminton enthusiast and culinary adventurer, exploring local cuisines and traveling with his partner to discover new tastes and experiences.

Gonzalo Herreros is a Senior Big Data Architect on the AWS Glue team.

Build a pseudonymization service on AWS to protect sensitive data: Part 2

Post Syndicated from Edvin Hallvaxhiu original https://aws.amazon.com/blogs/big-data/build-a-pseudonymization-service-on-aws-to-protect-sensitive-data-part-2/

Part 1 of this two-part series described how to build a pseudonymization service that converts plain text data attributes into a pseudonym or vice versa. A centralized pseudonymization service provides a unique and universally recognized architecture for generating pseudonyms. Consequently, an organization can achieve a standard process to handle sensitive data across all platforms. Additionally, this takes away any complexity and expertise needed to understand and implement various compliance requirements from development teams and analytical users, allowing them to focus on their business outcomes.

Following a decoupled service-based approach means that, as an organization, you are unbiased towards the use of any specific technologies to solve your business problems. No matter which technology is preferred by individual teams, they are able to call the pseudonymization service to pseudonymize sensitive data.

In this post, we focus on common extract, transform, and load (ETL) consumption patterns that can use the pseudonymization service. We discuss how to use the pseudonymization service in your ETL jobs on Amazon EMR (using Amazon EMR on EC2) for streaming and batch use cases. Additionally, you can find an Amazon Athena and AWS Glue based consumption pattern in the GitHub repo of the solution.

Solution overview

The following diagram describes the solution architecture.

The account on the right hosts the pseudonymization service, which you can deploy using the instructions provided in Part 1 of this series.

The account on the left is the one that you set up as part of this post, representing the ETL platform based on Amazon EMR using the pseudonymization service.

You can deploy the pseudonymization service and the ETL platform on the same account.

Amazon EMR empowers you to create, operate, and scale big data frameworks such as Apache Spark quickly and cost-effectively.

In this solution, we show how to consume the pseudonymization service on Amazon EMR with Apache Spark for batch and streaming use cases. The batch application reads data from an Amazon Simple Storage Service (Amazon S3) bucket, and the streaming application consumes records from Amazon Kinesis Data Streams.

PySpark code used in batch and streaming jobs

Both applications use a common utility function that makes HTTP POST calls against the API Gateway that is linked to the pseudonymization AWS Lambda function. The REST API calls are made per Spark partition using the Spark RDD mapPartitions function. The POST request body contains the list of unique values for a given input column. The POST request response contains the corresponding pseudonymized values. The code swaps the sensitive values with the pseudonymized ones for a given dataset. The result is saved to Amazon S3 and the AWS Glue Data Catalog, using Apache Iceberg table format.
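
The following is a simplified sketch of that utility function. The endpoint URL, API key handling, request and response shapes, and column name are placeholders, and the complete implementation is available in the GitHub repo of the solution.

import requests

# Hypothetical endpoint and key; in the solution these come from the deployment
# outputs and AWS Secrets Manager
API_URL = "https://example.execute-api.eu-west-1.amazonaws.com/prod/pseudonymization"
API_KEY = "retrieved-from-secrets-manager"

def pseudonymize_partition(rows, column="vin"):
    rows = list(rows)
    unique_values = list({row[column] for row in rows})
    # One POST call per Spark partition, containing the unique plain-text values
    response = requests.post(
        API_URL,
        headers={"x-api-key": API_KEY},
        json={"messages": unique_values},
    )
    response.raise_for_status()
    mapping = dict(zip(unique_values, response.json()["messages"]))
    # Swap the sensitive value with its pseudonym in every row of the partition
    for row in rows:
        data = row.asDict()
        data[column] = mapping[data[column]]
        yield data

# Applied per partition, for example:
# pseudonymized_rdd = input_df.rdd.mapPartitions(pseudonymize_partition)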

Iceberg is an open table format that supports ACID transactions, schema evolution, and time travel queries. You can use these features to implement the right to be forgotten (or data erasure) solutions using SQL statements or programming interfaces. Iceberg is supported by Amazon EMR starting with version 6.5.0, AWS Glue, and Athena. Batch and streaming patterns use Iceberg as their target format. For an overview of how to build an ACID compliant data lake using Iceberg, refer to Build a high-performance, ACID compliant, evolving data lake using Apache Iceberg on Amazon EMR.

Prerequisites

You must have the following prerequisites:

  • An AWS account.
  • An AWS Identity and Access Management (IAM) principal with privileges to deploy the AWS CloudFormation stack and related resources.
  • The AWS Command Line Interface (AWS CLI) installed on the development or deployment machine that you will use to run the provided scripts.
  • An S3 bucket in the same account and AWS Region where the solution is to be deployed.
  • Python3 installed in the local machine where the commands are run.
  • PyYAML installed using pip.
  • A bash terminal to run bash scripts that deploy CloudFormation stacks.
  • An additional S3 bucket containing the input dataset in Parquet files (only for batch applications). Copy the sample dataset to the S3 bucket.
  • A copy of the latest code repository in the local machine using git clone or the download option.

Open a new bash terminal and navigate to the root folder of the cloned repository.

The source code for the proposed patterns can be found in the cloned repository. It uses the following parameters:

  • ARTEFACT_S3_BUCKET – The S3 bucket where the infrastructure code will be stored. The bucket must be created in the same account and Region where the solution lives.
  • AWS_REGION – The Region where the solution will be deployed.
  • AWS_PROFILE – The named profile that will be applied to the AWS CLI command. This should contain credentials for an IAM principal with privileges to deploy the CloudFormation stack of related resources.
  • SUBNET_ID – The subnet ID where the EMR cluster will be spun up. The subnet is pre-existing and for demonstration purposes, we use the default subnet ID of the default VPC.
  • EP_URL – The endpoint URL of the pseudonymization service. Retrieve this from the solution deployed as Part 1 of this series.
  • API_SECRET – An Amazon API Gateway key that will be stored in AWS Secrets Manager. The API key is generated from the deployment depicted in Part 1 of this series.
  • S3_INPUT_PATH – The S3 URI pointing to the folder containing the input dataset as Parquet files.
  • KINESIS_DATA_STREAM_NAME – The Kinesis data stream name deployed with the CloudFormation stack.
  • BATCH_SIZE – The number of records to be pushed to the data stream per batch.
  • THREADS_NUM – The number of parallel threads used in the local machine to upload data to the data stream. More threads correspond to a higher message volume.
  • EMR_CLUSTER_ID – The EMR cluster ID where the code will be run (the EMR cluster was created by the CloudFormation stack).
  • STACK_NAME – The name of the CloudFormation stack, which is assigned in the deployment script.

Batch deployment steps

As described in the prerequisites, before you deploy the solution, upload the Parquet files of the test dataset to Amazon S3. Then provide the S3 path of the folder containing the files as the parameter <S3_INPUT_PATH>.

We create the solution resources via AWS CloudFormation. You can deploy the solution by running the deploy_1.sh script, which is inside the deployment_scripts folder.

After the deployment prerequisites have been satisfied, enter the following command to deploy the solution:

sh ./deployment_scripts/deploy_1.sh \
-a <ARTEFACT_S3_BUCKET> \
-r <AWS_REGION> \
-p <AWS_PROFILE> \
-s <SUBNET_ID> \
-e <EP_URL> \
-x <API_SECRET> \
-i <S3_INPUT_PATH>

The output should look like the following screenshot.

The required parameters for the cleanup command are printed out at the end of the run of the deploy_1.sh script. Make sure to note down these values.

Test the batch solution

In the CloudFormation template deployed using the deploy_1.sh script, the EMR step containing the Spark batch application is added at the end of the EMR cluster setup.

To verify the results, check the S3 bucket identified in the CloudFormation stack outputs with the variable SparkOutputLocation.

You can also use Athena to query the table pseudo_table in the database blog_batch_db.

Clean up batch resources

To destroy the resources created as part of this exercise, navigate in a bash terminal to the root folder of the cloned repository and enter the cleanup command shown as the output of the previously run deploy_1.sh script:

sh ./deployment_scripts/cleanup_1.sh \
-a <ARTEFACT_S3_BUCKET> \
-s <STACK_NAME> \
-r <AWS_REGION> \
-e <EMR_CLUSTER_ID>

The output should look like the following screenshot.

Streaming deployment steps

We create the solution resources via AWS CloudFormation. You can deploy the solution by running the deploy_2.sh script, which is inside the deployment_scripts folder. The CloudFormation stack template for this pattern is available in the GitHub repo.

After the deployment prerequisites have been satisfied, enter the following command to deploy the solution:

sh deployment_scripts/deploy_2.sh \
-a <ARTEFACT_S3_BUCKET> \
-r <AWS_REGION> \
-p <AWS_PROFILE> \
-s <SUBNET_ID> \
-e <EP_URL> \
-x <API_SECRET>

The output should look like the following screenshot.

The required parameters for the cleanup command are printed out at the end of the output of the deploy_2.sh script. Make sure to save these values to use later.

Test the streaming solution

In the CloudFormation template deployed using the deploy_2.sh script, the EMR step containing the Spark streaming application is added at the end of the EMR cluster setup. To test the end-to-end pipeline, you need to push records to the deployed Kinesis data stream. With the following commands in a bash terminal, you can activate a Kinesis producer that will continuously put records in the stream, until the process is manually stopped. You can control the producer’s message volume by modifying the BATCH_SIZE and the THREADS_NUM variables.

python3 -m pip install kiner
python3 \
consumption-patterns/emr/1_pyspark-streaming/kinesis_producer/producer.py \
<KINESIS_DATA_STREAM_NAME> \
<BATCH_SIZE> \
<THREADS_NUM>

To verify the results, check the S3 bucket identified in the CloudFormation stack outputs with the variable SparkOutputLocation.

In the Athena query editor, check the results by querying the table pseudo_table in the database blog_stream_db.

Clean up streaming resources

To destroy the resources created as part of this exercise, complete the following steps:

  1. Stop the Python Kinesis producer that was launched in a bash terminal in the previous section.
  2. Enter the following command:
sh ./deployment_scripts/cleanup_2.sh \
-a <ARTEFACT_S3_BUCKET> \
-s <STACK_NAME> \
-r <AWS_REGION> \
-e <EMR_CLUSTER_ID>

The output should look like the following screenshot.

Performance details

Use cases might differ in requirements with respect to data size, compute capacity, and cost. We have provided some benchmarking and factors that may influence performance; however, we strongly advise you to validate the solution in lower environments to see if it meets your particular requirements.

You can influence the performance of the proposed solution (which aims to pseudonymize a dataset using Amazon EMR) by the maximum number of parallel calls to the pseudonymization service and the payload size for each call.

In terms of parallel calls, factors to consider are the GetSecretValue calls limit from Secrets Manager (10,000 per second, hard limit) and the Lambda default concurrency parallelism (1,000 by default; can be increased by quota request). You can control the maximum parallelism by adjusting the number of executors, the number of partitions composing the dataset, and the cluster configuration (number and type of nodes).

In terms of payload size for each call, factors to consider are the API Gateway maximum payload size (6 MB) and the Lambda function maximum runtime (15 minutes). You can control the payload size and the Lambda function runtime by adjusting the batch size value, which is a parameter of the PySpark script that determines the number of items to be pseudonymized per each API call.

To capture the influence of all these factors and assess the performance of the consumption patterns using Amazon EMR, we have designed and monitored the following scenarios.

Batch consumption pattern performance

To assess the performance for the batch consumption pattern, we ran the pseudonymization application with three input datasets composed of 1, 10, and 100 Parquet files of 97.7 MB each. We generated the input files using the dataset_generator.py script.

The cluster capacity nodes were 1 primary (m5.4xlarge) and 15 core (m5d.8xlarge). This cluster configuration remained the same for all three scenarios, and it allowed the Spark application to use up to 100 executors. The batch_size, which was also the same for the three scenarios, was set to 900 VINs per API call, and the maximum VIN size was 5 bytes.

The following table captures the information of the three scenarios.

Execution ID Repartition Dataset Size Number of Executors Cores per Executor Executor Memory Runtime
A 800 9.53 GB 100 4 4 GiB 11 minutes, 10 seconds
B 80 0.95 GB 10 4 4 GiB 8 minutes, 36 seconds
C 8 0.09 GB 1 4 4 GiB 7 minutes, 56 seconds

As we can see, properly parallelizing the calls to our pseudonymization service enables us to control the overall runtime.

In the following examples, we analyze three important Lambda metrics for the pseudonymization service: Invocations, ConcurrentExecutions, and Duration.

The following graph depicts the Invocations metric, with the statistic SUM in orange and RUNNING SUM in blue.

By calculating the difference between the starting and ending point of the cumulative invocations, we can extract how many invocations were made during each run.

Run ID Dataset Size Total Invocations
A 9.53 GB 1,467,000 – 0 = 1,467,000
B 0.95 GB 1,616,500 – 1,467,000 = 149,500
C 0.09 GB 1,631,000 – 1,616,500 = 14,500

As expected, the number of invocations increases roughly by a factor of 10 with the dataset size.

The following graph depicts the total ConcurrentExecutions metric, with the statistic MAX in blue.

The application is designed such that the maximum number of concurrent Lambda function runs is given by the amount of Spark tasks (Spark dataset partitions), which can be processed in parallel. This number can be calculated as MIN (executors x executor_cores, Spark dataset partitions).

In the test, run A processed 800 partitions, using 100 executors with four cores each. This makes 400 tasks processed in parallel, so the Lambda function concurrent runs can’t be above 400. The same logic was applied for runs B and C. We can see this reflected in the preceding graph, where the number of concurrent runs never surpasses the 400, 40, and 4 values.

To avoid throttling, make sure that the amount of Spark tasks that can be processed in parallel is not above the Lambda function concurrency limit. If that is the case, you should either increase the Lambda function concurrency limit (if you want to keep up the performance) or reduce either the amount of partitions or the number of available executors (impacting the application performance).

The following graph depicts the Lambda Duration metric, with the statistic AVG in orange and MAX in green.

As expected, the size of the dataset doesn’t affect the duration of the pseudonymization function run, which, apart from some initial invocations facing cold starts, remains constant at an average of 3 milliseconds throughout the three scenarios. This is because the maximum number of records included in each pseudonymization call is constant (the batch_size value).

Lambda is billed based on the number of invocations and the time it takes for your code to run (duration). You can use the average duration and invocations metrics to estimate the cost of the pseudonymization service.
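
For example, a rough estimate for run A could look like the following sketch, where the Lambda memory size and unit prices are placeholders to replace with your own configuration and the current Lambda pricing for your Region.

# Back-of-the-envelope Lambda cost estimate using the metrics observed in run A
invocations = 1_467_000
avg_duration_s = 0.003            # average duration reported above
memory_gb = 0.128                 # assumed function memory: 128 MB

price_per_request = 0.20 / 1_000_000   # assumed price per request
price_per_gb_second = 0.0000166667     # assumed price per GB-second

request_cost = invocations * price_per_request
compute_cost = invocations * avg_duration_s * memory_gb * price_per_gb_second
print(f"Estimated Lambda cost for run A: ${request_cost + compute_cost:.2f}")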

Streaming consumption pattern performance

To assess the performance for the streaming consumption pattern, we ran the producer.py script, which defines a Kinesis data producer that pushes records in batches to the Kinesis data stream.

The streaming application was left running for 15 minutes and it was configured with a batch_interval of 1 minute, which is the time interval at which streaming data will be divided into batches. The following table summarizes the relevant factors.

Repartition Cluster Capacity Nodes Number of Executors Executor’s Memory Batch Window Batch Size VIN Size
17 1 Primary (m5.xlarge), 3 Core (m5.2xlarge) 6 9 GiB 60 seconds 900 VINs/API call 5 Bytes/VIN

The following graphs depict the Kinesis Data Streams metrics PutRecords (in blue) and GetRecords (in orange), aggregated with a 1-minute period and using the statistic SUM. The first graph shows the metric in bytes, peaking at 6.8 MB per minute. The second graph shows the metric in record count, peaking at 85,000 records per minute.

We can see that the metrics GetRecords and PutRecords have overlapping values for almost the entire application’s run. This means that the streaming application was able to keep up with the load of the stream.

Next, we analyze the relevant Lambda metrics for the pseudonymization service: Invocations, ConcurrentExecutions, and Duration.

The following graph depicts the Invocations metric, with the statistic SUM (in orange) and RUNNING SUM in blue.

By calculating the difference between the starting and ending point of the cumulative invocations, we can extract how many invocations were made during the run. Specifically, in 15 minutes, the streaming application invoked the pseudonymization API 977 times, which is around 65 calls per minute.

The following graph depicts the total ConcurrentExecutions metric, with the statistic MAX in blue.

The repartition and the cluster configuration allow the application to process all Spark RDD partitions in parallel. As a result, the concurrent runs of the Lambda function are always equal to or below the repartition number, which is 17.

To avoid throttling, make sure that the amount of Spark tasks that can be processed in parallel is not above the Lambda function concurrency limit. For this aspect, the same suggestions as for the batch use case are valid.

The following graph depicts the Lambda Duration metric, with the statistic AVG in blue and MAX in orange.

As expected, aside from the Lambda function’s cold start, the average duration of the pseudonymization function was more or less constant throughout the run. This is because the batch_size value, which defines the number of VINs to pseudonymize per call, was set to 900 and remained constant.

The ingestion rate of the Kinesis data stream and the consumption rate of our streaming application are factors that influence the number of API calls made against the pseudonymization service and therefore the related cost.

The following graph depicts the Lambda Invocations metric, with the statistic SUM in orange, and the Kinesis Data Streams GetRecords.Records metric, with the statistic SUM in blue. We can see that there is a correlation between the number of records retrieved from the stream per minute and the number of Lambda function invocations, which impacts the cost of the streaming run.

In addition to the batch_interval, we can control the streaming application’s consumption rate using Spark streaming properties like spark.streaming.receiver.maxRate and spark.streaming.blockInterval. For more details, refer to Spark Streaming + Kinesis Integration and Spark Streaming Programming Guide.

Conclusion

Navigating through the rules and regulations of data privacy laws can be difficult. Pseudonymization of PII attributes is one of many points to consider while handling sensitive data.

In this two-part series, we explored how you can build and consume a pseudonymization service using various AWS services with features to assist you in building a robust data platform. In Part 1, we built the foundation by showing how to build a pseudonymization service. In this post, we showcased the various patterns to consume the pseudonymization service in a cost-efficient and performant manner. Check out the GitHub repository for additional consumption patterns.


About the Authors

Edvin Hallvaxhiu is a Senior Global Security Architect with AWS Professional Services and is passionate about cybersecurity and automation. He helps customers build secure and compliant solutions in the cloud. Outside work, he likes traveling and sports.

Rahul Shaurya is a Principal Big Data Architect with AWS Professional Services. He helps and works closely with customers building data platforms and analytical applications on AWS. Outside of work, Rahul loves taking long walks with his dog Barney.

Andrea Montanari is a Senior Big Data Architect with AWS Professional Services. He actively supports customers and partners in building analytics solutions at scale on AWS.

María Guerra is a Big Data Architect with AWS Professional Services. Maria has a background in data analytics and mechanical engineering. She helps customers architecting and developing data related workloads in the cloud.

Pushpraj Singh is a Senior Data Architect with AWS Professional Services. He is passionate about Data and DevOps engineering. He helps customers build data driven applications at scale.

How to access AWS resources from Microsoft Entra ID tenants using AWS Security Token Service

Post Syndicated from Vasanth Selvaraj original https://aws.amazon.com/blogs/security/how-to-access-aws-resources-from-microsoft-entra-id-tenants-using-aws-security-token-service/

Use of long-term access keys for authentication between cloud resources increases the risk of key exposure and unauthorized secrets reuse. Amazon Web Services (AWS) has developed a solution that enables customers to securely authenticate Azure resources with AWS resources using short-lived tokens, reducing those risks.

In this post, we guide you through the configuration of an AWS Identity and Access Management (IAM) OpenID Connect (OIDC) identity provider to establish trust with a Microsoft Entra ID tenant. By following the steps outlined in this post, you will enable Microsoft Azure hosted resources to use an IAM role, with privileges, to access your AWS resources.

Solution overview

In this solution, we show you how to obtain temporary credentials in IAM. The solution uses AWS Security Token Service (AWS STS) in conjunction with Azure managed identities and Azure App Registration. This method provides a more secure and efficient way to bridge Azure and AWS clouds, providing seamless integration without compromising secure authentication and authorization standards.

Figure 1: Azure cloud resources access AWS resources with temporary security credentials

As shown in Figure 1, the process is as follows (a code sketch of steps 2-4 follows the list):

  1. Create and attach an Azure managed identity to an Azure virtual machine (VM).
  2. Azure VM gets an Azure access token from the managed identity and sends it to AWS STS to retrieve temporary security credentials.
  3. An IAM role created with a valid Azure tenant audience and subject validates that the claim is sourced from a trusted entity and sends temporary security credentials to the requesting Azure VM.
  4. Azure VM accesses AWS resources using the AWS STS provided temporary security credentials.
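
The following is a hedged Python sketch of steps 2-4 as they could run on the Azure VM, using the Azure Instance Metadata Service to obtain the managed identity's access token and AWS STS AssumeRoleWithWebIdentity to exchange it for temporary credentials. The role ARN and audience are placeholders for the values you configure later in this post.

import boto3
import requests

AUDIENCE = "api://my-aws-access-app"   # the application ID URI configured later in this post
ROLE_ARN = "arn:aws:iam::111122223333:role/AzureEntraAccessRole"

# Step 2a: get an access token for the VM's managed identity from the Azure IMDS endpoint
token_response = requests.get(
    "http://169.254.169.254/metadata/identity/oauth2/token",
    params={"api-version": "2018-02-01", "resource": AUDIENCE},
    headers={"Metadata": "true"},
)
azure_token = token_response.json()["access_token"]

# Steps 2b and 3: exchange the Azure token for temporary AWS credentials
sts = boto3.client("sts")
credentials = sts.assume_role_with_web_identity(
    RoleArn=ROLE_ARN,
    RoleSessionName="azure-vm-session",
    WebIdentityToken=azure_token,
)["Credentials"]

# Step 4: use the short-lived credentials to access AWS resources
s3 = boto3.client(
    "s3",
    aws_access_key_id=credentials["AccessKeyId"],
    aws_secret_access_key=credentials["SecretAccessKey"],
    aws_session_token=credentials["SessionToken"],
)
print(s3.list_buckets()["Buckets"])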

Prerequisites

You must have the following before you begin:

  1. An AWS account.
  2. An Azure account subscription.
  3. In your Azure account, ensure there’s an existing managed identity or create a new one for testing this solution. More information can be found in Configure managed identities for Azure resources on a VM using the Azure portal.
  4. Create a VM instance in Azure and attach the managed identity that you created in Step 3.
  5. Install jq, boto3, and AWS Command Line Interface (AWS CLI) version 2 on an Azure VM for testing.

Implementation

To prepare the authentication process with Microsoft Entra ID, an enterprise application must be created in Microsoft Entra ID. This serves as a sign-in endpoint and provides the necessary user identity information through OIDC access tokens to the identity provider (IdP) of the target AWS account.

Note: You can get short term credentials by providing access tokens from managed identities or enterprise applications. This post covers the enterprise application use case.

Register a new application in Azure

  1. In the Azure portal, select Microsoft Entra ID.
  2. Select App registrations.
  3. Select New registration.
  4. Enter a name for your application and then select an option in Supported account types (in this example, we chose Accounts in this Organization directory only). Leave the other options as is. Then choose Register.
    Figure 2: Register an application in the Azure portal

Configure the application ID URI

  1. In the Azure portal, select Microsoft Entra ID.
  2. Select App registrations.
  3. On the App registrations page, select All applications and choose the newly registered application.
  4. On the newly registered application’s overview page, choose Application ID URI and then select Add.
  5. On the Edit application ID URI page, enter the value of the URI, which looks like urn://<name of the application> or api://<name of the application>.
  6. The application ID URI will be used later as the audience in the identity provider (IdP) section of AWS.
    Figure 3: Configure the application ID URI

  7. Open the newly registered application’s overview page.
  8. In the navigation pane, under Manage, choose App roles.
  9. Select Create app role, enter a Display name, and for Allowed member types, select Both (Users/Groups + Applications).
  10. For Description, enter a description.
  11. Select Do you want to enable this app role? and then choose Apply.
    Figure 4: Create and enable an application role

  12. Assign a managed identity—as created in Step 3 of the prerequisites—to the new application role. This operation can only be done by either using the Azure Cloud Shell or running scripts locally by installing the latest version of the Microsoft Graph PowerShell SDK. (For more information about assigning managed identities to application roles using PowerShell, see the Azure documentation.)

    You must have the following information:

    • ObjectID: To find the managed identity’s Object (Principal) ID, go to the Managed Identities page, select the identity name, and then select Overview.
      Figure 5: Find the ObjectID of the managed identity

    • ID: To find the ID of the application role, go to App registrations, select the application name, and then select App roles.
      Figure 6: Find the ID of the application role

    • PrincipalID: Same as ObjectID, which is the managed identity’s Object (Principal) ID.
    • ResourceID: The ObjectID of the resource service principal, which you can find by going to the Enterprise applications page and selecting the application. Select Overview and then Properties to find the ObjectID.
      Figure 7: Find the ResourceID

  13. With the resource IDs, you can now use Azure Cloud Shell and run the following script in a PowerShell terminal with New-AzureADServiceAppRoleAssignment. Replace the variables with the resource IDs.
    PS /home/user> Connect-AzureAD
    PS /home/user> New-AzureADServiceAppRoleAssignment -ObjectId <ObjectID> -Id <ID> -PrincipalId <PrincipalID> -ResourceId <ResourceID>
    

Configure AWS

  1. In the AWS Management Console for IAM, create an IAM Identity Provider.
    1. In the left navigation pane, select Identity providers and then choose Add an identity provider.
    2. For Provider type, choose OpenID Connect.
    3. For Provider URL, enter https://sts.windows.net/<Microsoft Entra Tenant ID>. Replace <Microsoft Entra Tenant ID> with your Tenant ID from Azure. This allows only identities from your Azure tenant to access your AWS resources.
    4. For Audience, use the client_id of the Azure managed identity or the application ID URI from the enterprise application.
      • For Audience, enter the application ID URI that you configured in step 5 of Configure the application ID URI. If you have additional client IDs (also known as audiences) for this IdP, you can add them to the provider detail page later.
      • You can also use different audiences in the role trust policy in the next step to limit the roles that specific audiences can assume. To do so, you must provide a StringEquals condition in the trust policy of the IAM role.
        Figure 8: Adding an audience (client ID)

  2. Using an OIDC principal without a condition can be overly permissive. To make sure that only the intended identities assume the role, provide an audience (aud) and subject (sub) as conditions in the role trust policy for this IAM role.

    The sts.windows.net/<Microsoft Entra Tenant ID>/:sub condition represents the identity of your Azure workload and limits role assumption to that specific Azure identity from your tenant. See the following example conditions; a boto3 sketch of the same identity provider and role setup follows the policy example.

    • Replace <Microsoft Entra Tenant ID> with your tenant ID from Azure.
    • Replace <Application ID URI> with your audience value configured in the previous step.
    • Replace <Managed Identity’s object (Principal) ID> with your ObjectID captured in the first bullet of Step 12 of Configure the application ID URI.
    {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": "arn:aws:iam::<AWS Account ID>:oidc-provider/sts.windows.net/<Microsoft Entra Tenant ID>/"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    "sts.windows.net/<Microsoft Entra Tenant ID>/:aud": "<Application ID URI>",
                    "sts.windows.net/<Microsoft Entra Tenant ID>/:sub": "<Managed Identity's Object (Principal) ID>"
                }
            }
        }]
    }
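
If you prefer to script this AWS-side setup, the following is a minimal boto3 sketch of the same two steps: creating the OIDC identity provider and creating the IAM role with the trust policy shown above. The tenant ID, application ID URI, managed identity object ID, thumbprint, and role name are placeholders and assumptions that you must replace with your own values.

    import json
    import boto3

    iam = boto3.client("iam")

    # Placeholders: replace with your own values
    tenant_id = "<Microsoft Entra Tenant ID>"
    audience = "<Application ID URI>"
    subject = "<Managed Identity's Object (Principal) ID>"
    account_id = "<AWS Account ID>"

    # 1. Create the OIDC identity provider that trusts your Azure tenant
    iam.create_open_id_connect_provider(
        Url=f"https://sts.windows.net/{tenant_id}/",
        ClientIDList=[audience],
        ThumbprintList=["<thumbprint of the provider's signing certificate>"],
    )

    # 2. Create the IAM role with the audience and subject conditions shown above
    trust_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/sts.windows.net/{tenant_id}/"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    f"sts.windows.net/{tenant_id}/:aud": audience,
                    f"sts.windows.net/{tenant_id}/:sub": subject,
                }
            },
        }],
    }
    iam.create_role(
        RoleName="Azure-AWSAssumeRole",
        AssumeRolePolicyDocument=json.dumps(trust_policy),
    )

Attach the permissions policies your workload needs (for example, an S3 read-only policy) to the role afterward; the role name above matches the one used in the test script later in this post.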
    

Test the access

To test the access, you’ll assign a user assigned managed identity to an existing VM.

  1. Sign in to the Azure portal.
  2. Navigate to the desired VM and select Identity, User assigned, and then choose Add.
    Figure 9: Assigning a User assigned Identity

  3. Select the managed identity created as part of the prerequisites and then choose Add.
    Figure 10: Add a user assigned managed identity

  4. In AWS, we used credential_process in a separate profile in the AWS CLI config file to dynamically and programmatically retrieve AWS temporary credentials. The credential process calls a bash script that retrieves an access token from Azure and uses the token to obtain temporary credentials from AWS STS. For the syntax and operating system requirements, see Source credentials with an external process. For this post, we created a custom profile called DevTeam-S3ReadOnlyAccess, as shown in the config file:
    [profile DevTeam-S3ReadOnlyAccess]
    credential_process = /opt/bin/credentials.sh
    region = ap-southeast-2
    

    To use different settings, you can create and reference additional profiles.

  5. For this example, credential_process invokes the script /opt/bin/credentials.sh. Replace <111122223333> with your own account ID.
    /opt/bin/credentials.sh
    #!/bin/bash

    # Application ID URI from Azure
    AUDIENCE="urn://dev-aws-account-team-a"
    # Role ARN from AWS to assume
    ROLE_ARN="arn:aws:iam::<111122223333>:role/Azure-AWSAssumeRole"

    # Retrieve Access Token using Audience
    access_token=$(curl "http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=${AUDIENCE}" -H "Metadata:true" -s | jq -r '.access_token')

    # Create credentials following JSON format required by AWS CLI
    credentials=$(aws sts assume-role-with-web-identity --role-arn ${ROLE_ARN} --web-identity-token $access_token --role-session-name AWSAssumeRole | jq '.Credentials' | jq '.Version=1')

    # Write credentials to STDOUT for AWS CLI to pick up
    echo $credentials
    

  6. After you configure the AWS CLI config file for the credential_process script, verify the setup by accessing AWS resources from the Azure VM.
    1. Use the AWS CLI to run the following command. You should see a list of Amazon Simple Storage Service (Amazon S3) buckets from your account.
      aws s3 ls --profile DevTeam-S3ReadOnlyAccess
      

    2. Use the AWS SDK for Python (Boto3) to run S3AccessFromAzure.py. You should see a list of S3 buckets from your account. This example also demonstrates specifying a profile to use for credentials.
      S3AccessFromAzure.py
      import boto3

      # Assume Role with Web Identity Provider profile
      session = boto3.Session(profile_name='DevTeam-S3ReadOnlyAccess')

      # Retrieve the list of existing buckets
      s3 = session.client('s3')
      response = s3.list_buckets()

      # Output the bucket names
      print('Existing buckets:')
      for bucket in response['Buckets']:
          print(f'  {bucket["Name"]}')
      

Note: The AWS CLI doesn’t cache external process credentials; instead, the AWS CLI calls the credential_process for every CLI request, which creates a new role session. If you use AWS SDKs, the credentials are cached and reused until they expire.

We used an Azure VM as an example to access AWS resources, but a similar approach can be used for any compute resource in Azure that can obtain an Azure access token.

Clean up

If you don’t need the resources that you created for this walkthrough, delete them to avoid future charges for the deployed resources:

  • Delete the VM instance, managed identity, and enterprise applications created in Azure.
  • Delete the resources that you provisioned on AWS to test the solution.

Conclusion

In this post, we showed you how to securely access AWS resources from Azure workloads using an IAM role assumed with short-term credentials. With this solution, your Azure workloads request temporary security credentials, removing the need for long-term AWS credentials and other less secure methods of authentication.

Use the following resources to help you get started with AWS IAM federation:

If you have feedback about this post, submit comments in the Comments section below.

Vasanth Selvaraj

Vasanth is a Senior Security Specialist Technical Account Manager based in Sydney, Australia. Vasanth has a strong passion for Cyber Security. In this role, he assists customers in defending against cyber threats and addresses their security and compliance challenges.

Sam Zhang

Sam is a Security Specialist Technical Account Manager based in Sydney, Australia. Sam possesses expertise in infrastructure security, IAM, and threat detection. Sam is dedicated to aiding enterprises in establishing secure cloud infrastructure and workloads.

AWS CloudHSM architectural considerations for crypto user credential rotation

Post Syndicated from Shankar Rajagopalan original https://aws.amazon.com/blogs/security/aws-cloudhsm-architectural-considerations-for-crypto-user-credential-rotation/

This blog post provides architectural guidance on AWS CloudHSM crypto user credential rotation and is intended for those using or considering using CloudHSM. CloudHSM is a popular solution for secure cryptographic material management. By using this service, organizations can benefit from a robust mechanism to manage their own dedicated FIPS 140-2 level 3 hardware security module (HSM) cluster in the cloud and a client SDK that enables crypto users to perform cryptographic operations on deployed HSMs.

Credential rotation is an AWS Well-Architected best practice because it helps reduce the risks associated with the use of long-term credentials. Additionally, organizations are often required to rotate crypto user credentials for their HSM clusters to meet compliance, regulatory, or industry requirements. Unlike most AWS services, where access to resources is controlled through AWS Identity and Access Management (IAM) users or IAM policies, HSM users are created and maintained directly on the HSM cluster. As a result, how the credential rotation operation is performed might impact the workload's availability. Thus, it's important to understand the available options for rotating crypto user credentials and the impact each option has in terms of ease of implementation and downtime.

In this post, we dive deep into the different options, steps to implement them, and their related pros and cons. We finish with a matrix of the relative downtime, complexity, and cost of each option so you can choose which best fits your use case.

Solution overview

In this document, we consider three approaches:

Approach 1 — For a workload with a defined maintenance window. You can shut down all client connections to CloudHSM, change the crypto user’s password, and subsequently re-establish connections to CloudHSM. This option is the most straightforward, but requires some application downtime.

Approach 2 — You create an additional crypto user (with access to all cryptographic material) with a new password, and deploy new client instances that connect as this user. When the new user and instances are in place, traffic is rerouted to the new instances through a load balancer. This option involves no downtime but requires additional infrastructure (client instances) and a process to share cryptographic material between the crypto users.

Approach 3 — You run two separate and identical environments, directing traffic to a live (blue) environment while making and testing the changes on a secondary (green) environment before redirecting traffic to the green environment. This option involves no downtime, but requires additional infrastructure (client instances and an additional CloudHSM cluster) to support the blue/green deployment strategy.

Solution prerequisites

Approach 1

The first approach uses an application’s planned maintenance window to enact necessary crypto user password changes. It’s the most straightforward of the recommended options, with the least amount of complexity because no additional infrastructure is needed to support the password rotation activity. However, it requires downtime (preferably planned) to rotate the password and update the client application instances; depending on how you deploy a client application, you can shorten the downtime by automating the application deployment process. The main steps for this approach are shown in Figure 1:

Figure 1: Approach 1 to update crypto user password

To implement approach 1:

  1. Terminate all client connections to a CloudHSM cluster. This is necessary because you cannot change a password while a crypto user’s session is active.
    1. You can query an Amazon CloudWatch log group for your CloudHSM cluster to find out if any user session is active. Additionally, you can audit Amazon Virtual Private Cloud (Amazon VPC) Flow Logs by enabling them for the elastic network interfaces (ENIs) related to the CloudHSM cluster to see where the traffic is coming from and link it to the applications.
  2. Change the crypto user password
    1. Use the following command to start CloudHSM CLI interactive mode.
      1. Windows: C:\Program Files\Amazon\CloudHSM\bin\> .\cloudhsm-cli.exe interactive
      2. Linux: $ /opt/cloudhsm/bin/cloudhsm-cli interactive
    2. Use the login command and log in as the user with the password you want to change. aws-cloudhsm > login --username <USERNAME> --role <ROLE>
    3. Enter the user’s password.
    4. Enter the user change-password command. aws-cloudhsm > user change-password --username <USERNAME> --role <ROLE>
    5. Enter the new password.
    6. Re-enter the new password.
  3. Update the client connecting to CloudHSM to use the new credentials. Follow the SDK documentation for detailed steps if you are using PKCS #11, OpenSSL Dynamic Engine, the JCE provider, or the KSP and CNG providers.
  4. Resume all client connections to the CloudHSM cluster.

Approach 2

The second approach employs two crypto users and a blue/green deployment strategy, that is, a deployment strategy in which you create two separate but identical client environments. One environment (blue) runs the current application version with crypto user 1 (CU1) and handles live traffic, while the other environment (green) runs a new application version with the updated crypto user 2 (CU2) password. After testing is complete on the green environment, traffic is directed to the green environment and the blue environment is deprecated.

In this approach, both crypto users have access to the required cryptographic material. When rotating the crypto user password, you spin up new client instances and swap connection credentials to use the second crypto user. Because the client application only uses one crypto user at a time, the second user can remain dormant and be reused in the future as well.

When compared to the first approach, this approach adds complexity to your architecture so that you can redirect live application traffic to the new environment by deploying additional client instances without having to restart. You also need to be aware that a shared user can only perform sign, encrypt, decrypt, verify, and HMAC operations with the shared key. Currently, export, wrap, modify, delete, and derive operations aren't allowed with a shared user. This approach has the advantages of a classic blue/green deployment (no downtime and low risk), in addition to adding redundancy at the user management level by having multiple crypto users with access to the required cryptographic material. Figure 2 depicts a possible architecture:

Figure 2: Approach 2 to update crypto user password

To implement Approach 2:

  1. Set up two crypto users on the CloudHSM cluster, for example CU1 and CU2.
  2. Create cryptographic material required by your application.
  3. Use the key share command to share the key with the other user so that both users have access to all the keys.
    1. Start by running the key list command with a filter to return a specific key.
    2. View the shared-users output to identify whom the key is currently shared with.
    3. To share this key with a crypto user, enter the following command: aws-cloudhsm > key share --filter attr.label="rsa_key_to_share" attr.class=private-key --username <USERNAME> --role crypto-user
  4. If CU1 is used to make client connections (that is, the blue environment) to the CloudHSM cluster, then change the password for CU2.
    1. Follow the instructions in To change HSM user passwords or step 2 of Approach 1 to change the password assigned to CU2.
  5. Spin up new client instances and use CU2 to configure the connection credentials (that is, green environment).
  6. Add the new client instances to a new target group for the existing Application Load Balancer (ALB).
  7. Next, use the weighted target group routing feature of the ALB to route traffic to the newly configured environment (a scripted example follows this list).
    • You can use forward actions of the ALB listener rules setting to route requests to one or more target groups.
    • If you specify multiple target groups for a forward action, you must specify a weight for each target group. Each target group weight is a value from 0 to 999. Requests that match a listener rule with weighted target groups are distributed to these target groups based on their weights. For example, if you specify one with a weight of 10 and the other with a weight of 20, the target group with a weight of 20 receives twice as many requests as the other target group.
    • You can make these changes to the ALB setting using the AWS Command Line Interface (AWS CLI), AWS Management Console, or supported infrastructure as code (IaC) tools.
    • For more information, see Fine-tuning blue/green deployments on application load balancer.
  8. For the next password rotation iteration, you can switch back to using CU1 with updated credentials by updating your client instances and redeploying using steps 6 and 7.
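
The following is a minimal boto3 sketch of the traffic shifting in step 7, assuming you already have the listener ARN and both target group ARNs; the weights are example values that you would adjust over the course of the cutover.

    import boto3

    elbv2 = boto3.client("elbv2")

    # Placeholders: replace with your own ARNs
    listener_arn = "<ALB listener ARN>"
    blue_tg_arn = "<target group ARN for the existing instances using CU1>"
    green_tg_arn = "<target group ARN for the new instances using CU2>"

    # Send roughly one third of requests to the green environment; once the
    # green environment is validated, set the blue weight to 0 to complete the cutover.
    elbv2.modify_listener(
        ListenerArn=listener_arn,
        DefaultActions=[{
            "Type": "forward",
            "ForwardConfig": {
                "TargetGroups": [
                    {"TargetGroupArn": blue_tg_arn, "Weight": 20},
                    {"TargetGroupArn": green_tg_arn, "Weight": 10},
                ]
            },
        }],
    )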

Approach 3

The third approach is a variation of the previous approach as you build an identical environment (blue/green deployment) and change the crypto user password on the new environment to achieve zero downtime for the workload. You create two separate but identical CloudHSM clusters, with one serving as the live (blue) environment, and another as the test (green) environment in which changes are tested prior to deployment. After testing is complete in the green environment, production traffic is directed to the green environment and the blue environment is deprecated.

Again, this approach adds complexity to your architecture so that you can redirect live application traffic to the new environment by deploying additional client instances and a CloudHSM cluster during the deployment and cutover window without having to restart. Additionally, changes made to the blue cluster after the green cluster was created won't be available in the green cluster—something that can be mitigated by a brief embargo on changes while this cutover process is in progress.

A key advantage to this approach is that it increases application availability without the need for a second crypto user, while still reducing deployment risk and simplifying the rollback process if a deployment fails. Such a deployment pattern is typically automated using continuous integration and continuous delivery (CI/CD) tools such as AWS CodeDeploy. For detailed deployment configuration options, see deployment configurations in CodeDeploy. Figure 3 depicts a possible architecture:

Figure 3: Approach 3 to update crypto user password

To implement approach 3:

  1. Create a cluster from backup. Make sure you restore the new cluster in the same Availability Zone as the existing CloudHSM cluster. This will be your green environment. (A boto3 sketch of the restore and cleanup calls follows this list.)
  2. Spin up new application instances (green environment) and configure them to connect to the new CloudHSM cluster.
  3. Take note of the new CloudHSM cluster security group and attach it to the new client instances.
  4. Follow the steps in To change HSM user passwords or Approach 1 step 2 to change the crypto user password on the new cluster.
  5. Update the client connecting to CloudHSM with the new password.
  6. Add the new client to the existing Application Load Balancer by following Approach 2 steps 6 and 7.
  7. After the deployment is complete, you can delete the old cluster and client instances (blue environment).
    • To delete the CloudHSM cluster using the console.
      1. Open the AWS CloudHSM console.
      2. Select the old cluster and then choose Delete cluster.
      3. Confirm that you want to delete the cluster, then choose Delete.
    • To delete the cluster using the AWS Command Line Interface (AWS CLI), use the following command: aws cloudhsmv2 delete-cluster --cluster-id <cluster ID>
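
If you prefer to script the restore and cleanup steps, the following boto3 sketch shows the equivalent CloudHSM API calls; the backup ID, subnet ID, HSM type, and cluster ID are placeholders to replace with your own values.

    import boto3

    cloudhsm = boto3.client("cloudhsmv2")

    # Step 1: restore the green cluster from a backup of the blue cluster,
    # into a subnet in the same Availability Zone as the blue cluster
    green = cloudhsm.create_cluster(
        HsmType="hsm1.medium",
        SourceBackupId="<backup ID of the existing (blue) cluster>",
        SubnetIds=["<subnet ID in the same Availability Zone>"],
    )
    print("Green cluster ID:", green["Cluster"]["ClusterId"])

    # Step 7: after cutover and validation, delete the blue cluster
    cloudhsm.delete_cluster(ClusterId="<blue cluster ID>")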

How to choose an approach

To better understand which approach is the best fit for your use case, consider the following criteria:

  • Downtime: What is the acceptable amount of downtime for your workload?
  • Implementation complexity: Do you need to make architecture changes to your workload and how complex is the implementation effort?
  • Cost: Is the additional cost required for the approach acceptable to the business?
             Downtime   Relative implementation complexity   Relative infrastructure cost
Approach 1   Yes        Low                                  None
Approach 2   No         Medium                               Medium
Approach 3   No         Medium                               High

Approach 1 — especially when run within a scheduled maintenance window—is the most straightforward of the three approaches because there’s no additional infrastructure required, and workload downtime is the only tradeoff. This is best suited for applications where planned downtime is acceptable and you need to keep solution complexity low.

Approach 2 involves no downtime for the workload, and the second crypto user serves as a backup for future password updates (such as if credentials are lost, or in case there are personnel changes). The downsides are the initial planning required to set up the workload to handle multiple crypto users and share all keys among them, and the additional infrastructure cost. This is best suited for workloads that require zero downtime and an architecture that supports hot swapping of incoming traffic.

Approach 3 also supports zero downtime for the workload, with a complex implementation and some cost to set up additional infrastructure. This is best suited for workloads that require zero downtime, have an architecture that supports hot swapping of incoming traffic, and where you don't want to maintain a second crypto user with shared access to all required cryptographic material.

Conclusion

In this post, we covered three approaches you can take to rotate the crypto user password on your CloudHSM cluster to align with the security best practices of the AWS Well-Architected Framework and to meet your compliance, regulatory, or industry requirements. Each has considerations in terms of relative cost, complexity, and downtime. We recommend carefully mapping each approach to your workload and picking the one best suited to your business and workload needs.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS CloudHSM re:Post or contact AWS Support.

Shankar Rajagopalan

Shankar is a Senior Solutions Architect at Amazon Web Services in Austin, Texas. With two decades of experience in technology consulting, he specializes in sectors such as Telecom and Engineering. His present focus revolves around Security, Compliance, and Privacy.

Abderrahmen Mahmoudi

Abderrahmen is a Senior Partner Solutions Architect at Amazon Web Services supporting partners and public sector customers in the Benelux region (Belgium, Netherlands, and Luxembourg). With a strong background working on Hardware Security Modules management, cryptographic operations, and tooling, he focuses on supporting customers in building secure solutions that meet compliance requirements.

Introducing the AWS WAF traffic overview dashboard

Post Syndicated from Dmitriy Novikov original https://aws.amazon.com/blogs/security/introducing-the-aws-waf-traffic-overview-dashboard/

For many network security operators, protecting application uptime can be a time-consuming challenge of baselining network traffic, investigating suspicious senders, and determining how best to mitigate risks. Simplifying this process and understanding network security posture at all times is the goal of most IT organizations that are trying to scale their applications without also needing to scale their security operations center staff. To help you with this challenge, AWS WAF introduced traffic overview dashboards so that you can make informed decisions about your security posture when your application is protected by AWS WAF.

In this post, we introduce the new dashboards and delve into a few use cases to help you gain better visibility into the overall security of your applications using AWS WAF and make informed decisions based on insights from the dashboards.

Introduction to traffic overview dashboards

The traffic overview dashboard in AWS WAF displays an overview of security-focused metrics so that you can identify and take action on security risks in a few clicks, such as adding rate-based rules during distributed denial of service (DDoS) events. The dashboards include near real-time summaries of the Amazon CloudWatch metrics that AWS WAF collects when it evaluates your application web traffic.

These dashboards are available by default and require no additional setup. They show metrics—total requests, blocked requests, allowed requests, bot compared to non-bot requests, bot categories, CAPTCHA solve rate, top 10 matched rules, and more—for each web access control list (web ACL) that you monitor with AWS WAF.

You can access default metrics such as the total number of requests, blocked requests, and common attacks blocked, or you can customize your dashboard with the metrics and visualizations that are most important to you.

These dashboards provide enhanced visibility and help you answer questions such as these:

  • What percent of the traffic that AWS WAF inspected is getting blocked?
  • What are the top originating countries for the traffic that’s getting blocked?
  • What are common attacks that AWS WAF detects and protects me from?
  • How do my traffic patterns from this week compare with last week?

The dashboard has native and out-of-the-box integration with CloudWatch. Using this integration, you can navigate back and forth between the dashboard and CloudWatch; for example, you can get a more granular metric overview by viewing the dashboard in CloudWatch. You can also add existing CloudWatch widgets and metrics to the traffic overview dashboard, bringing your tried-and-tested visibility structure into the dashboard.

With the introduction of the traffic overview dashboard, one AWS WAF tool—Sampled requests—is now a standalone tab inside a web ACL. In this tab, you can view a graph of the rule matches for web requests that AWS WAF has inspected. Additionally, if you have enabled request sampling, you can see a table view of a sample of the web requests that AWS WAF has inspected.

The sample of requests contains up to 100 requests that matched the criteria for a rule in the web ACL, and up to another 100 requests that didn't match any rule and thus had the web ACL's default action applied. The requests in the sample come from the protected resources that have received requests for your content in the previous three hours.
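
If you want to pull the same sample programmatically, for example to feed it into your own analysis tooling, the following boto3 sketch uses the GetSampledRequests API; the web ACL ARN and rule metric name are placeholders for your own values.

    from datetime import datetime, timedelta, timezone
    import boto3

    wafv2 = boto3.client("wafv2")

    # Sampled requests cover the previous three hours
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=3)

    response = wafv2.get_sampled_requests(
        WebAclArn="<web ACL ARN>",
        RuleMetricName="<metric name of the rule to sample>",
        Scope="REGIONAL",  # use "CLOUDFRONT" for web ACLs associated with CloudFront
        TimeWindow={"StartTime": start, "EndTime": end},
        MaxItems=100,
    )
    for sample in response["SampledRequests"]:
        print(sample["Action"], sample["Request"]["ClientIP"], sample["Request"]["URI"])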

The following figure shows a typical layout for the traffic overview dashboard. It categorizes inspected requests with a breakdown of each of the categories that display actionable insights, such as attack types, client device types, and countries. Using this information and comparing it with your expected traffic profile, you can decide whether to investigate further or block the traffic right away. For the example in Figure 1, you might want to block France-originating requests from mobile devices if your web application isn’t supposed to receive traffic from France and is a desktop-only application.

Figure 1: Dashboard with sections showing multiple categories serves as a single pane of glass

Use case 1: Analyze traffic patterns with the dashboard

In addition to visibility into your web traffic, you can use the new dashboard to analyze patterns that could indicate potential threats or issues. By reviewing the dashboard’s graphs and metrics, you can spot unusual spikes or drops in traffic that deserve further investigation.

The top-level overview shows the high-level traffic volume and patterns. From there, you can drill down into the web ACL metrics to see traffic trends and metrics for specific rules and rule groups. The dashboard displays metrics such as allowed requests, blocked requests, and more.

Notifications or alerts about a deviation from expected traffic patterns provide you a signal to explore the event. During your exploration, you can use the dashboard to understand the broader context and not just the event in isolation. This makes it simpler to detect a trend in anomalies that could signify a security event or misconfigured rules. For example, if you normally get 2,000 requests per minute from a particular country, but suddenly see 10,000 requests per minute from it, you should investigate. Using the dashboard, you can look at the traffic across various dimensions. The spike in requests alone might not be a clear indication of a threat, but if you see an additional indicator, such as an unexpected device type, this could be a strong reason for you to take follow-up action.
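
One way to generate that kind of notification is a CloudWatch alarm on the same web ACL metrics that back the dashboard. The following boto3 sketch is an illustration only; the web ACL name, Region, SNS topic, and threshold are placeholders, and the threshold should come from your own traffic baseline.

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    cloudwatch.put_metric_alarm(
        AlarmName="waf-blocked-requests-spike",
        Namespace="AWS/WAFV2",
        MetricName="BlockedRequests",
        Dimensions=[
            {"Name": "WebACL", "Value": "<web ACL name>"},
            {"Name": "Region", "Value": "<AWS Region>"},
            {"Name": "Rule", "Value": "ALL"},
        ],
        Statistic="Sum",
        Period=300,                 # 5-minute aggregation
        EvaluationPeriods=1,
        Threshold=10000,            # set this from your own baseline
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["<SNS topic ARN for notifications>"],
    )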

The following figure shows the actions taken by rules in a web ACL and which rule matched the most.

Figure 2: Multidimensional overview of the web requests

The dashboard also shows the top blocked and allowed requests over time. Check whether unusual spikes in blocked requests correspond to spikes in traffic from a particular IP address, country, or user agent. That could indicate attempted malicious activity or bot traffic.

The following figure shows a disproportionately larger number of matches to a rule indicating that a particular vector is used against a protected web application.

Figure 3: The top terminating rule could indicate a particular vector of an attack

Likewise, review the top allowed requests. If you see a spike in traffic to a specific URL, you should investigate whether your application is working properly.

Next steps after you analyze traffic

After you’ve analyzed the traffic patterns, here are some next steps to consider:

  • Tune your AWS WAF rules to better match legitimate or malicious traffic based on your findings. You might be able to fine-tune rules to reduce false positives or false negatives. Tune rules that are blocking legitimate traffic by adjusting regular expressions or conditions.
  • Configure AWS WAF logging, and if you have a dedicated security information and event management (SIEM) solution, integrate the logging to enable automated alerting for anomalies.
  • Set up AWS WAF to automatically block known malicious IPs. You can maintain an IP block list based on identified threat actors. Additionally, you can use the Amazon IP reputation list managed rule group, which the Amazon Threat Research Team regularly updates.
  • If you see spikes in traffic to specific pages, check that your web applications are functioning properly to rule out application issues driving unusual patterns.
  • Add new rules to block new attack patterns that you spot in the traffic flows. Then review the metrics to help confirm the impact of the new rules.
  • Monitor source IPs for DDoS events and other malicious spikes. Use AWS WAF rate-based rules to help mitigate these spikes.
  • If you experience traffic floods, implement additional layers of protection by using CloudFront with DDoS protection.

The new dashboard gives you valuable insight into the traffic that reaches your applications and takes the guesswork out of traffic analysis. Using the insights that it provides, you can fine-tune your AWS WAF protections and block threats before they affect availability or data. Analyze the data regularly to help detect potential threats and make informed decisions about optimizing.

As an example, if the dashboard shows an unexpected spike of traffic, conspicuous compared to historical traffic patterns, from a country where you don't anticipate traffic originating, you can create a geographic match rule statement in your web ACL to block that traffic and prevent it from reaching your web application.
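
As an illustration, a geographic match rule is expressed as a rule statement like the sketch below; this shows only the rule definition, which you would append to the web ACL's rules (for example, by retrieving the web ACL with GetWebACL and writing it back with UpdateWebACL using the returned lock token). The name, priority, and country code are example values.

    # Example AWS WAF rule definition (Python dict in the shape expected by the wafv2 APIs)
    geo_block_rule = {
        "Name": "block-unexpected-country",
        "Priority": 10,  # choose a priority that fits your existing rules
        "Statement": {
            "GeoMatchStatement": {"CountryCodes": ["FR"]}
        },
        "Action": {"Block": {}},
        "VisibilityConfig": {
            "SampledRequestsEnabled": True,
            "CloudWatchMetricsEnabled": True,
            "MetricName": "block-unexpected-country",
        },
    }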

The dashboard is a great tool to gain insights and to understand how AWS WAF managed rules help protect your traffic.

Use case 2: Understand bot traffic during onboarding and fine-tune your bot control rule group

With AWS WAF Bot Control, you can monitor, block, or rate limit bots such as scrapers, scanners, crawlers, status monitors, and search engines. If you use the targeted inspection level of the rule group, you can also challenge bots that don’t self-identify, making it harder and more expensive for malicious bots to operate against your website.

On the traffic overview dashboard, under the Bot Control overview tab, you can see how much of your current traffic is coming from bots, based on request sampling (if you don’t have Bot Control enabled) and real-time CloudWatch metrics (if you do have Bot Control enabled).

During your onboarding phase, use this dashboard to monitor your traffic and understand how much of it comes from various types of bots. You can use this as a starting point to customize your bot management. For example, you can enable common bot control rule groups in count mode and see if desired traffic is being mislabeled. Then you can add rule exceptions, as described in AWS WAF Bot Control example: Allow a specific blocked bot.

The following figure shows a collection of widgets that visualize various dimensions of requests detected as generated by bots. By understanding categories and volumes, you can make an informed decision to either investigate by further delving into logs or block a specific category if it’s clear that it’s unwanted traffic.

Figure 4: Collection of bot-related metrics on the dashboard

After you get started, you can use the same dashboard to monitor your bot traffic and evaluate adding targeted detection for sophisticated bots that don’t self-identify. Targeted protections use detection techniques such as browser interrogation, fingerprinting, and behavior heuristics to identify bad bot traffic. AWS WAF tokens are an integral part of these enhanced protections.

AWS WAF creates, updates, and encrypts tokens for clients that successfully respond to silent challenges and CAPTCHA puzzles. When a client with a token sends a web request, it includes the encrypted token, and AWS WAF decrypts the token and verifies its contents.

In the Bot Control dashboard, the token status pane shows counts for the various token status labels, paired with the rule action that was applied to the request. The IP token absent thresholds pane shows data for requests from IPs that sent too many requests without a token. You can use this information to fine-tune your AWS WAF configuration.

For example, within a Bot Control rule group, it’s possible for a request without a valid token to exit the rule group evaluation and continue to be evaluated by the web ACL. To block requests that are missing their token or for which the token is rejected, you can add a rule to run immediately after the managed rule group to capture and block requests that the rule group doesn’t handle for you. Using the Token status pane, illustrated in Figure 5, you can also monitor the volume of requests that acquire tokens and decide if you want to rate limit or block such requests.

Figure 5: Token status enables monitoring of the volume of requests that acquire tokens

Comparison with CloudFront security dashboard

The AWS WAF traffic overview dashboard provides enhanced overall visibility into web traffic reaching resources that are protected with AWS WAF. In contrast, the CloudFront security dashboard brings AWS WAF visibility and controls directly to your CloudFront distribution. If you want the detailed visibility and analysis of patterns that could indicate potential threats or issues, then the AWS WAF traffic overview dashboard is the best fit. However, if your goal is to manage application delivery and security in one place without navigating between service consoles and to gain visibility into your application’s top security trends, allowed and blocked traffic, and bot activity, then the CloudFront security dashboard could be a better option.

Availability and pricing

The new dashboards are available in the AWS WAF console, and you can use them to better monitor your traffic. These dashboards are available by default, at no cost, and require no additional setup. CloudWatch logging has a separate pricing model, and if you have full logging enabled, you will incur CloudWatch charges; for more information, see Amazon CloudWatch pricing. You can customize the dashboards if you want to tailor the displayed data to the needs of your environment.

Conclusion

With the AWS WAF traffic overview dashboard, you can get actionable insights on your web security posture and traffic patterns that might need your attention to improve your perimeter protection.

In this post, you learned how to use the dashboard to help secure your web application. You walked through traffic patterns analysis and possible next steps. Additionally, you learned how to observe traffic from bots and follow up with actions related to them according to the needs of your application.

The AWS WAF traffic overview dashboard is designed to meet most use cases and be a go-to default option for security visibility over web traffic. However, if you’d prefer to create a custom solution, see the guidance in the blog post Deploy a dashboard for AWS WAF with minimal effort.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Dmitriy Novikov

As a Senior Solutions Architect at AWS, Dmitriy supports AWS customers to use emerging technologies to generate business value. He’s a technology enthusiast who loves finding innovative solutions to complex challenges. He enjoys sharing his learnings on architecture and best practices in blog posts and whitepapers and at events. Outside of work, Dmitriy has a passion for reading and triathlons.

Harith Gaddamanugu

Harith works at AWS as a Senior Edge Specialist Solutions Architect. He stays motivated by solving problems for customers across AWS Perimeter Protection and Edge services. When he’s not working, he enjoys spending time outdoors with friends and family.

Data governance in the age of generative AI

Post Syndicated from Krishna Rupanagunta original https://aws.amazon.com/blogs/big-data/data-governance-in-the-age-of-generative-ai/

Data is your generative AI differentiator, and a successful generative AI implementation depends on a robust data strategy incorporating a comprehensive data governance approach. Working with large language models (LLMs) for enterprise use cases requires the implementation of quality and privacy considerations to drive responsible AI. However, enterprise data generated from siloed sources combined with the lack of a data integration strategy creates challenges for provisioning the data for generative AI applications. The need for an end-to-end strategy for data management and data governance at every step of the journey—from ingesting, storing, and querying data to analyzing, visualizing, and running artificial intelligence (AI) and machine learning (ML) models—continues to be of paramount importance for enterprises.

In this post, we discuss the data governance needs of generative AI application data pipelines, a critical building block to govern data used by LLMs to improve the accuracy and relevance of their responses to user prompts in a safe, secure, and transparent manner. Enterprises are doing this by using proprietary data with approaches like Retrieval Augmented Generation (RAG), fine-tuning, and continued pre-training with foundation models.

Data governance is a critical building block across all these approaches, and we see two emerging areas of focus. First, many LLM use cases rely on enterprise knowledge that needs to be drawn from unstructured data such as documents, transcripts, and images, in addition to structured data from data warehouses. Unstructured data is typically stored across siloed systems in varying formats, and generally not managed or governed with the same level of rigor as structured data. Second, generative AI applications introduce a higher number of data interactions than conventional applications, which requires that the data security, privacy, and access control policies be implemented as part of the generative AI user workflows.

In this post, we cover data governance for building generative AI applications on AWS with a lens on structured and unstructured enterprise knowledge sources, and the role of data governance during the user request-response workflows.

Use case overview

Let’s explore an example of a customer support AI assistant. The following figure shows the typical conversational workflow that is initiated with a user prompt.

The workflow includes the following key data governance steps:

  1. Prompt user access control and security policies.
  2. Access policies to extract permissions based on relevant data and filter out results based on the prompt user role and permissions.
  3. Enforce data privacy policies such as personally identifiable information (PII) redactions.
  4. Enforce fine-grained access control.
  5. Grant the user role permissions for sensitive information and compliance policies.

To provide a response that includes the enterprise context, each user prompt needs to be augmented with a combination of insights from structured data from the data warehouse and unstructured data from the enterprise data lake. On the backend, the batch data engineering processes refreshing the enterprise data lake need to expand to ingest, transform, and manage unstructured data. As part of the transformation, the objects need to be treated to ensure data privacy (for example, PII redaction). Finally, access control policies also need to be extended to the unstructured data objects and to vector data stores.

Let’s look at how data governance can be applied to the enterprise knowledge source data pipelines and the user request-response workflows.

Enterprise knowledge: Data management

The following figure summarizes data governance considerations for data pipelines and the workflow for applying data governance.

Data governance steps in data pipelines

In the above figure, the data engineering pipelines include the following data governance steps:

  1. Create and update a catalog through data evolution.
  2. Implement data privacy policies.
  3. Implement data quality by data type and source.
  4. Link structured and unstructured datasets.
  5. Implement unified fine-grained access controls for structured and unstructured datasets.

Let's look at some of the key changes in the data pipelines, namely data cataloging, data quality, and vector embedding security, in more detail.

Data discoverability

Unlike structured data, which is managed in well-defined rows and columns, unstructured data is stored as objects. For users to be able to discover and comprehend the data, the first step is to build a comprehensive catalog using the metadata that is generated and captured in the source systems. This starts with the objects (such as documents and transcript files) being ingested from the relevant source systems into the raw zone in the data lake in Amazon Simple Storage Service (Amazon S3) in their respective native formats (as illustrated in the preceding figure). From here, object metadata (such as file owner, creation date, and confidentiality level) is extracted and queried using Amazon S3 capabilities. Metadata can vary by data source, and it’s important to examine the fields and, where required, derive the necessary fields to complete all the necessary metadata. For instance, if an attribute like content confidentiality is not tagged at a document level in the source application, this may need to be derived as part of the metadata extraction process and added as an attribute in the data catalog. The ingestion process needs to capture object updates (changes, deletions) in addition to new objects on an ongoing basis. For detailed implementation guidance, refer to Unstructured data management and governance using AWS AI/ML and analytics services. To further simplify the discovery and introspection between business glossaries and technical data catalogs, you can use Amazon DataZone for business users to discover and share data stored across data silos.
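
As a minimal illustration of this cataloging step, the following boto3 sketch lists objects in the raw zone and reads their system metadata and user-defined metadata; the bucket name, prefix, and the confidentiality metadata key are assumptions for this example.

    import boto3

    s3 = boto3.client("s3")

    bucket = "<raw zone bucket>"
    prefix = "raw/transcripts/"  # example prefix for one source system

    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            head = s3.head_object(Bucket=bucket, Key=obj["Key"])
            # System metadata (size, last modified) plus any user-defined
            # x-amz-meta-* attributes written by the ingestion process
            print(
                obj["Key"],
                head["ContentLength"],
                head["LastModified"],
                head.get("Metadata", {}).get("confidentiality", "untagged"),
            )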

Data privacy

Enterprise knowledge sources often contain PII and other sensitive data (such as addresses and Social Security numbers). Based on your data privacy policies, these elements need to be treated (masked, tokenized, or redacted) from the sources before they can be used for downstream use cases. From the raw zone in Amazon S3, the objects need to be processed before they can be consumed by downstream generative AI models. A key requirement here is PII identification and redaction, which you can implement with Amazon Comprehend. It’s important to remember that it will not always be feasible to strip away all the sensitive data without impacting the context of the data. Semantic context is one of the key factors that drive the accuracy and relevance of generative AI model outputs, and it’s critical to work backward from the use case and strike the necessary balance between privacy controls and model performance.
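
The following boto3 sketch shows the core of such a PII redaction step with Amazon Comprehend; in practice you would run this inside your transformation jobs and tune which entity types you mask so that you preserve the semantic context your use case needs.

    import boto3

    comprehend = boto3.client("comprehend")

    def redact_pii(text: str, language: str = "en") -> str:
        """Replace detected PII spans with their entity type, for example [SSN]."""
        entities = comprehend.detect_pii_entities(Text=text, LanguageCode=language)["Entities"]
        # Redact from the end of the string so earlier offsets stay valid
        for entity in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
            text = text[: entity["BeginOffset"]] + f"[{entity['Type']}]" + text[entity["EndOffset"]:]
        return text

    print(redact_pii("My SSN is 123-45-6789 and I live at 100 Main Street."))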

Data enrichment

Beyond the source metadata, additional attributes may need to be extracted from the objects. Amazon Comprehend provides capabilities for entity recognition (for example, identifying domain-specific data like policy numbers and claim numbers) and custom classification (for example, categorizing a customer care chat transcript based on the issue description). Furthermore, you may need to combine the unstructured and structured data to create a holistic picture of key entities, like customers. For example, in an airline loyalty scenario, there would be significant value in linking unstructured data captured from customer interactions (such as customer chat transcripts and customer reviews) with structured data signals (such as ticket purchases and miles redemption) to create a more complete customer profile that can then enable the delivery of better and more relevant trip recommendations. AWS Entity Resolution is an ML service that helps in matching and linking records. This service helps link related sets of information to create deeper, more connected data about key entities like customers, products, and so on, which can further improve the quality and relevance of LLM outputs. This is available in the transformed zone in Amazon S3 and is ready to be consumed downstream for vector stores, fine-tuning, or training of LLMs. After these transformations, data can be made available in the curated zone in Amazon S3.

Data quality

Realizing the full potential of generative AI depends on the quality of the data that is used to train the models, as well as the data that is used to augment and enhance the model's response to a user input. The accuracy, bias, and reliability of the models and their outcomes are directly proportional to the quality of the data used to build and train them.

Amazon SageMaker Model Monitor provides a proactive detection of deviations in model data quality drift and model quality metrics drift. It also monitors bias drift in your model’s predictions and feature attribution. For more details, refer to Monitoring in-production ML models at large scale using Amazon SageMaker Model Monitor. Detecting bias in your model is a fundamental building block to responsible AI, and Amazon SageMaker Clarify helps detect potential bias that can produce a negative or a less accurate result. To learn more, see Learn how Amazon SageMaker Clarify helps detect bias.

A newer area of focus in generative AI is the use and quality of data in prompts from enterprise and proprietary data stores. An emerging best practice to consider here is shift-left, which puts a strong emphasis on early and proactive quality assurance mechanisms. In the context of data pipelines designed to process data for generative AI applications, this implies identifying and resolving data quality issues earlier upstream to mitigate the potential impact of data quality issues later. AWS Glue Data Quality not only measures and monitors the quality of your data at rest in your data lakes, data warehouses, and transactional databases, but also allows early detection and correction of quality issues for your extract, transform, and load (ETL) pipelines to ensure your data meets the quality standards before it is consumed. For more details, refer to Getting started with AWS Glue Data Quality from the AWS Glue Data Catalog.
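
As a sketch of this shift-left idea, you can define a Data Quality Definition Language (DQDL) ruleset against a Data Catalog table with boto3; the database name, table name, and rules below are assumptions for illustration.

    import boto3

    glue = boto3.client("glue")

    # Example DQDL rules; adjust to your own tables and quality expectations
    ruleset = """
    Rules = [
        IsComplete "customer_id",
        ColumnValues "confidentiality_level" in ["public", "internal", "restricted"]
    ]
    """

    glue.create_data_quality_ruleset(
        Name="curated-zone-customer-rules",
        Ruleset=ruleset,
        TargetTable={
            "DatabaseName": "curated_zone",
            "TableName": "customer_interactions",
        },
    )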

Vector store governance

Embeddings in vector databases elevate the intelligence and capabilities of generative AI applications by enabling features such as semantic search and reducing hallucinations. Embeddings typically contain private and sensitive data, and encrypting the data is a recommended step in the user input workflow. Amazon OpenSearch Serverless stores and searches your vector embeddings, and encrypts your data at rest with AWS Key Management Service (AWS KMS). For more details, see Introducing the vector engine for Amazon OpenSearch Serverless, now in preview. Similarly, additional vector engine options on AWS, including Amazon Kendra and Amazon Aurora, encrypt your data at rest with AWS KMS. For more information, refer to Encryption at rest and Protecting data using encryption.
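
For example, an encryption policy that pins a collection of vector embeddings to a customer managed KMS key can be created with boto3, roughly as sketched below; the collection name and key ARN are placeholders, and the exact shape of the policy document should be verified against the OpenSearch Serverless documentation.

    import json
    import boto3

    aoss = boto3.client("opensearchserverless")

    # Assumed policy document shape; check the OpenSearch Serverless docs for the current schema
    encryption_policy = {
        "Rules": [
            {"ResourceType": "collection", "Resource": ["collection/vector-embeddings"]}
        ],
        "AWSOwnedKey": False,
        "KmsARN": "<customer managed KMS key ARN>",
    }

    aoss.create_security_policy(
        name="vector-embeddings-encryption",
        type="encryption",
        policy=json.dumps(encryption_policy),
    )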

As embeddings are generated and stored in a vector store, controlling access to the data with role-based access control (RBAC) becomes a key requirement for maintaining overall security. Amazon OpenSearch Service provides fine-grained access control (FGAC) features with AWS Identity and Access Management (IAM) roles that can be associated with Amazon Cognito users. Corresponding user access control mechanisms are also provided by OpenSearch Serverless, Amazon Kendra, and Aurora. To learn more, refer to Data access control for Amazon OpenSearch Serverless, Controlling user access to documents with tokens, and Identity and access management for Amazon Aurora, respectively.

User request-response workflows

Controls in the data governance plane need to be integrated into the generative AI application as part of the overall solution deployment to ensure compliance with data security (based on role-based access controls) and data privacy (based on role-based access to sensitive data) policies. The following figure illustrates the workflow for applying data governance.

Data governance in user prompt workflow

The workflow includes the following key data governance steps:

  1. Provide a valid input prompt for alignment with compliance policies (for example, bias and toxicity).
  2. Generate a query by mapping prompt keywords with the data catalog.
  3. Apply FGAC policies based on user role.
  4. Apply RBAC policies based on user role.
  5. Apply data and content redaction to the response based on user role permissions and compliance policies.

As part of the prompt cycle, the user prompt must be parsed and keywords extracted to ensure alignment with compliance policies using a service like Amazon Comprehend (see New for Amazon Comprehend – Toxicity Detection) or Guardrails for Amazon Bedrock (preview). When that is validated, if the prompt requires structured data to be extracted, the keywords can be used against the data catalog (business or technical) to extract the relevant data tables and fields and construct a query from the data warehouse. The user permissions are evaluated using AWS Lake Formation to filter the relevant data. In the case of unstructured data, the search results are restricted based on the user permission policies implemented in the vector store. As a final step, the output response from the LLM needs to be evaluated against user permissions (to ensure data privacy and security) and compliance with safety (for example, bias and toxicity guidelines).

Although this process is specific to a RAG implementation and is applicable to other LLM implementation strategies, there are additional controls:

  • Prompt engineering – Access to the prompt templates needs to be restricted based on access controls augmented by business logic.
  • Fine-tuning models and training foundation models – In cases where objects from the curated zone in Amazon S3 are used as training data for fine-tuning the foundation models, the permissions policies need to be configured with Amazon S3 identity and access management at the bucket or object level based on the requirements.

Summary

Data governance is critical to enabling organizations to build enterprise generative AI applications. As enterprise use cases continue to evolve, there will be a need to expand the data infrastructure to govern and manage new, diverse, unstructured datasets to ensure alignment with privacy, security, and quality policies. These policies need to be implemented and managed as part of data ingestion, storage, and management of the enterprise knowledge base along with the user interaction workflows. This makes sure that the generative AI applications not only minimize the risk of sharing inaccurate or wrong information, but also protect from bias and toxicity that can lead to harmful or libelous outcomes. To learn more about data governance on AWS, see What is Data Governance?

In subsequent posts, we will provide implementation guidance on how to expand the governance of the data infrastructure to support generative AI use cases.


About the Authors

Krishna Rupanagunta leads a team of Data and AI Specialists at AWS. He and his team work with customers to help them innovate faster and make better decisions using Data, Analytics, and AI/ML. He can be reached via LinkedIn.

Imtiaz (Taz) Sayed is the WW Tech Leader for Analytics at AWS. He enjoys engaging with the community on all things data and analytics. He can be reached via LinkedIn.

Raghvender Arni (Arni) leads the Customer Acceleration Team (CAT) within AWS Industries. The CAT is a global cross-functional team of customer-facing cloud architects, software engineers, data scientists, and AI/ML experts and designers that drives innovation via advanced prototyping and cloud operational excellence via specialized technical expertise.

Enhance container software supply chain visibility through SBOM export with Amazon Inspector and QuickSight

Post Syndicated from Jason Ng original https://aws.amazon.com/blogs/security/enhance-container-software-supply-chain-visibility-through-sbom-export-with-amazon-inspector-and-quicksight/

In this post, I’ll show how you can export software bills of materials (SBOMs) for your containers by using an AWS native service, Amazon Inspector, and visualize the SBOMs through Amazon QuickSight, providing a single-pane-of-glass view of your organization’s software supply chain.

The concept of a bill of materials (BOM) originated in the manufacturing industry in the early 1960s. It was used to keep track of the quantities of each material used to manufacture a completed product. If parts were found to be defective, engineers could then use the BOM to identify products that contained those parts. An SBOM extends this concept to software development, allowing engineers to keep track of vulnerable software packages and quickly remediate the vulnerabilities.

Today, most software includes open source components. A Synopsys study, Walking the Line: GitOps and Shift Left Security, shows that 8 in 10 organizations reported using open source software in their applications. Consider a scenario in which you specify an open source base image in your Dockerfile but don’t know what packages it contains. Although this practice can significantly improve developer productivity and efficiency, the decreased visibility makes it more difficult for your organization to manage risk effectively.

It’s important to track the software components and their versions that you use in your applications, because a single affected component used across multiple organizations could result in a major security impact. According to a Gartner report titled Gartner Report for SBOMs: Key Takeaways You Should Know, by 2025, 60 percent of organizations building or procuring critical infrastructure software will mandate and standardize SBOMs in their software engineering practice, up from less than 20 percent in 2022. This will help provide much-needed visibility into software supply chain security.

Integrating SBOM workflows into the software development life cycle is just the first step—visualizing SBOMs and being able to search through them quickly is the next step. This post describes how to process the generated SBOMs and visualize them with Amazon QuickSight. AWS also recently added SBOM export capability in Amazon Inspector, which offers the ability to export SBOMs for Amazon Inspector monitored resources, including container images.

Why is vulnerability scanning not enough?

Scanning and monitoring vulnerable components that pose cybersecurity risks is known as vulnerability scanning, and is fundamental to organizations for ensuring a strong and solid security posture. Scanners usually rely on a database of known vulnerabilities, the most common being the Common Vulnerabilities and Exposures (CVE) database.

Identifying vulnerable components with a scanner can prevent an engineer from deploying affected applications into production. You can embed scanning into your continuous integration and continuous delivery (CI/CD) pipelines so that images with known vulnerabilities don’t get pushed into your image repository. However, what if a new vulnerability is discovered but has not been added to the CVE records yet? A good example of this is the Apache Log4j vulnerability, which was first disclosed on Nov 24, 2021 and only added as a CVE on Dec 1, 2021. This means that for 7 days, scanners that relied on the CVE system weren’t able to identify affected components within their organizations. This issue is known as a zero-day vulnerability. Being able to quickly identify vulnerable software components in your applications in such situations would allow you to assess the risk and come up with a mitigation plan without waiting for a vendor or supplier to provide a patch.

In addition, it’s also good hygiene for your organization to track usage of software packages, which provides visibility into your software supply chain. This can improve collaboration between developers, operations, and security teams, because they’ll have a common view of every software component and can collaborate effectively to address security threats.

In this post, I present a solution that uses the new Amazon Inspector feature to export SBOMs from container images, process them, and visualize the data in QuickSight. This gives you the ability to search through your software inventory on a dashboard and to use natural language queries through QuickSight Q, in order to look for vulnerabilities.

Solution overview

Figure 1 shows the architecture of the solution. It is fully serverless, meaning there is no underlying infrastructure you need to manage. This post uses a newly released feature within Amazon Inspector that provides the ability to export a consolidated SBOM for Amazon Inspector monitored resources across your organization in commonly used formats, including CycloneDX and SPDX.

Figure 1: Solution architecture diagram

The workflow in Figure 1 is as follows:

  1. The image is pushed into Amazon Elastic Container Registry (Amazon ECR), which sends an Amazon EventBridge event.
  2. This invokes an AWS Lambda function, which starts the SBOM generation job for the specific image.
  3. When the job completes, Amazon Inspector deposits the SBOM file in an Amazon Simple Storage Service (Amazon S3) bucket.
  4. Another Lambda function is invoked whenever a new JSON file is deposited. The function performs the data transformation steps and uploads the new file into a new S3 bucket.
  5. Amazon Athena is then used to perform preliminary data exploration.
  6. A dashboard on Amazon QuickSight displays SBOM data.

Implement the solution

This section describes how to deploy the solution architecture.

In this post, you’ll perform the following tasks:

  • Create S3 buckets and AWS KMS keys to store the SBOMs
  • Create an Amazon Elastic Container Registry (Amazon ECR) repository
  • Deploy two AWS Lambda functions to initiate the SBOM generation and transformation
  • Set up Amazon EventBridge rules to invoke Lambda functions upon image push into Amazon ECR
  • Run AWS Glue crawlers to crawl the transformed SBOM S3 bucket
  • Run Amazon Athena queries to review SBOM data
  • Create QuickSight dashboards to identify libraries and packages
  • Use QuickSight Q to identify libraries and packages by using natural language queries

Deploy the CloudFormation stack

The AWS CloudFormation template we’ve provided provisions the S3 buckets that are required for the storage of raw SBOMs and transformed SBOMs, the Lambda functions necessary to initiate and process the SBOMs, and EventBridge rules to run the Lambda functions based on certain events. An empty repository is provisioned as part of the stack, but you can also use your own repository.
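
To make the event wiring concrete, the following Python (boto3) sketch creates an EventBridge rule equivalent to what the template provisions: it matches successful image pushes to the repository and targets the invocation Lambda function. The rule name, function ARN, and event pattern fields are assumptions for illustration; deploying the CloudFormation template remains the supported path.

import json
import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

# Assumed names for illustration; the CloudFormation template provisions its own resources.
RULE_NAME = "sbom-inspector-ecr-push"
FUNCTION_ARN = "arn:aws:lambda:us-east-1:111122223333:function:sbom-invocation-task"

# Event pattern for a successful push to the sbom-inspector repository (field names assumed
# from the "ECR Image Action" event emitted by Amazon ECR).
pattern = {
    "source": ["aws.ecr"],
    "detail-type": ["ECR Image Action"],
    "detail": {"action-type": ["PUSH"], "result": ["SUCCESS"], "repository-name": ["sbom-inspector"]},
}

rule_arn = events.put_rule(Name=RULE_NAME, EventPattern=json.dumps(pattern), State="ENABLED")["RuleArn"]
events.put_targets(Rule=RULE_NAME, Targets=[{"Id": "sbom-invocation", "Arn": FUNCTION_ARN}])

# Allow EventBridge to invoke the Lambda function.
lambda_client.add_permission(
    FunctionName=FUNCTION_ARN,
    StatementId="AllowEventBridgeInvoke",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule_arn,
)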

To deploy the CloudFormation stack

  1. Download the CloudFormation template.
  2. Browse to the CloudFormation service in your AWS account and choose Create Stack.
  3. Upload the CloudFormation template you downloaded earlier.
  4. For the next step, Specify stack details, enter a stack name.
  5. You can keep the default value of sbom-inspector for EnvironmentName.
  6. Specify the Amazon Resource Name (ARN) of the user or role to be the admin for the KMS key.
  7. Deploy the stack.

Set up Amazon Inspector

If this is the first time you’re using Amazon Inspector, you need to activate the service. In the Getting started with Amazon Inspector topic in the Amazon Inspector User Guide, follow Step 1 to activate the service. This will take some time to complete.
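
If you prefer to activate the service with the AWS SDK instead of the console, the following is a minimal Python sketch that enables Amazon Inspector ECR scanning for the current account, which is the scan type this solution relies on.

import boto3

inspector = boto3.client("inspector2")

# Enable Amazon Inspector ECR scanning for the current account; this solution only needs ECR.
# Add "EC2" or "LAMBDA" to resourceTypes if you also want those scan types.
response = inspector.enable(resourceTypes=["ECR"])
print(response)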

Figure 2: Activate Amazon Inspector

SBOM invocation and processing Lambda functions

This solution uses two Lambda functions written in Python to perform the invocation task and the transformation task.

  • Invocation task — This function is run whenever a new image is pushed into Amazon ECR. It takes the repository name and image tag as input and passes them to the create_sbom_export function, requesting the SPDX format. Scoping the export to that specific image prevents duplicated SBOMs, which helps keep the S3 data size small.
  • Transformation task — This function is run whenever a new file with the suffix .json is added to the raw S3 bucket. It creates two files, as follows:
    1. It extracts information such as image ARN, account number, package, package version, operating system, and SHA from the SBOM and exports this data to the transformed S3 bucket under a folder named sbom/.
    2. Because each package can have more than one CVE, this function also extracts the CVEs from each package and stores them in the same bucket in a directory named cve/. Both files are exported in Apache Parquet format, which is optimized for queries in Amazon Athena. A minimal sketch of this transformation follows this list.
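
The following Python sketch illustrates the core of the transformation task. It assumes an S3 event notification trigger, the package field names of the SPDX JSON output, and a Lambda layer or container image that bundles pandas and pyarrow; the function deployed by the CloudFormation template also extracts fields such as the image ARN, account ID, operating system, SHA, and CVEs.

import json

import boto3
import pandas as pd  # assumes a layer or image that bundles pandas and pyarrow

s3 = boto3.client("s3")
TRANSFORMED_BUCKET = "sbom-inspector-111122223333-transformed"  # placeholder bucket name

def handler(event, context):
    # Triggered when Amazon Inspector deposits a raw SBOM JSON file in S3
    # (assumes an S3 event notification payload).
    record = event["Records"][0]["s3"]
    bucket, key = record["bucket"]["name"], record["object"]["key"]
    sbom = json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())

    # Flatten the SPDX package list into rows (field names assume the SPDX JSON layout).
    rows = [
        {"package": p.get("name"), "packageversion": p.get("versionInfo")}
        for p in sbom.get("packages", [])
    ]
    df = pd.DataFrame(rows)

    # Write Parquet locally, then upload to the transformed bucket for Athena.
    local_path = "/tmp/sbom.parquet"
    df.to_parquet(local_path, index=False)
    s3.upload_file(local_path, TRANSFORMED_BUCKET, f"sbom/{key}.parquet")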

Populate the AWS Glue Data Catalog

To populate the AWS Glue Data Catalog, you need to generate the SBOM files by using the Lambda functions that were created earlier.

To populate the AWS Glue Data Catalog

  1. You can use an existing image, or you can continue on to create a sample image.
  2. Open an AWS Cloudshell terminal.
  3. Run the following commands:
    # Pull the nginx image from a public repo
    docker pull public.ecr.aws/nginx/nginx:1.19.10-alpine-perl
    
    docker tag public.ecr.aws/nginx/nginx:1.19.10-alpine-perl <ACCOUNT-ID>.dkr.ecr.us-east-1.amazonaws.com/sbom-inspector:nginxperl
    
    # Authenticate to ECR, fill in your account id
    aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <ACCOUNT-ID>.dkr.ecr.us-east-1.amazonaws.com
    
    # Push the image into ECR
    docker push <ACCOUNT-ID>.dkr.ecr.us-east-1.amazonaws.com/sbom-inspector:nginxperl

  4. An image is pushed into the Amazon ECR repository in your account. This invokes the Lambda functions that perform the SBOM export by using Amazon Inspector and converts the SBOM file to Parquet.
  5. Verify that the Parquet files are in the transformed S3 bucket:
    1. Browse to the S3 console and choose the bucket named sbom-inspector-<ACCOUNT-ID>-transformed. You can also track the invocation of each Lambda function in the Amazon CloudWatch log console.
    2. After the transformation step is complete, you will see two folders (cve/ and sbom/) in the transformed S3 bucket. Choose the sbom folder. You will see the transformed Parquet file in it. If there are CVEs present, a similar file will appear in the cve folder.

    The next step is to run an AWS Glue crawler to determine the format, schema, and associated properties of the raw data. You will need to crawl both folders in the transformed S3 bucket and store the schema in separate tables in the AWS Glue Data Catalog.

  6. On the AWS Glue Service console, on the left navigation menu, choose Crawlers.
  7. On the Crawlers page, choose Create crawler. This starts a series of pages that prompt you for the crawler details.
  8. In the Crawler name field, enter sbom-crawler, and then choose Next.
  9. Under Data sources, select Add a data source.
  10. Now you need to point the crawler to your data. On the Add data source page, choose the Amazon S3 data store. The solution in this post doesn’t use a connection, so leave the Connection field blank if it’s visible.
  11. For the option Location of S3 data, choose In this account. Then, for S3 path, enter the path where the crawler can find the sbom and cve data, which is s3://sbom-inspector-<ACCOUNT-ID>-transformed/sbom/ and s3://sbom-inspector-<ACCOUNT-ID>-transformed/cve/. Leave the rest as default and select Add an S3 data source.
     
    Figure 3: Data source for AWS Glue crawler

  12. The crawler needs permissions to access the data store and create objects in the Data Catalog. To configure these permissions, choose Create an IAM role. The AWS Identity and Access Management (IAM) role name starts with AWSGlueServiceRole-, and in the field, you enter the last part of the role name. Enter sbomcrawler, and then choose Next.
  13. Crawlers create tables in your Data Catalog. Tables are contained in a database in the Data Catalog. To create a database, choose Add database. In the pop-up window, enter sbom-db for the database name, and then choose Create.
  14. Verify the choices you made in the Add crawler wizard. If you see any mistakes, you can choose Back to return to previous pages and make changes. After you’ve reviewed the information, choose Finish to create the crawler.
    Figure 4: Creation of the AWS Glue crawler

  15. Select the newly created crawler and choose Run.
  16. After the crawler runs successfully, verify that the table is created and the data schema is populated.
     
    Figure 5: Table populated from the AWS Glue crawler

Set up Amazon Athena

Amazon Athena performs the initial data exploration and validation. Athena is a serverless interactive analytics service built on open source frameworks that supports open-table and file formats. Athena provides a simplified, flexible way to analyze data in sources like Amazon S3 by using standard SQL queries. If you are proficient in SQL, you can query the data source directly; however, not everyone is familiar with SQL. In this section, you run a sample query and initialize the service so that it can be used in QuickSight later.

To start using Amazon Athena

  1. In the AWS Management Console, navigate to the Athena console.
  2. For Database, select sbom-db (or select the database you created earlier in the crawler).
  3. Navigate to the Settings tab located at the top right corner of the console. For Query result location, select the Athena S3 bucket created from the CloudFormation template, sbom-inspector-<ACCOUNT-ID>-athena.
  4. Keep the defaults for the rest of the settings. You can now return to the Query Editor and start writing and running your queries on the sbom-db database.

You can use the following sample query.

select package, packageversion, cve, sha, imagearn from sbom
left join cve
using (sha, package, packageversion)
where cve is not null;

Your Athena console should look similar to the screenshot in Figure 6.

Figure 6: Sample query with Amazon Athena

This query joins the two tables and selects only the packages with identified CVEs. Alternatively, you can query for specific packages or identify the most common packages used in your organization (a scripted example follows the sample output).

Sample output:

# package packageversion cve sha imagearn
<PACKAGE_NAME> <PACKAGE_VERSION> <CVE> <IMAGE_SHA> <ECR_IMAGE_ARN>

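As an example of the alternative analysis mentioned above, the following Python sketch runs a "most common packages" query through the Athena API and prints the results. The database and output bucket names are the ones used in this walkthrough.

import time

import boto3

athena = boto3.client("athena")

QUERY = """
SELECT package, count(DISTINCT imagearn) AS image_count
FROM sbom
GROUP BY package
ORDER BY image_count DESC
LIMIT 20;
"""

# Start the query and poll until it finishes.
execution_id = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "sbom-db"},
    ResultConfiguration={"OutputLocation": "s3://sbom-inspector-<ACCOUNT-ID>-athena/"},
)["QueryExecutionId"]

while True:
    state = athena.get_query_execution(QueryExecutionId=execution_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    # Skip the header row and print each result row.
    for row in athena.get_query_results(QueryExecutionId=execution_id)["ResultSet"]["Rows"][1:]:
        print([col.get("VarCharValue") for col in row["Data"]])
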
Visualize data with Amazon QuickSight

Amazon QuickSight is a serverless business intelligence service that is designed for the cloud. In this post, it serves as a dashboard that allows business users who are unfamiliar with SQL to identify zero-day vulnerabilities. This can also reduce the operational effort and time of having to look through several JSON documents to identify a single package across your image repositories. You can then share the dashboard across teams without having to share the underlying data.

QuickSight SPICE (Super-fast, Parallel, In-memory Calculation Engine) is an in-memory engine that QuickSight uses to perform advanced calculations. In a large organization where you could have millions of SBOM records stored in S3, importing your data into SPICE helps to reduce the time to process and serve the data. You can also use the feature to perform a scheduled refresh to obtain the latest data from S3.
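
In addition to scheduled refreshes, you can trigger a SPICE refresh on demand through the QuickSight API, for example right after a new batch of SBOMs lands in S3. The following Python sketch starts such an ingestion; the dataset ID is a placeholder.

import uuid

import boto3

quicksight = boto3.client("quicksight")
account_id = boto3.client("sts").get_caller_identity()["Account"]

# Start a SPICE refresh (ingestion) for the combined sbom dataset; the dataset ID is a placeholder.
response = quicksight.create_ingestion(
    AwsAccountId=account_id,
    DataSetId="<SBOM-DATASET-ID>",
    IngestionId=str(uuid.uuid4()),
)
print(response["IngestionStatus"])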

QuickSight also has a feature called QuickSight Q. With QuickSight Q, you can use natural language to interact with your data. If this is the first time you are initializing QuickSight, subscribe to QuickSight and select Enterprise + Q. It will take roughly 20–30 minutes to initialize for the first time. Otherwise, if you are already using QuickSight, you will need to enable QuickSight Q by subscribing to it in the QuickSight console.

Finally, in QuickSight you can select different data sources, such as Amazon S3 and Athena, to create custom visualizations. In this post, we will use the two Athena tables as the data source to create a dashboard to keep track of the packages used in your organization and the resulting CVEs that come with them.

Prerequisites for setting up the QuickSight dashboard

In this process, you use the command line interface (CLI) to create the QuickSight dashboard from a pre-provisioned template and to grant the necessary permissions for QuickSight to access the data source. You will need the following:

  • AWS Command Line Interface (AWS CLI) programmatic access with read and write permissions to QuickSight.
  • A QuickSight + Q subscription (only if you want to use the Q feature).
  • QuickSight permissions to Amazon S3 and Athena (enable these through the QuickSight security and permissions interface).
  • Set the default AWS Region where you want to deploy the QuickSight dashboard. This post assumes that you’re using the us-east-1 Region.

Create datasets

In QuickSight, create two datasets, one for the sbom table and another for the cve table.

  1. In the QuickSight console, select the Dataset tab.
  2. Choose Create dataset, and then select the Athena data source.
  3. Name the data source sbom and choose Create data source.
  4. Select the sbom table.
  5. Choose Visualize to complete the dataset creation. (Delete the analyses automatically created for you because you will create your own analyses afterwards.)
  6. Navigate back to the main QuickSight page and repeat steps 1–4 for the cve dataset.

Merge datasets

Next, merge the two datasets to create the combined dataset that you will use for the dashboard.

  1. On the Datasets tab, edit the sbom dataset and add the cve dataset.
  2. Set three join clauses, as follows:
    1. Sha : Sha
    2. Package : Package
    3. Packageversion : Packageversion
  3. Perform a left merge, which will append the cve ID to the package and package version in the sbom dataset.
     
    Figure 7: Combining the sbom and cve datasets

Next, you will create a dashboard based on the combined sbom dataset.

Prepare configuration files

In your terminal, export the following variables. Substitute <QuickSight username> in the QS_USER_ARN variable with your own username, which can be found in the Amazon QuickSight console.

export ACCOUNT_ID=$(aws sts get-caller-identity --output text --query Account)
export TEMPLATE_ID="sbom_dashboard"
export QS_USER_ARN=$(aws quicksight describe-user --aws-account-id $ACCOUNT_ID --namespace default --user-name <QuickSight username> | jq .User.Arn)
export QS_DATA_ARN=$(aws quicksight search-data-sets --aws-account-id $ACCOUNT_ID --filters Name="DATASET_NAME",Operator="StringLike",Value="sbom" | jq .DataSetSummaries[0].Arn)

Validate that the variables are set properly. This is required for you to move on to the next step; otherwise you will run into errors.

[ -n "$ACCOUNT_ID" ] && echo "ACCOUNT_ID is $ACCOUNT_ID" || echo "ACCOUNT_ID is not set"
[ -n "$TEMPLATE_ID" ] && echo "TEMPLATE_ID is $TEMPLATE_ID" || echo "TEMPLATE_ID is not set"
[ -n "$QS_USER_ARN" ] && echo "QUICKSIGHT USER ARN is $QS_USER_ARN" || echo "QUICKSIGHT USER ARN is not set"
[ -n "$QS_DATA_ARN" ] && echo "QUICKSIGHT DATA ARN is $QS_DATA_ARN" || echo "QUICKSIGHT DATA ARN is not set"

Next, use the following commands to create the dashboard from a predefined template and to define the permissions needed for the user to view the QuickSight dashboard.

cat << EOF > ./dashboard.json
{
    "SourceTemplate": {
      "DataSetReferences": [
        {
          "DataSetPlaceholder": "sbom",
          "DataSetArn": $QS_DATA_ARN
        }
      ],
      "Arn": "arn:aws:quicksight:us-east-1:293424211206:template/sbom_qs_template"
    }
}
EOF

cat << EOF > ./dashboardpermissions.json
[
    {
      "Principal": $QS_USER_ARN,
      "Actions": [
        "quicksight:DescribeDashboard",
        "quicksight:ListDashboardVersions",
        "quicksight:UpdateDashboardPermissions",
        "quicksight:QueryDashboard",
        "quicksight:UpdateDashboard",
        "quicksight:DeleteDashboard",
        "quicksight:DescribeDashboardPermissions",
        "quicksight:UpdateDashboardPublishedVersion"
      ]
    }
]
EOF

Run the following command to create the dashboard in your QuickSight console.

aws quicksight create-dashboard --aws-account-id $ACCOUNT_ID --dashboard-id $ACCOUNT_ID --name sbom-dashboard --source-entity file://dashboard.json

Note: Run the following describe-dashboard command, and confirm that the response contains a status code of 200. The 200-status code means that the dashboard exists.

aws quicksight describe-dashboard --aws-account-id $ACCOUNT_ID --dashboard-id $ACCOUNT_ID

Use the following update-dashboard-permissions AWS CLI command to grant the appropriate permissions to QuickSight users.

aws quicksight update-dashboard-permissions --aws-account-id $ACCOUNT_ID --dashboard-id $ACCOUNT_ID --grant-permissions file://dashboardpermissions.json

You should now be able to see the dashboard in your QuickSight console, similar to the one in Figure 8. It’s an interactive dashboard that shows you the number of vulnerable packages you have in your repositories and the specific CVEs that come with them. You can navigate to the specific image by selecting the CVE (middle right bar chart) or list images with a specific vulnerable package (bottom right bar chart).

Note: You won’t see the exact same graph as in Figure 8. It will change according to the image you pushed in.

Figure 8: QuickSight dashboard containing SBOM information

Alternatively, you can use QuickSight Q to extract the same information from your dataset through natural language. You will need to create a topic and add the dataset you added earlier. For detailed information on how to create a topic, see the Amazon QuickSight User Guide. After QuickSight Q has completed indexing the dataset, you can start to ask questions about your data.

Figure 9: Natural language query with QuickSight Q

Conclusion

This post discussed how you can use Amazon Inspector to export SBOMs to improve software supply chain transparency. Container SBOM export should be part of your supply chain mitigation strategy and monitored in an automated manner at scale.

Although it is a good practice to generate SBOMs, they would provide little value if no further analysis were done on them. This solution enables you to visualize your SBOM data through a dashboard and natural language, providing better visibility into your security posture. Additionally, this solution is entirely serverless, meaning there are no agents or sidecars to set up.

To learn more about exporting SBOMs with Amazon Inspector, see the Amazon Inspector User Guide.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Jason Ng

Jason is a Cloud Sales Center Solutions Architect at AWS. He works with enterprise and independent software vendor (ISV) greenfield customers in ASEAN countries and is part of the Containers Technical Field Community (TFC). He enjoys helping customers modernize their applications, drive growth, and reduce total cost of ownership.

How to develop an Amazon Security Lake POC

Post Syndicated from Anna McAbee original https://aws.amazon.com/blogs/security/how-to-develop-an-amazon-security-lake-poc/

You can use Amazon Security Lake to simplify log data collection and retention for Amazon Web Services (AWS) and non-AWS data sources. Getting the most out of your implementation requires proper planning.

In this post, we will show you how to plan and implement a proof of concept (POC) for Security Lake to help you determine the functionality and value of Security Lake in your environment, so that your team can confidently design and implement in production. We will walk you through the following steps:

  1. Understand the functionality and value of Security Lake
  2. Determine success criteria for the POC
  3. Define your Security Lake configuration
  4. Prepare for deployment
  5. Enable Security Lake
  6. Validate deployment

Understand the functionality of Security Lake

Figure 1 summarizes the main features of Security Lake and the context of how to use it:

Figure 1: Overview of Security Lake functionality

As shown in the figure, Security Lake ingests and normalizes logs from data sources such as AWS services, AWS Partner sources, and custom sources. Security Lake also manages the lifecycle, orchestration, and subscribers. Subscribers can be AWS services, such as Amazon Athena, or AWS Partner subscribers.

There are four primary functions that Security Lake provides:

  • Centralize visibility to your data from AWS environments, SaaS providers, on-premises, and other cloud data sources — You can collect log sources from AWS services such as AWS CloudTrail management events, Amazon Simple Storage Service (Amazon S3) data events, AWS Lambda data events, Amazon Route 53 Resolver logs, VPC Flow Logs, and AWS Security Hub findings, in addition to log sources from on-premises, other cloud services, SaaS applications, and custom sources. Security Lake automatically aggregates the security data across AWS Regions and accounts.
  • Normalize your security data to an open standard — Security Lake normalizes log sources into a common schema, the Open Cybersecurity Schema Framework (OCSF), and stores them in compressed Parquet files.
  • Use your preferred analytics tools to analyze your security data — You can use AWS tools, such as Athena and Amazon OpenSearch Service, or you can utilize external security tools to analyze the data in Security Lake.
  • Optimize and manage your security data for more efficient storage and query — Security Lake manages the lifecycle of your data with customizable retention settings with automated storage tiering to help provide more cost-effective storage.

Determine success criteria

By establishing success criteria, you can assess whether Security Lake has helped address the challenges that you are facing. Some example success criteria include:

  • I need to centrally set up and store AWS logs across my organization in AWS Organizations for multiple log sources.
  • I need to more efficiently collect VPC Flow Logs in my organization and analyze them in my security information and event management (SIEM) solution.
  • I want to use OpenSearch Service to replace my on-premises SIEM.
  • I want to collect AWS log sources and custom sources for machine learning with Amazon SageMaker.
  • I need to establish a dashboard in Amazon QuickSight to visualize my Security Hub findings and custom log source data.

Review your success criteria to make sure that your goals are realistic given your timeframe and potential constraints that are specific to your organization. For example, do you have full control over the creation of AWS services that are deployed in an organization? Do you have resources that can dedicate time to implement and test? Is this time convenient for relevant stakeholders to evaluate the service?

The timeframe of your POC will depend on your answers to these questions.

Important: Security Lake has a 15-day free trial per account that you use from the time that you enable Security Lake. This is the best way to estimate the costs for each Region throughout the trial, which is an important consideration when you configure your POC.

Define your Security Lake configuration

After you establish your success criteria, you should define your desired Security Lake configuration. Some important decisions include the following:

  • Determine AWS log sources — Decide which AWS log sources to collect. For information about the available options, see Collecting data from AWS services.
  • Determine third-party log sources — Decide if you want to include non-AWS service logs as sources in your POC. For more information about your options, see Third-party integrations with Security Lake; the integrations listed as “Source” can send logs to Security Lake.

    Note: You can add third-party integrations after the POC or in a second phase of the POC. Pre-planning will be required to make sure that you can get these set up during the 15-day free trial. Third-party integrations usually take more time to set up than AWS service logs.

  • Select a delegated administrator – Identify which account will serve as the delegated administrator. Make sure that you have the appropriate permissions from the organization admin account to identify and enable the account that will be your Security Lake delegated administrator. This account will be the location for the S3 buckets with your security data and where you centrally configure Security Lake. The AWS Security Reference Architecture (AWS SRA) recommends that you use the AWS logging account for this purpose. In addition, make sure to review Important considerations for delegated Security Lake administrators.
  • Select accounts in scope — Define which accounts to collect data from. To get the most realistic estimate of the cost of Security Lake, enable all accounts across your organization during the free trial.
  • Determine analytics tool — Determine if you want to use native AWS analytics tools, such as Athena and OpenSearch Service, or an existing SIEM, where the SIEM is a subscriber to Security Lake.
  • Define log retention and Regions — Define your log retention requirements and Regional restrictions or considerations.

Prepare for deployment

After you determine your success criteria and your Security Lake configuration, you should have an idea of your stakeholders, desired state, and timeframe. Now you need to prepare for deployment. In this step, you should complete as much as possible before you deploy Security Lake. The following are some steps to take:

  • Create a project plan and timeline so that everyone involved understands what success looks like and what the scope and timeline are.
  • Define the relevant stakeholders and consumers of the Security Lake data. Some common stakeholders include security operations center (SOC) analysts, incident responders, security engineers, cloud engineers, finance, and others.
  • Define who is responsible, accountable, consulted, and informed during the deployment. Make sure that team members understand their roles.
  • Make sure that you have access in your management account to delegate an administrator. For further details, see IAM permissions required to designate the delegated administrator.
  • Consider other technical prerequisites that you need to accomplish. For example, if you need roles in addition to what Security Lake creates for custom extract, transform, and load (ETL) pipelines for custom sources, can you work with the team in charge of that process before the POC?

Enable Security Lake

The next step is to enable Security Lake in your environment and configure your sources and subscribers.

  1. Deploy Security Lake across the Regions, accounts, and AWS log sources that you previously defined.
  2. Configure custom sources that are in scope for your POC.
  3. Configure analytics tools in scope for your POC.

Validate deployment

The final step is to confirm that you have configured Security Lake and additional components, validate that everything is working as intended, and evaluate the solution against your success criteria.

  • Validate log collection — Verify that you are collecting the log sources that you configured. To do this, check the S3 buckets in the delegated administrator account for the logs (see the sketch after this list).
  • Validate analytics tool — Verify that you can analyze the log sources in your analytics tool of choice. If you don’t want to configure additional analytics tooling, you can use Athena, which is configured when you set up Security Lake. For sample Athena queries, see Amazon Security Lake Example Queries on GitHub and Security Lake queries in the documentation.
  • Obtain a cost estimate — In the Security Lake console, you can review a usage page to verify that the cost of Security Lake in your environment aligns with your expectations and budgets.
  • Assess success criteria — Determine if you achieved the success criteria that you defined at the beginning of the project.
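
The following Python sketch shows the kind of quick check referred to in the first bullet: it lists objects under a log source prefix in the delegated administrator's Security Lake bucket. The bucket name and prefix are placeholders; copy the actual values from the Security Lake console.

import boto3

s3 = boto3.client("s3")

# Placeholders: copy the bucket name and log source prefix from the Security Lake console.
BUCKET = "<security-lake-bucket-name>"
PREFIX = "<log-source-prefix>"

resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX, MaxKeys=25)
for obj in resp.get("Contents", []):
    print(obj["LastModified"], obj["Size"], obj["Key"])

if resp.get("KeyCount", 0) == 0:
    print("No objects found yet; check the source configuration or wait for the next delivery.")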

Next steps

Next steps will largely depend on whether you decide to move forward with Security Lake.

  • Determine if you have the approval and budget to use Security Lake.
  • Expand to other data sources that can help you provide more security outcomes for your business.
  • Configure S3 lifecycle policies to efficiently store logs long term based on your requirements.
  • Let other teams know that they can subscribe to Security Lake to use the log data for their own purposes. For example, a development team that gets access to CloudTrail through Security Lake can analyze the logs to understand the permissions needed for an application.

Conclusion

In this blog post, we showed you how to plan and implement a Security Lake POC. You learned how to do so through phases, including defining success criteria, configuring Security Lake, and validating that Security Lake meets your business needs.

This guide will help you, as a customer, run a successful proof of value (POV) with Security Lake. It guides you in assessing the value of Security Lake and the factors to consider when deciding to implement its current features.

Further resources

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Anna McAbee

Anna is a Security Specialist Solutions Architect focused on threat detection and incident response at AWS. Before AWS, she worked as an AWS customer in financial services on both the offensive and defensive sides of security. Outside of work, Anna enjoys cheering on the Florida Gators football team, wine tasting, and traveling the world.

Marshall Jones

Marshall is a Worldwide Security Specialist Solutions Architect at AWS. His background is in AWS consulting and security architecture, focused on a variety of security domains including edge, threat detection, and compliance. Today, he is focused on helping enterprise AWS customers adopt and operationalize AWS security services to increase security effectiveness and reduce risk.

Marc Luescher

Marc is a Senior Solutions Architect helping enterprise customers be successful, focusing strongly on threat detection, incident response, and data protection. His background is in networking, security, and observability. Previously, he worked in technical architecture and security hands-on positions within the healthcare sector as an AWS customer. Outside of work, Marc enjoys his 3 dogs, 4 cats, and 20+ chickens.

Use Amazon OpenSearch Ingestion to migrate to Amazon OpenSearch Serverless

Post Syndicated from Muthu Pitchaimani original https://aws.amazon.com/blogs/big-data/use-amazon-opensearch-ingestion-to-migrate-to-amazon-opensearch-serverless/

Amazon OpenSearch Serverless is an on-demand auto scaling configuration for Amazon OpenSearch Service. Since its release, interest in OpenSearch Serverless has been steadily growing. Customers prefer to let the service manage its capacity automatically rather than having to manually provision capacity. Until now, customers have had to rely on custom code or third-party solutions to move data between provisioned OpenSearch Service domains and OpenSearch Serverless.

We recently introduced a feature with Amazon OpenSearch Ingestion (OSI) to make this migration even more effortless. OSI is a fully managed, serverless data collector that delivers real-time log, metric, and trace data to OpenSearch Service domains and OpenSearch Serverless collections.

In this post, we outline the steps to migrate data between provisioned OpenSearch Service domains and OpenSearch Serverless. Migration of metadata such as security roles and dashboard objects will be covered in a subsequent post.

Solution overview

The following diagram shows the components needed to move data between provisioned OpenSearch Service domains and OpenSearch Serverless using OSI. You will use OSI with an OpenSearch Service domain as the source and an OpenSearch Serverless collection as the sink.

Prerequisites

Before getting started, complete the following steps to create the necessary resources:

  1. Create an AWS Identity and Access Management (IAM) role that the OpenSearch Ingestion pipeline will assume to write to the OpenSearch Serverless collection. This role needs to be specified in the sts_role_arn parameter of the pipeline configuration.
  2. Attach a permissions policy to the role to allow it to read data from the OpenSearch Service domain. The following is a sample policy with least privileges:
    {
       "Version":"2012-10-17",
       "Statement":[
          {
             "Effect":"Allow",
             "Action":"es:ESHttpGet",
             "Resource":[
                "arn:aws:es:us-east-1:{account-id}:domain/{domain-name}/",
                "arn:aws:es:us-east-1:{account-id}:domain/{domain-name}/_cat/indices",
                "arn:aws:es:us-east-1:{account-id}:domain/{domain-name}/_search",
                "arn:aws:es:us-east-1:{account-id}:domain/{domain-name}/_search/scroll",
                "arn:aws:es:us-east-1:{account-id}:domain/{domain-name}/*/_search"
             ]
          },
          {
             "Effect":"Allow",
             "Action":"es:ESHttpPost",
             "Resource":[
                "arn:aws:es:us-east-1:{account-id}:domain/{domain-name}/*/_search/point_in_time",
                "arn:aws:es:us-east-1:{account-id}:domain/{domain-name}/*/_search/scroll"
             ]
          },
          {
             "Effect":"Allow",
             "Action":"es:ESHttpDelete",
             "Resource":[
                "arn:aws:es:us-east-1:{account-id}:domain/{domain-name}/_search/point_in_time",
                "arn:aws:es:us-east-1:{account-id}:domain/{domain-name}/_search/scroll"
             ]
          }
       ]
    }

  3. Attach a permissions policy to the role to allow it to send data to the collection. The following is a sample policy with least privileges:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Action": [
            "aoss:BatchGetCollection",
            "aoss:APIAccessAll"
          ],
          "Effect": "Allow",
          "Resource": "arn:aws:aoss:{region}:{your-account-id}:collection/{collection-id}"
        },
        {
          "Action": [
            "aoss:CreateSecurityPolicy",
            "aoss:GetSecurityPolicy",
            "aoss:UpdateSecurityPolicy"
          ],
          "Effect": "Allow",
          "Resource": "*",
          "Condition": {
            "StringEquals": {
              "aoss:collection": "{collection-name}"
            }
          }
        }
      ]
    }

  4. Configure the role to assume the trust relationship, as follows:
    {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Principal": {
                        "Service": "osis-pipelines.amazonaws.com"
                    },
                    "Action": "sts:AssumeRole"
                }
            ]
        }

  5. It’s recommended to add the aws:SourceAccount and aws:SourceArn condition keys to the policy for protection against the confused deputy problem:
    "Condition": {
        "StringEquals": {
            "aws:SourceAccount": "{your-account-id}"
        },
        "ArnLike": {
            "aws:SourceArn": "arn:aws:osis:{region}:{your-account-id}:pipeline/*"
        }
    }

  6. Map the OpenSearch Ingestion pipeline role ARN as a backend role to the all_access role in the OpenSearch Service domain. This is a simplified example that uses the all_access role; for production scenarios, make sure to use a role with just enough permissions to read and write.
  7. Create an OpenSearch Serverless collection, which is where data will be ingested.
  8. Associate a data policy, as shown in the following code, to grant the OpenSearch Ingestion role permissions on the collection (a sketch for attaching this policy with the AWS SDK follows this list):
    [
      {
        "Rules": [
          {
            "Resource": [
              "index/collection-name/*"
            ],
            "Permission": [
              "aoss:CreateIndex",
              "aoss:UpdateIndex",
              "aoss:DescribeIndex",
              "aoss:WriteDocument"
            ],
            "ResourceType": "index"
          }
        ],
        "Principal": [
          "arn:aws:iam::{account-id}:role/pipeline-role"
        ],
        "Description": "Pipeline role access"
      }
    ]

  9. If the collection is defined as a VPC collection, you need to create a network policy and configure it in the ingestion pipeline.
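
To attach the data access policy from step 8 programmatically rather than in the console, you can use the OpenSearch Serverless API. The following Python sketch creates the policy shown above; the policy name, collection name, and role ARN are placeholders.

import json

import boto3

aoss = boto3.client("opensearchserverless")

# Placeholders: collection name and pipeline role ARN from the earlier steps.
collection_name = "<collection-name>"
pipeline_role_arn = "arn:aws:iam::{account-id}:role/pipeline-role"

data_policy = [
    {
        "Rules": [
            {
                "Resource": [f"index/{collection_name}/*"],
                "Permission": [
                    "aoss:CreateIndex",
                    "aoss:UpdateIndex",
                    "aoss:DescribeIndex",
                    "aoss:WriteDocument",
                ],
                "ResourceType": "index",
            }
        ],
        "Principal": [pipeline_role_arn],
        "Description": "Pipeline role access",
    }
]

aoss.create_access_policy(
    name="pipeline-data-access",
    type="data",
    policy=json.dumps(data_policy),
    description="Grants the OpenSearch Ingestion pipeline role access to the collection",
)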

Now you’re ready to move data from your provisioned domain to OpenSearch Serverless.

Move data from provisioned domains to Serverless

Set up Amazon OpenSearch Ingestion

To get started, you must have an active OpenSearch Service domain (source) and an OpenSearch Serverless collection (sink). Complete the following steps to set up an OpenSearch Ingestion pipeline for migration:

  1. On the OpenSearch Service console, choose Pipeline under Ingestion in the navigation pane.
  2. Choose Create a pipeline.
  3. For Pipeline name, enter a name (for example, octank-migration).
  4. For Pipeline capacity, you can define the minimum and maximum capacity to scale up the resources. For now, you can leave the default minimum as 1 and maximum as 4.
  5. For Configuration Blueprint, select AWS-OpenSearchDataMigrationPipeline.
  6. Update the following information for the source:
    1. Uncomment hosts and specify the endpoint of the existing OpenSearch Service endpoint.
    2. Uncomment distribution_version if your source cluster is an OpenSearch Service cluster with compatibility mode enabled; otherwise, leave it commented.
    3. Uncomment indices, include, index_name_regex, and add an index name or pattern that you want to migrate (for example, octank-iot-logs-2023.11.0*).
    4. Update region under aws where your source domain is (for example, us-west-2).
    5. Update sts_role_arn under aws to the role that has permission to read data from the OpenSearch Service domain (for example, arn:aws:iam::111122223333:role/osis-pipeline). This role should be added as a backend role within the OpenSearch Service security roles.
  7. Update the following information for the sink:
    1. Uncomment hosts and specify the endpoint of the existing OpenSearch Serverless endpoint.
    2. Update sts_role_arn under aws to the role that has permission to write data into the OpenSearch Serverless collection (for example, arn:aws:iam::111122223333:role/osis-pipeline). This role should be added as part of the data access policy in the OpenSearch Serverless collection.
    3. Update the serverless flag to be true.
    4. For index, you can leave it as default, which will get the metadata from the source index and write to the same name in the destination as of the sources. Alternatively, if you want to have a different index name at the destination, modify this value with your desired name.
    5. For document_id, you can get the ID from the document metadata in the source and use the same in the target. Note that custom document IDs are supported only for the SEARCH type of collection; if you have your collection as TIMESERIES or VECTORSEARCH, you should comment this line.
  8. Next, you can validate your pipeline to check the connectivity of source and sink to make sure the endpoint exists and is accessible.
  9. For Network settings, choose your preferred setting:
    1. Choose VPC access and select your VPC, subnet, and security group to set up the access privately.
    2. Choose Public to use public access. AWS recommends that you use a VPC endpoint for all production workloads, but for this walkthrough, select Public.
  10. For Log Publishing Option, you can either create a new Amazon CloudWatch group or use an existing CloudWatch group to write the ingestion logs. This provides access to information about errors and warnings raised during the operation, which can help during troubleshooting. For this walkthrough, choose Create new group.
  11. Choose Next, and verify the details you specified for your pipeline settings.
  12. Choose Create pipeline.

It should take a couple of minutes to create the ingestion pipeline.
The following graphic gives a quick demonstration of creating the OpenSearch Ingestion pipeline via the preceding steps.
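
If you prefer to script the preceding console steps, you can also create the pipeline through the OpenSearch Ingestion API. The following Python sketch passes a Data Prepper configuration body that mirrors the blueprint fields described above; treat the exact YAML key nesting as an assumption and start from the AWS-OpenSearchDataMigrationPipeline blueprint for the authoritative structure.

import boto3

osis = boto3.client("osis")

# Mirrors the blueprint fields from the console walkthrough; endpoints, role ARN, and index
# pattern are placeholders. Verify the key nesting against the blueprint before deploying.
pipeline_body = """
version: "2"
octank-migration:
  source:
    opensearch:
      hosts: ["https://<source-domain-endpoint>"]
      indices:
        include:
          - index_name_regex: "octank-iot-logs-2023.11.0*"
      aws:
        region: "us-west-2"
        sts_role_arn: "arn:aws:iam::111122223333:role/osis-pipeline"
  sink:
    - opensearch:
        hosts: ["https://<collection-endpoint>.us-west-2.aoss.amazonaws.com"]
        aws:
          region: "us-west-2"
          sts_role_arn: "arn:aws:iam::111122223333:role/osis-pipeline"
          serverless: true
"""

osis.create_pipeline(
    PipelineName="octank-migration",
    MinUnits=1,
    MaxUnits=4,
    PipelineConfigurationBody=pipeline_body,
)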

Verify ingested data in the target OpenSearch Serverless collection

After the pipeline is created and active, log in to OpenSearch Dashboards for your OpenSearch Serverless collection and run the following command to list the indexes:

GET _cat/indices?v

The following graphic gives a quick demonstration of listing the indexes before and after the pipeline becomes active.

Conclusion

In this post, we saw how OpenSearch Ingestion can ingest data into an OpenSearch Serverless collection without the need for third-party solutions. With minimal data producer configuration, it automatically ingested data into the collection. OSI also allows you to transform or reindex data from Elasticsearch version 7.x before ingesting it into an OpenSearch Service domain or OpenSearch Serverless collection. OSI eliminates the need to provision, scale, or manage servers. AWS offers various resources for you to quickly start building pipelines using OpenSearch Ingestion. You can use various built-in pipeline integrations to quickly ingest data from Amazon DynamoDB, Amazon Managed Streaming for Apache Kafka (Amazon MSK), Amazon Security Lake, Fluent Bit, and many more. The available OpenSearch Ingestion blueprints enable you to build data pipelines with minimal configuration changes.


About the Authors

Muthu Pitchaimani is a Search Specialist with Amazon OpenSearch Service. He builds large-scale search applications and solutions. Muthu is interested in the topics of networking and security, and is based out of Austin, Texas.

Prashant Agrawal is a Sr. Search Specialist Solutions Architect with Amazon OpenSearch Service. He works closely with customers to help them migrate their workloads to the cloud and helps existing customers fine-tune their clusters to achieve better performance and save on cost. Before joining AWS, he helped various customers use OpenSearch and Elasticsearch for their search and log analytics use cases. When not working, you can find him traveling and exploring new places. In short, he likes doing Eat → Travel → Repeat.

Rahul Sharma is a Technical Account Manager at Amazon Web Services. He is passionate about data technologies that help organizations leverage data as a strategic asset, and is based out of New York City, New York.

Enable multi-admin support to manage security policies at scale with AWS Firewall Manager

Post Syndicated from Mun Hossain original https://aws.amazon.com/blogs/security/enable-multi-admin-support-to-manage-security-policies-at-scale-with-aws-firewall-manager/

The management of security services across organizations has evolved over the years, and can vary depending on the size of your organization, the type of industry, the number of services to be administered, and compliance regulations and legislation. When compliance standards require you to set up scoped administrative control of event monitoring and auditing, we find that single administrator support on management consoles can present several challenges for large enterprises. In this blog post, I’ll dive deep into these security policy management challenges and show how you can optimize your security operations at scale by using AWS Firewall Manager to support multiple administrators.

These are some of the use cases and challenges faced by large enterprise organizations when scaling their security operations:

Policy enforcement across complex organizational boundaries

Large organizations tend to be divided into multiple organizational units, each of which represents a function within the organization. Risk appetite, and therefore security policy, can vary dramatically between organizational units. For example, organizations may support two types of users: central administrators and app developers, both of whom can administer security policy but might do so at different levels of granularity. The central admin applies a baseline and relatively generic policy for all accounts, while the app developer can be made an admin for specific accounts and be allowed to create custom rules for the overall policy. A single administrator interface limits the ability for multiple administrators to enforce differing policies for the organizational unit to which they are assigned.

Lack of adequate separation across services

The benefit of centralized management is that you can enforce a centralized policy across multiple services that the management console supports. However, organizations might have different administrators for each service. For example, the team that manages the firewall could be different than the team that manages a web application firewall solution. Aggregating administrative access that is confined to a single administrator might not adequately conform to the way organizations have services mapped to administrators.

Auditing and compliance

Most security frameworks call for auditing procedures, to gain visibility into user access, types of modifications to configurations, timestamps of incremental changes, and logs for periods of downtime. An organization might want only specific administrators to have access to certain functions. For example, each administrator might have specific compliance scope boundaries based on their knowledge of a particular compliance standard, thereby distributing the responsibility for implementation of compliance measures. Single administrator access greatly reduces the ability to discern the actions of different administrators in that single account, making auditing unnecessarily complex.

Availability

Redundancy and resiliency are regarded as baseline requirements for security operations. Organizations want to ensure that if a primary administrator is locked out of a single account for any reason, other legitimate users are not affected in the same way. Single administrator access, in contrast, can lock out legitimate users from performing critical and time-sensitive actions on the management console.

Security risks

In a single administrator setting, it is not possible to enforce the principle of least privilege, because multiple operators might share the same level of access to the administrator account. This means that certain administrators could be granted broader access than their function in the organization requires.

What is multi-admin support?

Multi-admin support for Firewall Manager allows customers with multiple organizational units (OUs) and accounts to create up to 10 Firewall Manager administrator accounts from AWS Organizations to manage their firewall policies. You can delegate responsibility for firewall administration at a granular scope by restricting access based on OU, account, policy type, and AWS Region, thereby enabling policy management tasks to be implemented more effectively.

Multi-admin support provides you the ability to use different administrator accounts to create administrative scopes for different parameters. Examples of these administrative scopes are included in the following table.

Administrator            Scope
Default Administrator    Full Scope (Default)
Administrator 1          OU = "Test 1"
Administrator 2          Account IDs = "123456789, 987654321"
Administrator 3          Policy-Type = "Security Group"
Administrator 4          Region = "us-east-2"

Benefits of multi-admin support

Multi-admin support helps alleviate many of the challenges just discussed by allowing administrators the flexibility to implement custom configurations based on job functions, while enforcing the principle of least privilege to help ensure that corporate policy and compliance requirements are followed. The following are some of the key benefits of multi-admin support:

Improved security

Security is enhanced, given that the principle of least privilege can be enforced in a multi-administrator access environment. This is because the different administrators using Firewall Manager will be using delegated privileges that are appropriate for the level of access they are permitted. The result is that the scope for user errors, intentional errors, and unauthorized changes can be significantly reduced. Additionally, you attain an added level of accountability for administrators.

Autonomy of job functions

Companies with organizational units that have separate administrators are afforded greater levels of autonomy within their AWS Organizations accounts. The result is an increase in flexibility, where concurrent users can perform very different security functions.

Compliance benefits

It is easier to meet auditing requirements based on compliance standards in multi-admin accounts, because there is a greater level of visibility into user access and the functions performed on the services, compared to a workflow in which one omnipotent admin approves all policies. This can simplify routine audits through the generation of reports that detail the chronology of security changes that are implemented by specific admins over time.

Administrator availability

Multi-admin management support helps avoid the limitations of having a single point of access and enhances availability by providing multiple administrators with their own levels of access. This can result in fewer disruptions, especially during periods that require time-sensitive changes to be made to security configurations.

Integration with AWS Organizations

You can enable trusted access using either the Firewall Manager console or the AWS Organizations console. To do this, you sign in with your AWS Organizations management account and configure an account allocated for security tooling within the organization as the Firewall Manager administrator account. After this is done, subsequent multi-admin Firewall Manager operations can also be performed using AWS APIs. With accounts in an organization, you can quickly allocate resources, group multiple accounts, and apply governance policies to accounts or groups. This reduces operational overhead for services that require cross-account management.
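
As an example of those API operations, the following Python sketch delegates an additional Firewall Manager administrator whose scope is limited to one OU, one policy type, and one Region. The AdminScope field names shown are assumptions based on the PutAdminAccount API; confirm them against the current Firewall Manager API reference before use.

import boto3

# Run from the AWS Organizations management account (or the Firewall Manager default administrator).
fms = boto3.client("fms")

# Field names inside AdminScope are assumptions for illustration; check the PutAdminAccount API reference.
fms.put_admin_account(
    AdminAccount="111122223333",  # placeholder account ID
    AdminScope={
        "OrganizationalUnitScope": {"OrganizationalUnits": ["ou-abcd-11111111"], "AllOrganizationalUnitsEnabled": False},
        "PolicyTypeScope": {"PolicyTypes": ["SECURITY_GROUPS_COMMON"], "AllPolicyTypesEnabled": False},
        "RegionScope": {"Regions": ["us-east-2"], "AllRegionsEnabled": False},
    },
)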

Key use cases

Multi-admin support in Firewall Manager unlocks several use cases pertaining to admin role-based access. The key use cases are summarized here.

Role-based access

Multi-admin support allows for different admin roles to be defined based on the job function of the administrator, relative to the service being managed. For example, an administrator could be tasked to manage network firewalls to protect their VPCs, and a different administrator could be tasked to manage web application firewalls (AWS WAF), both using Firewall Manager.

User tracking and accountability

In a multi-admin configuration environment, each Firewall Manager administrator’s activities are logged and recorded according to corporate compliance standards. This is useful when dealing with the troubleshooting of security incidents, and for compliance with auditing standards.

Compliance with security frameworks

Regulations specific to a particular industry, such as Payment Card Industry (PCI), and industry-specific legislation, such as HIPAA, require restricted access, control, and separation of tasks for different job functions. Failure to adhere to such standards could result in penalties. With administrative scope extending to policy types, customers can assign responsibility for managing particular firewall policies according to user role guidelines, as specified in compliance frameworks.

Region-based privileges

Many state or federal frameworks, such as the California Consumer Privacy Act (CCPA), require that admins adhere to region-specific requirements, such as data sovereignty or privacy rules. Multi-admin Firewall Manager support helps organizations adopt these frameworks by making it easier to assign admins who are familiar with the regulations of a particular region to that region.

Figure 1: Use cases for multi-admin support on AWS Firewall Manager

How to implement multi-admin support with Firewall Manager

To configure multi-admin support on Firewall Manager, use the following steps:

  1. In the AWS Organizations console of the organization’s management account, expand the Root folder to view the accounts in the organization. Select the default administrator account that will be used to delegate Firewall Manager administrators. The default administrator account should be a dedicated security account separate from the AWS Organizations management account, such as a Security Tooling account.
     
    Figure 2: Overview of the AWS Organizations console

  2. Navigate to Firewall Manager and in the left navigation menu, select Settings.
     
    Figure 3: AWS Firewall Manager settings to update policy types

  3. In Settings, choose an account. Under Policy types, select AWS Network Firewall to allow an admin to manage a specific firewall across accounts and across Regions. Select Edit to show the Details menu in Figure 4.
     
    Figure 4: Select AWS Network Firewall as a policy type that can be managed by this administration account

    The results of your selection are shown in Figure 5. The admin has been granted privileges to set AWS Network Firewall policy across all Regions and all accounts.
     

    Figure 5: The admin has been granted privileges to set Network Firewall policy across all Regions and all accounts

  4. In this second use case, you will identify a number of sub-accounts that the admin should be restricted to. As shown in Figure 6, there are no sub-accounts or OUs that the admin is restricted to by default until you choose Edit and select them.
     
    Figure 6: The administrative scope details for the admin

    To achieve this, choose Edit, and then add the sub-accounts or the OU that the admin should be restricted to, as shown in Figure 7.
     

    Figure 7: Add multiple sub-accounts or an OU that you need the admin restricted to

  5. The third use case pertains to granting the admin privileges in a particular AWS Region. In this case, you go to the Edit administrator account page once more, but this time, for Regions, select US West (N. California) to restrict admin privileges to only this selected Region. A programmatic equivalent of steps 3 through 5 is sketched after these steps.
     
    Figure 8: Restricting admin privileges only to the US West (N. California) Region
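
The following sketch shows a programmatic equivalent of steps 3 through 5, using the AWS SDK for Python (Boto3) and the Firewall Manager PutAdminAccount API. The account and Region values are placeholders, and the AdminScope field names shown here are assumptions to verify against the current API reference:

import boto3

fms = boto3.client("fms", region_name="us-east-1")  # assumption: US East (N. Virginia) for admin APIs

# Delegate an admin whose scope is limited to Network Firewall policies,
# two spoke accounts, and the US West (N. California) Region.
fms.put_admin_account(
    AdminAccount="111122223333",  # placeholder delegated admin account
    AdminScope={
        "PolicyTypeScope": {
            "PolicyTypes": ["NETWORK_FIREWALL"],
            "AllPolicyTypesEnabled": False,
        },
        "AccountScope": {
            "Accounts": ["444455556666", "777788889999"],  # placeholder spoke accounts
            "AllAccountsEnabled": False,
            "ExcludeSpecifiedAccounts": False,
        },
        "RegionScope": {
            "Regions": ["us-west-1"],
            "AllRegionsEnabled": False,
        },
    },
)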

Conclusion

Large enterprises need strategies for operationalizing security policy management so that they can enforce policy across organizational boundaries, deal with policy changes across security services, and adhere to auditing and compliance requirements. Multi-admin support in Firewall Manager provides a framework that admins can use to organize their workflow across job roles, to help maintain appropriate levels of security while providing the autonomy that admins desire.

You can get started using the multi-admin feature with Firewall Manager using the AWS Management Console. To learn more, refer to the service documentation.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Mun Hossain

Mun is a Principal Security Service Specialist at AWS. Mun sets go-to-market strategies and prioritizes customer signals that contribute to service roadmap direction. Before joining AWS, Mun was a Senior Director of Product Management for a wide range of cybersecurity products at Cisco. Mun holds an MS in Cybersecurity Risk and Strategy and an MBA.

How to use Regional AWS STS endpoints

Post Syndicated from Darius Januskis original https://aws.amazon.com/blogs/security/how-to-use-regional-aws-sts-endpoints/

This blog post provides recommendations that you can use to help improve resiliency in the unlikely event of disrupted availability of the global (now legacy) AWS Security Token Service (AWS STS) endpoint. Although the global (legacy) AWS STS endpoint https://sts.amazonaws.com is highly available, it’s hosted in a single AWS Region—US East (N. Virginia)—and like other endpoints, it doesn’t provide automatic failover to endpoints in other Regions. In this post I will show you how to use Regional AWS STS endpoints in your configurations to improve the performance and resiliency of your workloads.

For authentication, it’s best to use temporary credentials instead of long-term credentials to help reduce risks, such as inadvertent disclosure, sharing, or theft of credentials. With AWS STS, trusted users can request temporary, limited-privilege credentials to access AWS resources.

Temporary credentials include an access key pair and a session token. The access key pair consists of an access key ID and a secret key. AWS STS generates temporary security credentials dynamically and provides them to the user when requested, which eliminates the need for long-term storage. Temporary security credentials have a limited lifetime so you don’t have to manage or rotate them.

To get these credentials, you can use several different methods:

Figure 1: Methods to request credentials from AWS STS

Global (legacy) and Regional AWS STS endpoints

To connect programmatically to an AWS service, you use an endpoint. An endpoint is the URL of the entry point for AWS STS.

AWS STS provides Regional endpoints in every Region. AWS initially built AWS STS with a global endpoint (now legacy) https://sts.amazonaws.com, which is hosted in the US East (N. Virginia) Region (us-east-1). Regional AWS STS endpoints are activated by default for Regions that are enabled by default in your AWS account. For example, https://sts.us-east-2.amazonaws.com is the US East (Ohio) Regional endpoint. By default, AWS services use Regional AWS STS endpoints. For example, IAM Roles Anywhere uses the Regional STS endpoint that corresponds to the trust anchor. For a complete list of AWS STS endpoints for each Region, see AWS Security Token Service endpoints and quotas. You can’t activate an AWS STS endpoint in a Region that is disabled. For more information on which AWS STS endpoints are activated by default and which endpoints you can activate or deactivate, see Regions and endpoints.

As noted previously, the global (legacy) AWS STS endpoint https://sts.amazonaws.com is hosted in a single Region — US East (N. Virginia) — and like other endpoints, it doesn’t provide automatic failover to endpoints in other Regions. If your workloads on AWS or outside of AWS are configured to use the global (legacy) AWS STS endpoint https://sts.amazonaws.com, you introduce a dependency on a single Region: US East (N. Virginia). In the unlikely event that the endpoint becomes unavailable in that Region or connectivity between your resources and that Region is lost, your workloads won’t be able to use AWS STS to retrieve temporary credentials, which poses an availability risk to your workloads.

AWS recommends that you use Regional AWS STS endpoints (https://sts.<region-name>.amazonaws.com) instead of the global (legacy) AWS STS endpoint.

In addition to improved resiliency, Regional endpoints have other benefits:

  • Isolation and containment — By making requests to an AWS STS endpoint in the same Region as your workloads, you can minimize cross-Region dependencies and align the scope of your resources with the scope of your temporary security credentials to help address availability and security concerns. For example, if your workloads are running in the US East (Ohio) Region, you can target the Regional AWS STS endpoint in the US East (Ohio) Region (us-east-2) to remove dependencies on other Regions.
  • Performance — By making your AWS STS requests to an endpoint that is closer to your services and applications, you can access AWS STS with lower latency and shorter response times.

Figure 2 illustrates the process for using an AWS principal to assume an AWS Identity and Access Management (IAM) role through the AWS STS AssumeRole API, which returns a set of temporary security credentials:

Figure 2: Assume an IAM role by using an API call to a Regional AWS STS endpoint

Calls to AWS STS within the same Region

You should configure your workloads within a specific Region to use only the Regional AWS STS endpoint for that Region. By using a Regional endpoint, you can use AWS STS in the same Region as your workloads, removing cross-Region dependency. For example, workloads in the US East (Ohio) Region should use only the Regional endpoint https://sts.us-east-2.amazonaws.com to call AWS STS. If a Regional AWS STS endpoint becomes unreachable, your workloads shouldn’t call AWS STS endpoints outside of the operating Region. If your workload has a multi-Region resiliency requirement, your other active or standby Region should use a Regional AWS STS endpoint for that Region and should be deployed such that the application can function despite a Regional failure. You should direct STS traffic to the STS endpoint within the same Region, isolated and independent from other Regions, and remove dependencies on the global (legacy) endpoint.
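
For example, the following sketch uses the AWS SDK for Python (Boto3) to assume an IAM role through the Regional AWS STS endpoint in US East (Ohio); the role ARN is a placeholder:

import boto3

# Create an STS client that targets the Regional endpoint in us-east-2.
sts_client = boto3.client(
    "sts",
    region_name="us-east-2",
    endpoint_url="https://sts.us-east-2.amazonaws.com",
)

response = sts_client.assume_role(
    RoleArn="arn:aws:iam::111122223333:role/example-role",  # placeholder role ARN
    RoleSessionName="regional-endpoint-demo",
)
credentials = response["Credentials"]  # temporary access key, secret key, and session token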

Calls to AWS STS from outside AWS

You should configure your workloads outside of AWS to call the appropriate Regional AWS STS endpoints that offer the lowest latency to your workload located outside of AWS. If your workload has a multi-Region resiliency requirement, build failover logic for AWS STS calls to other Regions in the event that Regional AWS STS endpoints become unreachable. Temporary security credentials obtained from Regional AWS STS endpoints are valid globally for the default session duration or duration that you specify.
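
The following is a minimal sketch of such failover logic with the AWS SDK for Python (Boto3); the Region list and role ARN are placeholders, and you should order the Regions by proximity to your workload:

import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

REGIONS = ["us-east-2", "us-west-2"]  # placeholder: nearest Regions first

def assume_role_with_failover(role_arn, session_name):
    # Try each Regional AWS STS endpoint in order until one succeeds.
    for region in REGIONS:
        sts = boto3.client(
            "sts",
            region_name=region,
            endpoint_url=f"https://sts.{region}.amazonaws.com",
        )
        try:
            return sts.assume_role(RoleArn=role_arn, RoleSessionName=session_name)
        except (ClientError, EndpointConnectionError):
            continue  # fall back to the next Regional endpoint
    raise RuntimeError("All configured Regional AWS STS endpoints failed")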

How to configure Regional AWS STS endpoints for your tools and SDKs

I recommend that you use the latest major versions of the AWS Command Line Interface (AWS CLI) or the AWS SDKs to call AWS STS APIs.

AWS CLI

By default, the AWS CLI version 2 sends AWS STS API requests to the Regional AWS STS endpoint for the currently configured Region. If you are using AWS CLI v2, you don’t need to make additional changes.

By default, the AWS CLI v1 sends AWS STS requests to the global (legacy) AWS STS endpoint. To check the version of the AWS CLI that you are using, run the following command: $ aws --version.

When you run AWS CLI commands, the AWS CLI looks for credential configuration in a specific order—first in shell environment variables and then in the local AWS configuration file (~/.aws/config).

AWS SDK

AWS SDKs are available for a variety of programming languages and environments. Since July 2022, major new versions of the AWS SDK default to Regional AWS STS endpoints and use the endpoint corresponding to the currently configured Region. If you use a major version of the AWS SDK that was released after July 2022, you don’t need to make additional changes.

An AWS SDK looks at various configuration locations until it finds credential configuration values. For example, the AWS SDK for Python (Boto3) adheres to the following lookup order when it searches through sources for configuration values:

  1. A configuration object created and passed as the AWS configuration parameter when creating a client
  2. Environment variables
  3. The AWS configuration file ~/.aws/config

If you still use AWS CLI v1, or your AWS SDK version doesn’t default to a Regional AWS STS endpoint, you have the following options to set the Regional AWS STS endpoint:

Option 1 — Use a shared AWS configuration file setting

The configuration file is located at ~/.aws/config on Linux or macOS, and at C:\Users\USERNAME\.aws\config on Windows. To use the Regional endpoint, add the sts_regional_endpoints parameter.

The following example shows how you can set the value for the Regional AWS STS endpoint in the US East (Ohio) Region (us-east-2), by using the default profile in the AWS configuration file:

[default]
region = us-east-2
sts_regional_endpoints = regional

The valid values for the AWS STS endpoint parameter (sts_regional_endpoints) are:

  • legacy (default) — Uses the global (legacy) AWS STS endpoint, sts.amazonaws.com.
  • regional — Uses the AWS STS endpoint for the currently configured Region.

Note: Since July 2022, major new versions of the AWS SDK default to Regional AWS STS endpoints and use the endpoint corresponding to the currently configured Region. If you are using AWS CLI v1, you must use version 1.16.266 or later to use the AWS STS endpoint parameter.

You can use the --debug option with the AWS CLI command to receive the debug log and validate which AWS STS endpoint was used.

$ aws sts get-caller-identity \
    --region us-east-2 \
    --debug

If you search for UseGlobalEndpoint in your debug log, you’ll find that the UseGlobalEndpoint parameter is set to False, and you’ll see the Regional endpoint provider fully qualified domain name (FQDN) when the Regional AWS STS endpoint is configured in a shared AWS configuration file or environment variables:

2023-09-11 18:51:11,300 - MainThread - botocore.regions - DEBUG - Calling endpoint provider with parameters: {'Region': 'us-east-2', 'UseDualStack': False, 'UseFIPS': False, 'UseGlobalEndpoint': False}
2023-09-11 18:51:11,300 - MainThread - botocore.regions - DEBUG - Endpoint provider result: https://sts.us-east-2.amazonaws.com

For a list of AWS SDKs that support shared AWS configuration file settings for Regional AWS STS endpoints, see Compatibility with AWS SDKs.

Option 2 — Use environment variables

Environment variables provide another way to specify configuration options. They are global and affect calls to AWS services. Most SDKs support environment variables. When you set the environment variable, the SDK uses that value until the end of your shell session or until you set the variable to a different value. To make the variables persist across future sessions, set them in your shell’s startup script.

The following example shows how you can set the value for the Regional AWS STS endpoint in the US East (Ohio) Region (us-east-2) by using environment variables:

Linux or macOS

$ export AWS_DEFAULT_REGION=us-east-2
$ export AWS_STS_REGIONAL_ENDPOINTS=regional

You can run the command $ (echo $AWS_DEFAULT_REGION; echo $AWS_STS_REGIONAL_ENDPOINTS) to validate the variables. The output should look similar to the following:

us-east-2
regional

Windows

C:\> set AWS_DEFAULT_REGION=us-east-2
C:\> set AWS_STS_REGIONAL_ENDPOINTS=regional

The following example shows how you can configure an STS client with the AWS SDK for Python (Boto3) to use a Regional AWS STS endpoint by setting the environment variable:

import boto3
import os
os.environ["AWS_DEFAULT_REGION"] = "us-east-2"
os.environ["AWS_STS_REGIONAL_ENDPOINTS"] = "regional"

You can use the metadata attribute sts_client.meta.endpoint_url to inspect and validate how an STS client is configured. The output should look similar to the following:

>>> sts_client = boto3.client("sts")
>>> sts_client.meta.endpoint_url
'https://sts.us-east-2.amazonaws.com'

For a list of AWS SDKs that support environment variable settings for Regional AWS STS endpoints, see Compatibility with AWS SDKs.

Option 3 — Construct an endpoint URL

You can also manually construct an endpoint URL for a specific Regional AWS STS endpoint.

The following example shows how you can configure the STS client with AWS SDK for Python (Boto3) to use a Regional AWS STS endpoint by setting a specific endpoint URL:

import boto3
sts_client = boto3.client('sts', region_name='us-east-2', endpoint_url='https://sts.us-east-2.amazonaws.com')

Use a VPC endpoint with AWS STS

You can create a private connection to AWS STS from the resources that you deployed in your Amazon VPCs. AWS STS integrates with AWS PrivateLink by using interface VPC endpoints. The network traffic on AWS PrivateLink stays on the global AWS network backbone and doesn’t traverse the public internet. When you configure a VPC endpoint for AWS STS, the traffic for the Regional AWS STS endpoint traverses to that endpoint.
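
As an illustration, the following sketch creates an interface VPC endpoint for AWS STS in the US East (Ohio) Region with the AWS SDK for Python (Boto3); the VPC, subnet, and security group IDs are placeholders:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-2")

# Create an interface VPC endpoint for AWS STS with private DNS enabled, so that
# sts.us-east-2.amazonaws.com resolves to the endpoint's private IP addresses.
ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",             # placeholder
    ServiceName="com.amazonaws.us-east-2.sts",
    SubnetIds=["subnet-0123456789abcdef0"],    # placeholder
    SecurityGroupIds=["sg-0123456789abcdef0"], # placeholder
    PrivateDnsEnabled=True,
)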

By default, the DNS in your VPC will update the entry for the Regional AWS STS endpoint to resolve to the private IP address of the VPC endpoint for AWS STS in your VPC. The following output from an Amazon Elastic Compute Cloud (Amazon EC2) instance shows the DNS name for the AWS STS endpoint resolving to the private IP address of the VPC endpoint for AWS STS:

[ec2-user@ip-10-120-136-166 ~]$ nslookup sts.us-east-2.amazonaws.com
Server:         10.120.0.2
Address:        10.120.0.2#53

Non-authoritative answer:
Name:   sts.us-east-2.amazonaws.com
Address: 10.120.138.148

After you create an interface VPC endpoint for AWS STS in your Region, set the value for the respective Regional AWS STS endpoint by using environment variables to access AWS STS in the same Region.

The output of the following log shows that an AWS STS call was made to the Regional AWS STS endpoint:

POST
/

content-type:application/x-www-form-urlencoded; charset=utf-8
host:sts.us-east-2.amazonaws.com

Log AWS STS requests

You can use AWS CloudTrail events to get information about the request and endpoint that was used for AWS STS. This information can help you identify AWS STS request patterns and validate if you are still using the global (legacy) STS endpoint.

An event in CloudTrail is the record of an activity in an AWS account. CloudTrail events provide a history of both API and non-API account activity made through the AWS Management Console, AWS SDKs, command line tools, and other AWS services.

Log locations

  • Requests to Regional AWS STS endpoints sts.<region-name>.amazonaws.com are logged in CloudTrail within their respective Region.
  • Requests to the global (legacy) STS endpoint sts.amazonaws.com are logged within the US East (N. Virginia) Region (us-east-1).

Log fields

  • Requests to Regional AWS STS endpoints and to the global (legacy) endpoint record the endpoint host in the tlsDetails field in CloudTrail. You can use this field to determine whether the request was made to a Regional or the global (legacy) endpoint.
  • Requests made through a VPC endpoint record the endpoint in the vpcEndpointId field in CloudTrail.

The following example shows a CloudTrail event for an STS request to a Regional AWS STS endpoint with a VPC endpoint.

"eventType": "AwsApiCall",
"managementEvent": true,
"recipientAccountId": "123456789012",
"vpcEndpointId": "vpce-021345abcdef6789",
"eventCategory": "Management",
"tlsDetails": {
    "tlsVersion": "TLSv1.2",
    "cipherSuite": "ECDHE-RSA-AES128-GCM-SHA256",
    "clientProvidedHostHeader": "sts.us-east-2.amazonaws.com"
}

The following example shows a CloudTrail event for an STS request to the global (legacy) AWS STS endpoint.

"eventType": "AwsApiCall",
"managementEvent": true,
"recipientAccountId": "123456789012",
"eventCategory": "Management",
"tlsDetails": {
    "tlsVersion": "TLSv1.2",
    "cipherSuite": "ECDHE-RSA-AES128-GCM-SHA256",
    "clientProvidedHostHeader": "sts.amazonaws.com"
}

To interactively search and analyze your AWS STS log data, use AWS CloudWatch Logs Insights or Amazon Athena.

CloudWatch Logs Insights

The following example shows how to run a CloudWatch Logs Insights query to look for API calls made to the global (legacy) AWS STS endpoint. Before you can query CloudTrail events, you must configure a CloudTrail trail to send events to CloudWatch Logs.

filter eventSource="sts.amazonaws.com" and tlsDetails.clientProvidedHostHeader="sts.amazonaws.com"
| fields eventTime, recipientAccountId, eventName, tlsDetails.clientProvidedHostHeader, sourceIPAddress, userIdentity.arn, @message
| sort eventTime desc

The query output shows event details for an AWS STS call made to the global (legacy) AWS STS endpoint https://sts.amazonaws.com.

Figure 3: Use a CloudWatch Logs Insights query to look for STS API calls

Amazon Athena

The following example shows how to query CloudTrail events with Amazon Athena and search for API calls made to the global (legacy) AWS STS endpoint.

SELECT
    eventtime,
    recipientaccountid,
    eventname,
    tlsdetails.clientProvidedHostHeader,
    sourceipaddress,
    eventid,
    useridentity.arn
FROM "cloudtrail_logs"
WHERE
    eventsource = 'sts.amazonaws.com' AND
    tlsdetails.clientProvidedHostHeader = 'sts.amazonaws.com'
ORDER BY eventtime DESC

The query output shows STS calls made to the global (legacy) AWS STS endpoint https://sts.amazonaws.com.

Figure 4: Use Athena to search for STS API calls and identify STS endpoints

Conclusion

In this post, you learned how to use Regional AWS STS endpoints to help improve resiliency, reduce latency, and increase session token usage for the operating Regions in your AWS environment.

AWS recommends that you check the configuration and usage of AWS STS endpoints in your environment, validate AWS STS activity in your CloudTrail logs, and confirm that Regional AWS STS endpoints are used.

If you have questions, post them in the Security Identity and Compliance re:Post topic or reach out to AWS Support.

Want more AWS Security news? Follow us on X.

Darius Januskis

Darius is a Senior Solutions Architect at AWS helping global financial services customers in their journey to the cloud. He is a passionate technology enthusiast who enjoys working with customers and helping them build well-architected solutions. His core interests include security, DevOps, automation, and serverless technologies.

How to automate rule management for AWS Network Firewall

Post Syndicated from Ajinkya Patil original https://aws.amazon.com/blogs/security/how-to-automate-rule-management-for-aws-network-firewall/

AWS Network Firewall is a stateful, managed network firewall and intrusion detection and prevention service for Amazon Virtual Private Cloud (Amazon VPC). This post concentrates on automating rule updates in a central Network Firewall by using distributed firewall configurations. If you’re new to Network Firewall or seeking a technical background on rule management, see AWS Network Firewall – New Managed Firewall Service in VPC.

Network Firewall offers three deployment models: Distributed, centralized, and combined. Many customers opt for a centralized model to reduce costs. In this model, customers allocate the responsibility for managing the rulesets to the owners of the VPC infrastructure (spoke accounts) being protected, thereby shifting accountability and providing flexibility to the spoke accounts. Managing rulesets in a shared firewall policy generated from distributed input configurations of protected VPCs (spoke accounts) is challenging without proper input validation, state-management, and request throttling controls.

In this post, we show you how to automate firewall rule management within the central firewall using distributed firewall configurations spread across multiple AWS accounts. The anfw-automate solution provides input-validation, state-management, and throttling controls, reducing the update time for firewall rule changes from minutes to seconds. Additionally, the solution reduces operational costs, including rule management overhead while integrating seamlessly with the existing continuous integration and continuous delivery (CI/CD) processes.

Prerequisites

For this walkthrough, the following prerequisites must be met:

  • Basic knowledge of networking concepts such as routing and Classless Inter-Domain Routing (CIDR) range allocations.
  • Basic knowledge of YAML and JSON configuration formats, definitions, and schema.
  • Basic knowledge of Suricata Rule Format and Network Firewall rule management.
  • Basic knowledge of CDK deployment.
  • AWS Identity and Access Management (IAM) permissions to bootstrap the AWS accounts using AWS Cloud Development Kit (AWS CDK).
  • The firewall VPC in the central account must be reachable from a spoke account (see centralized deployment model). For this solution, you need two AWS accounts from the centralized deployment model:
    • The spoke account is the consumer account that defines firewall rules for the account and uses central firewall endpoints for traffic filtering. At least one spoke account is required to simulate the user workflow in the validation phase.
    • The central account is the account that contains the firewall endpoints. This account hosts the application and the Network Firewall.
  • StackSets deployment with service-managed permissions must be enabled in AWS Organizations (Activate trusted access with AWS Organizations). A delegated administrator account is required to deploy AWS CloudFormation stacks in any account in an organization. The CloudFormation StackSets in this account deploy the necessary CloudFormation stacks in the spoke accounts. If you don’t have a delegated administrator account, you must manually deploy the resources in the spoke account. Manual deployment isn’t recommended in production environments.
  • A resource account is the CI/CD account used to deploy necessary AWS CodePipeline stacks. The pipelines deploy relevant cross-account cross-AWS Region stacks to the preceding AWS accounts.
    • IAM permissions to deploy CDK stacks in the resource account.

Solution description

In Network Firewall, each firewall endpoint connects to one firewall policy, which defines network traffic monitoring and filtering behavior. The details of the behavior are defined in rule groups — a reusable set of rules — for inspecting and handling network traffic. The rules in the rule groups provide the details for packet inspection and specify the actions to take when a packet matches the inspection criteria. Network Firewall uses a Suricata rules engine to process all stateful rules. Currently, you can create Suricata compatible or basic rules (such as domain list) in Network Firewall. We use Suricata compatible rule strings within this post to maintain maximum compatibility with most use cases.
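
As an illustration of Suricata compatible rule strings, the following sketch creates a stateful Network Firewall rule group from a flat Suricata rule string with the AWS SDK for Python (Boto3). The rule group name, capacity, and rule are examples only and are not produced by the anfw-automate solution:

import boto3

network_firewall = boto3.client("network-firewall", region_name="eu-west-1")

# A single Suricata-compatible pass rule that allows outbound HTTPS to one domain.
suricata_rules = (
    'pass tls $HOME_NET any -> $EXTERNAL_NET 443 '
    '(tls.sni; content:"example.com"; msg:"allow example.com"; sid:1000001; rev:1;)'
)

network_firewall.create_rule_group(
    RuleGroupName="example-stateful-rules",  # placeholder name
    Type="STATEFUL",
    Capacity=100,
    Rules=suricata_rules,
    Description="Example Suricata-compatible stateful rules",
)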

Figure 1 describes how the anfw-automate solution uses the distributed firewall rule configurations to simplify rule management for multiple teams. The rules are validated, transformed, and stored in the central AWS Network Firewall policy. This solution isolates the rule generation to the spoke AWS accounts, but still uses a shared firewall policy and a central ANFW for traffic filtering. This approach grants the AWS spoke account owners the flexibility to manage their own firewall rules while maintaining the accountability for their rules in the firewall policy. The solution enables the central security team to validate and override user defined firewall rules before pushing them to the production firewall policy. The security team operating the central firewall can also define additional rules that are applied to all spoke accounts, thereby enforcing organization-wide security policies. The firewall rules are then compiled and applied to Network Firewall in seconds, providing near real-time response in scenarios involving critical security incidents.

Figure 1: Workflow launched by uploading a configuration file to the configuration (config) bucket

The Network Firewall firewall endpoints and anfw-automate solution are both deployed in the central account. The spoke accounts use the application for rule automation and the Network Firewall for traffic inspection.

As shown in Figure 1, each spoke account contains the following:

  1. An Amazon Simple Storage Service (Amazon S3) bucket to store multiple configuration files, one per Region. The rules defined in the configuration files are applicable to the VPC traffic in the spoke account. The configuration files must comply with the defined naming convention ($Region-config.yaml) and be validated to make sure that only one configuration file exists per Region per account. The S3 bucket has event notifications enabled that publish all changes to configuration files to a local default bus in Amazon EventBridge.
  2. EventBridge rules to monitor the default bus and forward relevant events to the custom event bus in the central account. The EventBridge rules specifically monitor VPCDelete events published by Amazon CloudTrail and S3 event notifications. When a VPC is deleted from the spoke account, the VPCDelete events lead to the removal of corresponding rules from the firewall policy. Additionally, all create, update, and delete events from Amazon S3 event notifications invoke corresponding actions on the firewall policy.
  3. Two AWS Identity and Access Management (IAM) roles with keywords xaccount.lmb.rc and xaccount.lmb.re are assumed by the RuleCollect and RuleExecute functions in the central account, respectively.
  4. A CloudWatch Logs log group to store event processing logs published by the central AWS Lambda application.

In the central account:

  1. EventBridge rules monitor the custom event bus and invoke a Lambda function called RuleCollect. A dead-letter queue is attached to the EventBridge rules to store events that failed to invoke the Lambda function.
  2. The RuleCollect function retrieves the config file from the spoke account by assuming a cross-account role. This role is deployed by the same stack that created the other spoke account resources. The Lambda function validates the request, transforms the request to the Suricata rule syntax, and publishes the rules to an Amazon Simple Queue Service (Amazon SQS) first-in-first-out (FIFO) queue. Input validation controls are paramount to make sure that users don’t abuse the functionality of the solution and bypass central governance controls. The Lambda function has input validation controls to verify the following:
    • The VPC ID in the configuration file exists in the configured Region and the same AWS account as the S3 bucket.
    • The Amazon S3 object version ID received in the event matches the latest version ID to mitigate race conditions.
    • The rules don’t contain only top-level domains (for example, .com or .de).
    • The custom Suricata rules don’t use any as the destination IP address or domain.
    • The VPC identifier matches the required format, that is, a+(AWS Account ID)+(VPC ID without vpc- prefix) in custom rules. This is important to have unique rule variables in rule groups.
    • The rules don’t use security sensitive keywords such as sid, priority, or metadata. These keywords are reserved for firewall administrators and the Lambda application.
    • The configured VPC is attached to an AWS Transit Gateway.
    • Only pass rules exist in the rule configuration.
    • CIDR ranges for a VPC are mapped appropriately using IP set variables.

    The input validations make sure that rules defined by one spoke account don’t impact the rules from other spoke accounts. The validations applied to the firewall rules can be updated and managed as needed based on your requirements. The rules created must follow a strict format, and deviation from the preceding rules will lead to the rejection of the request.

  3. The Amazon SQS FIFO queue preserves the order of create, update, and delete operations run in the configuration bucket of the spoke account. These state-management controls maintain consistency between the firewall rules in the configuration file within the S3 bucket and the rules in the firewall policy. If the sequence of updates provided by the distributed configurations isn’t honored, the rules in a firewall policy might not match the expected ruleset.

    Rules not processed beyond the maxReceiveCount threshold are moved to a dead-letter SQS queue for troubleshooting.

  4. The Amazon SQS messages are subsequently consumed by another Lambda function called RuleExecute. Multiple changes to one configuration are batched together in one message. The RuleExecute function parses the messages and generates the required rule groups, IP set variables, and rules within the Network Firewall. Additionally, the Lambda function establishes a reserved rule group, which can be administered by the solution’s administrators and used to define global rules. The global rules, applicable to participating AWS accounts, can be managed in the data/defaultdeny.yaml file by the central security team.

    The RuleExecute function also implements throttling controls to make sure that rules are applied to the firewall policy without reaching the ThrottlingException from Network Firewall (see common errors). The function also implements back-off logic to handle this exception. This throttling effect can happen if there are too many requests issued to the Network Firewall API.

    The function makes cross-Region calls to Network Firewall based on the Region provided in the user configuration. There is no need to deploy the RuleExecute and RuleCollect Lambda functions in multiple Regions unless a use case warrants it.

Walkthrough

The following section guides you through the deployment of the rules management engine.

  • Deployment: Outlines the steps to deploy the solution into the target AWS accounts.
  • Validation: Describes the steps to validate the deployment and ensure the functionality of the solution.
  • Cleaning up: Provides instructions for cleaning up the deployment.

Deployment

In this phase, you deploy the application pipeline in the resource account. The pipeline is responsible for deploying multi-Region cross-account CDK stacks in both the central account and the delegated administrator account.

If you don’t have a functioning Network Firewall firewall using the centralized deployment model in the central account, see the README for instructions on deploying Amazon VPC and Network Firewall stacks before proceeding. You need to deploy the Network Firewall in centralized deployment in each Region and Availability Zone used by spoke account VPC infrastructure.

The application pipeline stack deploys three stacks in all configured Regions: LambdaStack and ServerlessStack in the central account and StacksetStack in the delegated administrator account. It’s recommended to deploy these stacks solely in the primary Region, given that the solution can effectively manage firewall policies across all supported Regions.

  • LambdaStack deploys the RuleCollect and RuleExecute Lambda functions, Amazon SQS FIFO queue, and SQS FIFO dead-letter queue.
  • ServerlessStack deploys EventBridge bus, EventBridge rules, and EventBridge Dead-letter queue.
  • StacksetStack deploys a service-managed stack set in the delegated administrator account. The stack set includes the deployment of IAM roles, EventBridge rules, an S3 Bucket, and a CloudWatch log group in the spoke account. If you’re manually deploying the CloudFormation template (templates/spoke-serverless-stack.yaml) in the spoke account, you have the option to disable this stack in the application configuration.
     
    Figure 2: CloudFormation stacks deployed by the application pipeline

To prepare for bootstrapping

  1. Install and configure profiles for all AWS accounts using the AWS Command Line Interface (AWS CLI)
  2. Install the AWS Cloud Development Kit (AWS CDK)
  3. Install Git and clone the GitHub repo
  4. Install and enable Docker Desktop

To prepare for deployment

  1. Follow the README and cdk bootstrapping guide to bootstrap the resource account. Then, bootstrap the central account and delegated administrator account (optional if StacksetStack is deployed manually in the spoke account) to trust the resource account. The spoke accounts don’t need to be bootstrapped.
  2. Create a folder to be referred to as <STAGE>, where STAGE is the name of your deployment stage — for example, local, dev, int, and so on — in the conf folder of the cloned repository. The deployment stage is set as the STAGE parameter later and used in the AWS resource names.
  3. Create global.json in the <STAGE> folder. Follow the README to update the parameter values. A sample JSON file is provided in conf/sample folder.
  4. Run the following commands to configure the local environment:
    npm install
    export STAGE=<STAGE>
    export AWS_REGION=<AWS_Region_to_deploy_pipeline_stack>

To deploy the application pipeline stack

  1. Create a file named app.json in the <STAGE> folder and populate the parameters in accordance with the README section and defined schema.
  2. If you choose to manage the deployment of spoke account stacks using the delegated administrator account and have set the deploy_stacksets parameter to true, create a file named stackset.json in the <STAGE> folder. Follow the README section to align with the requirements of the defined schema.

    You can also deploy the spoke account stack manually for testing using the AWS CloudFormation template in templates/spoke-serverless-stack.yaml. This will create and configure the needed spoke account resources.

  3. Run the following commands to deploy the application pipeline stack:
    export STACKNAME=app && make deploy

    Figure 3: Example output of application pipeline deployment

After deploying the solution, each spoke account is required to configure stateful rules for every VPC in the configuration file and upload it to the S3 bucket. Each spoke account owner must verify the VPC’s connection to the firewall using the centralized deployment model. The configuration, presented in the YAML configuration language, might encompass multiple rule definitions. Each account must furnish one configuration file per VPC to establish accountability and non-repudiation.

Validation

Now that you’ve deployed the solution, follow the next steps to verify that it’s completed as expected, and then test the application.

To validate deployment

  1. Sign in to the AWS Management Console using the resource account and go to CodePipeline.
  2. Verify the existence of a pipeline named cpp-app-<aws_organization_scope>-<project_name>-<module_name>-<STAGE> in the configured Region.
  3. Verify that stages exist in each pipeline for all configured Regions.
  4. Confirm that all pipeline stages exist. The LambdaStack and ServerlessStack stages must exist in the cpp-app-<aws_organization_scope>-<project_name>-<module_name>-<STAGE> stack. The StacksetStack stage must exist if you set the deploy_stacksets parameter to true in global.json.

To validate the application

  1. Sign in and open the Amazon S3 console using the spoke account.
  2. Follow the schema defined in app/RuleCollect/schema.json and create a file with naming convention ${Region}-config.yaml. Note that the Region in the config file is the destination Region for the firewall rules. Verify that the file has valid VPC data and rules.
    Figure 4: Example configuration file for eu-west-1 Region

  3. Upload the newly created config file to the S3 bucket named anfw-allowlist-<AWS_REGION for application stack>-<Spoke Account ID>-<STAGE>.
  4. If the data in the config file is invalid, you will see ERROR and WARN logs in the CloudWatch log group named cw-<aws_organization_scope>-<project_name>-<module_name>-CustomerLog-<STAGE>.
  5. If all the data in the config file is valid, you will see INFO logs in the same CloudWatch log group.
    Figure 5: Example of logs generated by the anfw-automate in a spoke account

  6. After the successful processing of the rules, sign in to the Network Firewall console using the central account.
  7. Navigate to the Network Firewall rule groups and search for a rule group with a randomly assigned numeric name. This rule group will contain your Suricata rules after the transformation process.
    Figure 6: Rules created in Network Firewall rule group based on the configuration file in Figure 4

  8. Access the Network Firewall rule group identified by the suffix reserved. This rule group is designated for administrators and global rules. Confirm that the rules specified in app/data/defaultdeny.yaml have been transformed into Suricata rules and are correctly placed within this rule group.
  9. Instantiate an EC2 instance in the VPC specified in the configuration file and try to access both the destinations allowed in the file and any destination not listed. Note that requests to destinations not defined in the configuration file are blocked.

Cleaning up

To avoid incurring future charges, remove all stacks and instances used in this walkthrough.

  1. Sign in to both the central account and the delegated admin account. Manually delete the stacks in the Regions configured for the app parameter in global.json. Ensure that the stacks are deleted for all Regions specified for the app parameter. You can filter the stack names using the keyword <aws_organization_scope>-<project_name>-<module_name> as defined in global.json.
  2. After deleting the stacks, remove the pipeline stacks using the same command as during deployment, replacing cdk deploy with cdk destroy.
  3. Terminate or stop the EC2 instance used to test the application.

Conclusion

This solution simplifies network security management by combining distributed AWS Network Firewall configurations into a centralized policy. Automated rule management can help reduce operational overhead, shorten firewall change request completion times from minutes to seconds, offload security and operational mechanisms such as input validation, state management, and request throttling, and enable central security teams to enforce global firewall rules without compromising the flexibility of user-defined rulesets.

In addition to using this application through S3 bucket configuration management, you can integrate this tool with GitHub Actions in your CI/CD pipeline to upload the firewall rule configuration to an S3 bucket. With GitHub Actions, you can combine configuration file updates with automated release pipeline checks, such as schema validation and manual approvals. This enables your team to maintain and change firewall rule definitions within your existing CI/CD processes and tools. You can go further by allowing access to the S3 bucket only through the CI/CD pipeline.

Finally, you can ingest the AWS Network Firewall logs into one of our partner solutions for security information and event management (SIEM), security monitoring, threat intelligence, and managed detection and response (MDR). You can launch automatic rule updates based on security events detected by these solutions, which can help reduce the response time for security events.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Ajinkya Patil

Ajinkya is a Security Consultant at Amazon Professional Services, specializing in security consulting for AWS customers within the automotive industry since 2019. He has presented at AWS re:Inforce and contributed articles to the AWS Security blog and AWS Prescriptive Guidance. Beyond his professional commitments, he indulges in travel and photography.

Stephan Traub

Stephan is a Security Consultant working for automotive customers at AWS Professional Services. He is a technology enthusiast and passionate about helping customers gain a high security bar in their cloud infrastructure. When Stephan isn’t working, he’s playing volleyball or traveling with his family around the world.

Enhance monitoring and debugging for AWS Glue jobs using new job observability metrics: Part 2

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/part-2-enhance-monitoring-and-debugging-for-aws-glue-jobs-using-new-job-observability-metrics/

Monitoring data pipelines in real time is critical for catching issues early and minimizing disruptions. AWS Glue has made this more straightforward with the launch of AWS Glue job observability metrics, which provide valuable insights into your data integration pipelines built on AWS Glue. However, you might need to track key performance indicators across multiple jobs. In this case, a dashboard that can visualize the same metrics with the ability to drill down into individual issues is an effective solution to monitor at scale.

This post walks through how to integrate AWS Glue job observability metrics with Grafana using Amazon Managed Grafana. We discuss the types of metrics and charts available to surface key insights, along with two use cases on monitoring error classes and throughput of your AWS Glue jobs.

Solution overview

Grafana is an open source visualization tool that allows you to query, visualize, alert on, and understand your metrics no matter where they are stored. With Grafana, you can create, explore, and share visually rich, data-driven dashboards. The new AWS Glue job observability metrics can be effortlessly integrated with Grafana for real-time monitoring purpose. Metrics like worker utilization, skewness, I/O rate, and errors are captured and visualized in easy-to-read Grafana dashboards. The integration with Grafana provides a flexible way to build custom views of pipeline health tailored to your needs. Observability metrics open up monitoring capabilities that weren’t possible before for AWS Glue. Companies relying on AWS Glue for critical data integration pipelines can have greater confidence that their pipelines are running efficiently.

AWS Glue job observability metrics are emitted as Amazon CloudWatch metrics. You can provision and manage Amazon Managed Grafana, and configure the CloudWatch plugin for the given metrics. The following diagram illustrates the solution architecture.
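
Before you wire up Grafana, you can confirm that the observability metrics are arriving in the Glue namespace. The following sketch uses the AWS SDK for Python (Boto3); the Region is a placeholder, and the metric-name filter is only an example based on the error metrics discussed later in this post:

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # placeholder Region

# List the AWS Glue job observability metrics emitted into the "Glue" namespace.
paginator = cloudwatch.get_paginator("list_metrics")
for page in paginator.paginate(Namespace="Glue"):
    for metric in page["Metrics"]:
        dimensions = {d["Name"]: d["Value"] for d in metric["Dimensions"]}
        # Example: surface error-class metrics such as glue.error.DISK_NO_SPACE_ERROR.
        if metric["MetricName"].startswith("glue.error."):
            print(metric["MetricName"], dimensions)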

Implement the solution

Complete the following steps to set up the solution:

  1. Set up an Amazon Managed Grafana workspace.
  2. Sign in to your workspace.
  3. Choose Administration.
  4. Choose Add new data source.
  5. Choose CloudWatch.
  6. For Default Region, select your preferred AWS Region.
  7. For Namespaces of Custom Metrics, enter Glue.
  8. Choose Save & test.

Now the CloudWatch data source has been registered.

  1. Copy the data source ID from the URL https://g-XXXXXXXXXX.grafana-workspace.<region>.amazonaws.com/datasources/edit/<data-source-ID>/.

The next step is to prepare the JSON template file.

  1. Download the Grafana template.
  2. Replace <data-source-id> in the JSON file with your Grafana data source ID.

Lastly, configure the dashboard.

  1. On the Grafana console, choose Dashboards.
  2. Choose Import on the New menu.
  3. Upload your JSON file, and choose Import.

The Grafana dashboard visualizes AWS Glue observability metrics, as shown in the following screenshots.

The sample dashboard has the following charts:

  • [Reliability] Job Run Errors Breakdown
  • [Throughput] Bytes Read & Write
  • [Throughput] Records Read & Write
  • [Resource Utilization] Worker Utilization
  • [Job Performance] Skewness
  • [Resource Utilization] Disk Used (%)
  • [Resource Utilization] Disk Available (GB)
  • [Executor OOM] OOM Error Count
  • [Executor OOM] Heap Memory Used (%)
  • [Driver OOM] OOM Error Count
  • [Driver OOM] Heap Memory Used (%)

Analyze the causes of job failures

Let’s try analyzing the causes of job run failures of the job iot_data_processing.

First, look at the pie chart [Reliability] Job Run Errors Breakdown. This pie chart quickly identifies which errors are most common.

Then filter with the job name iot_data_processing to see the common errors for this job.

We can observe that the majority (75%) of failures were due to glue.error.DISK_NO_SPACE_ERROR.

Next, look at the line chart [Resource Utilization] Disk Used (%) to understand the driver’s used disk space during the job runs. For this job, the green line shows the driver’s disk usage, and the yellow line shows the average of the executors’ disk usage.

We can observe that there were three times when 100% of disk was used in executors.

Next, look at the line chart [Throughput] Records Read & Write to see whether the data volume was changed and whether it impacted disk usage.

The chart shows that around four billion records were read at the beginning of this range; however, around 63 billion records were read at the peak. This means that the incoming data volume has significantly increased, and caused local disk space shortage in the worker nodes. For such cases, you can increase the number of workers, enable auto scaling, or choose larger worker types.
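
The following is a minimal sketch of applying these mitigations on the next run with the AWS SDK for Python (Boto3); the worker type and worker count are assumptions to size for your data volume, and --enable-auto-scaling is the AWS Glue job parameter that turns on auto scaling:

import boto3

glue = boto3.client("glue", region_name="us-east-1")  # placeholder Region

# Re-run the job with larger workers, more capacity, and auto scaling enabled.
glue.start_job_run(
    JobName="iot_data_processing",
    WorkerType="G.2X",    # larger workers with more local disk
    NumberOfWorkers=20,   # assumption: sized for the increased data volume
    Arguments={"--enable-auto-scaling": "true"},
)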

After implementing those suggestions, we can see lower disk usage and a successful job run.

(Optional) Configure cross-account setup

We can optionally configure a cross-account setup. Cross-account metrics depend on CloudWatch cross-account observability. In this setup, we expect the following environment:

  • AWS accounts are not managed in AWS Organizations
  • You have two accounts: one is the monitoring account, where Grafana is located, and the other is the source account, where the AWS Glue-based data integration pipeline is located

To configure a cross-account setup for this environment, complete the following steps for each account.

Monitoring account

Complete the following steps to configure your monitoring account:

  1. Sign in to the AWS Management Console using the account you will use for monitoring.
  2. On the CloudWatch console, choose Settings in the navigation pane.
  3. Under Monitoring account configuration, choose Configure.
  4. For Select data, choose Metrics.
  5. For List source accounts, enter the AWS account ID of the source account that this monitoring account will view.
  6. For Define a label to identify your source account, choose Account name.
  7. Choose Configure.

Now the account is successfully configured as a monitoring account.

  1. Under Monitoring account configuration, choose Resources to link accounts.
  2. Choose Any account to get a URL for setting up individual accounts as source accounts.
  3. Choose Copy URL.

You will use the copied URL from the source account in the next steps.

Source account

Complete the following steps to configure your source account:

  1. Sign in to the console using your source account.
  2. Enter the URL that you copied from the monitoring account.

You can see the CloudWatch settings page, with some information filled in.

  1. For Select data, choose Metrics.
  2. Do not change the ARN in Enter monitoring account configuration ARN.
  3. The Define a label to identify your source account section is pre-filled with the label choice from the monitoring account. Optionally, choose Edit to change it.
  4. Choose Link.
  5. Enter Confirm in the box and choose Confirm.

Now your source account has been configured to link to the monitoring account. The metrics emitted in the source account will show on the Grafana dashboard in the monitoring account.

To learn more, see CloudWatch cross-account observability.
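
If you prefer to automate this setup, CloudWatch cross-account observability exposes the same capability through the oam APIs. The following sketch with the AWS SDK for Python (Boto3) is an assumption-level outline: account IDs are placeholders, the sink policy is simplified, and the two clients are shown together even though they must run with credentials from different accounts:

import json
import boto3

# In the monitoring account: create a sink and allow the source account to link to it.
oam_monitoring = boto3.client("oam", region_name="us-east-1")  # placeholder Region
sink_arn = oam_monitoring.create_sink(Name="monitoring-sink")["Arn"]
oam_monitoring.put_sink_policy(
    SinkIdentifier=sink_arn,
    Policy=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": "444455556666"},  # placeholder source account
            "Action": ["oam:CreateLink", "oam:UpdateLink"],
            "Resource": "*",
            "Condition": {"ForAllValues:StringEquals": {
                "oam:ResourceTypes": ["AWS::CloudWatch::Metric"]}},
        }],
    }),
)

# In the source account: link to the monitoring account's sink to share metrics.
oam_source = boto3.client("oam", region_name="us-east-1")  # placeholder Region
oam_source.create_link(
    LabelTemplate="$AccountName",
    ResourceTypes=["AWS::CloudWatch::Metric"],
    SinkIdentifier=sink_arn,
)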

Considerations

The following are some considerations when using this solution:

  • Grafana integration is defined for real-time monitoring. If you have a basic understanding of your jobs, it will be straightforward for you to monitor performance, errors, and more on the Grafana dashboard.
  • Amazon Managed Grafana depends on AWS IAM Identity Center. This means you need to manage single sign-on (SSO) users separately, not just AWS Identity and Access Management (IAM) users and roles. It also requires another sign-in step from the AWS console. The Amazon Managed Grafana pricing model depends on an active user license per workspace. More users can cause more charges.
  • Graph lines are visualized per job. If you want to see the lines across all the jobs, you can choose ALL in the control.

Conclusion

AWS Glue job observability metrics offer a powerful new capability for monitoring data pipeline performance in real time. By streaming key metrics to CloudWatch and visualizing them in Grafana, you gain more fine-grained visibility that wasn’t possible before. This post showed how straightforward it is to enable observability metrics and integrate the data with Grafana using Amazon Managed Grafana. We explored the different metrics available and how to build customized Grafana dashboards to surface actionable insights.

Observability is now an essential part of robust data orchestration on AWS. With the ability to monitor data integration trends in real time, you can optimize costs, performance, and reliability.


About the Authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his new road bike.

Xiaoxi Liu is a Software Development Engineer on the AWS Glue team. Her passion is building scalable distributed systems for efficiently managing big data on the cloud, and her concentrations are distributed system, big data, and cloud computing.

Akira Ajisaka is a Senior Software Development Engineer on the AWS Glue team. He likes open source software and distributed systems. In his spare time, he enjoys playing arcade games.

Shenoda Guirguis is a Senior Software Development Engineer on the AWS Glue team. His passion is in building scalable and distributed data infrastructure and processing systems. When he gets a chance, Shenoda enjoys reading and playing soccer.

Sean Ma is a Principal Product Manager on the AWS Glue team. He has an 18-year track record of innovating and delivering enterprise products that unlock the power of data for users. Outside of work, Sean enjoys scuba diving and college football.

Mohit Saxena is a Senior Software Development Manager on the AWS Glue team. His team focuses on building distributed systems to enable customers with interactive and simple to use interfaces to efficiently manage and transform petabytes of data seamlessly across data lakes on Amazon S3, databases and data-warehouses on cloud.

Automate AWS Clean Rooms querying and dashboard publishing using AWS Step Functions and Amazon QuickSight – Part 2

Post Syndicated from Venkata Kampana original https://aws.amazon.com/blogs/big-data/automate-aws-clean-rooms-querying-and-dashboard-publishing-using-aws-step-functions-and-amazon-quicksight-part-2/

Public health organizations need access to data insights that they can quickly act upon, especially in times of health emergencies, when data needs to be updated multiple times daily. For example, during the COVID-19 pandemic, access to timely data insights was critically important for public health agencies worldwide as they coordinated emergency response efforts. Up-to-date information and analysis empowered organizations to monitor the rapidly changing situation and direct resources accordingly.

This is the second post in this series; we recommend that you read the first post before diving deep into this solution. In our first post, Enable data collaboration among public health agencies with AWS Clean Rooms – Part 1, we showed how public health agencies can create AWS Clean Rooms collaborations, invite other stakeholders to join the collaboration, and run queries on their collective data without either party having to share or copy underlying data with each other. As mentioned in the previous post, AWS Clean Rooms enables multiple organizations to analyze their data and unlock insights they can act upon, without having to share sensitive, restricted, or proprietary records.

However, public health organization leaders and decision-making officials don’t directly access data collaboration outputs from their Amazon Simple Storage Service (Amazon S3) buckets. Instead, they rely on up-to-date dashboards that help them visualize data insights to make informed decisions quickly.

To ensure these dashboards showcase the most up-to-date insights, the organization’s builders and data architects need to catalog and update AWS Clean Rooms collaboration outputs on an ongoing basis, which often involves repetitive and manual processes that, if not done well, could delay the organization’s access to the latest data insights.

Manually handling repetitive daily tasks at scale poses risks like delayed insights, miscataloged outputs, or broken dashboards. At a large volume, it would require around-the-clock staffing, straining budgets. This manual approach could expose decision-makers to inaccurate or outdated information.

Automating repetitive workflows, validation checks, and programmatic dashboard refreshes removes human bottlenecks and helps decrease inaccuracies. Automation helps ensure continuous, reliable processes that deliver the most current data insights to leaders without delays, all while streamlining resources.

In this post, we explain an automated workflow using AWS Step Functions and Amazon QuickSight to help organizations access the most current results and analyses, without delays from manual data handling steps. This workflow implementation will empower decision-makers with real-time visibility into the evolving collaborative analysis outputs, ensuring they have up-to-date, relevant insights that they can act upon quickly.

Solution overview

The following reference architecture illustrates some of the foundational components of clean rooms query automation and publishing dashboards using AWS services. We automate running queries using Step Functions with Amazon EventBridge schedules, build an AWS Glue Data Catalog on query outputs, and publish dashboards using QuickSight so they automatically refresh with new data. This allows public health teams to monitor the most recent insights without manual updates.

The architecture consists of the following components, as numbered in the preceding figure:

  1. A scheduled event rule on EventBridge triggers a Step Functions workflow.
  2. The Step Functions workflow initiates the run of a query using the StartProtectedQuery AWS Clean Rooms API. The submitted query runs securely within the AWS Clean Rooms environment, ensuring data privacy and compliance. The results of the query are then stored in a designated S3 bucket, with a unique protected query ID serving as the prefix for the stored data. This unique identifier is generated by AWS Clean Rooms for each query run, maintaining clear segregation of results.
  3. When the AWS Clean Rooms query is successfully complete, the Step Functions workflow calls the AWS Glue API to update the location of the table in the AWS Glue Data Catalog with the Amazon S3 location where the query results were uploaded in Step 2.
  4. Amazon Athena uses the catalog from the Data Catalog to query the information using standard SQL.
  5. QuickSight is used to query, build visualizations, and publish dashboards using the data from the query results.
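
To make steps 2 and 3 concrete, the following Python sketch shows the AWS Clean Rooms and AWS Glue API calls that the state machine orchestrates, using boto3. The membership ID, bucket, prefix, database, and table names are placeholders for illustration; the deployed Step Functions workflow expresses the same logic as state machine tasks rather than a script.

import time
import boto3

cleanrooms = boto3.client("cleanrooms")
glue = boto3.client("glue")

MEMBERSHIP_ID = "your-membership-id"        # placeholder
RESULT_BUCKET = "your-clean-rooms-results"  # placeholder
RESULT_PREFIX = "results"

# Step 2: submit a protected query that runs inside AWS Clean Rooms
query = cleanrooms.start_protected_query(
    type="SQL",
    membershipIdentifier=MEMBERSHIP_ID,
    sqlParameters={"queryString": "SELECT region, COUNT(*) FROM patientimmunization GROUP BY region"},
    resultConfiguration={
        "outputConfiguration": {
            "s3": {"resultFormat": "CSV", "bucket": RESULT_BUCKET, "keyPrefix": RESULT_PREFIX}
        }
    },
)["protectedQuery"]
query_id = query["id"]

# Wait for the query to finish, mirroring the wait state in the workflow
status = query["status"]
while status not in ("SUCCESS", "FAILED", "CANCELLED", "TIMED_OUT"):
    time.sleep(30)
    status = cleanrooms.get_protected_query(
        membershipIdentifier=MEMBERSHIP_ID, protectedQueryIdentifier=query_id
    )["protectedQuery"]["status"]

# Step 3: point the Data Catalog table at the new result prefix so Athena and QuickSight see fresh data
if status == "SUCCESS":
    table = glue.get_table(DatabaseName="cleanrooms_patientdb", Name="patientimmunization")["Table"]
    table["StorageDescriptor"]["Location"] = f"s3://{RESULT_BUCKET}/{RESULT_PREFIX}/{query_id}/"
    allowed_keys = ("Name", "Description", "Owner", "Retention", "StorageDescriptor",
                    "PartitionKeys", "TableType", "Parameters")
    glue.update_table(
        DatabaseName="cleanrooms_patientdb",
        TableInput={k: table[k] for k in allowed_keys if k in table},
    )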

Prerequisites

For this walkthrough, you need the following:

Launch the CloudFormation stack

In this post, we provide a CloudFormation template to create the following resources:

  • An EventBridge rule that triggers the Step Functions state machine on a schedule
  • An AWS Glue database and a catalog table
  • An Athena workgroup
  • Three S3 buckets:
    • For AWS Clean Rooms to upload the results of query runs
    • For Athena to upload the results for the queries
    • For storing access logs of other buckets
  • A Step Functions workflow designed to run the AWS Clean Rooms query, upload the results to an S3 bucket, and update the table location with the S3 path in the AWS Glue Data Catalog
  • An AWS Key Management Service (AWS KMS) customer-managed key to encrypt the data in S3 buckets
  • AWS Identity and Access Management (IAM) roles and policies with the necessary permissions

To create the necessary resources, complete the following steps:

  1. Choose Launch Stack:

Launch Button

  2. Enter cleanrooms-query-automation-blog for Stack name.
  3. Enter the membership ID from the AWS Clean Rooms collaboration you created in Part 1 of this series.
  4. Choose Next.
  5. Choose Next again.
  6. On the Review page, select I acknowledge that AWS CloudFormation might create IAM resources.
  7. Choose Create stack.

After you run the CloudFormation template and create the resources, you can find the following information on the stack Outputs tab on the AWS CloudFormation console:

  • AthenaWorkGroup – The Athena workgroup
  • EventBridgeRule – The EventBridge rule triggering the Step Functions state machine
  • GlueDatabase – The AWS Glue database
  • GlueTable – The AWS Glue table storing metadata for AWS Clean Rooms query results
  • S3Bucket – The S3 bucket where AWS Clean Rooms uploads query results
  • StepFunctionsStateMachine – The Step Functions state machine

Test the solution

The EventBridge rule named cleanrooms_query_execution_Stepfunctions_trigger is scheduled to trigger every hour. When this rule is triggered, it initiates the run of the CleanRoomsBlogStateMachine-XXXXXXX Step Functions state machine. Complete the following steps to test the end-to-end flow of this solution:

  1. On the Step Functions console, navigate to the state machine you created.
  2. On the state machine details page, locate the latest query run.

The details page lists the completed steps:

  • The state machine submits a query to AWS Clean Rooms using the startProtectedQuery API. The output of the API includes the query run ID and its status.
  • The state machine waits for 30 seconds before checking the status of the query run.
  • After 30 seconds, the state machine checks the query status using the getProtectedQuery API. When the status changes to SUCCESS, it proceeds to the next step to retrieve the AWS Glue table metadata information. The output of this step contains the S3 location to which the query run results are uploaded.
  • The state machine retrieves the metadata of the AWS Glue table named patientimmunization, which was created via the CloudFormation stack.
  • The state machine updates the S3 location (the location to which AWS Clean Rooms uploaded the results) in the metadata of the AWS Glue table.
  • After a successful update of the AWS Glue table metadata, the state machine is complete.
  3. On the Athena console, switch the workgroup to CustomWorkgroup.
  4. Run the following query:
SELECT * FROM "cleanrooms_patientdb"."patientimmunization" LIMIT 10;

Visualize the data with QuickSight

Now that you can query your data in Athena, you can use QuickSight to visualize the results. Let’s start by granting QuickSight access to the S3 bucket where your AWS Clean Rooms query results are stored.

Grant QuickSight access to Athena and your S3 bucket

First, grant QuickSight access to the S3 bucket:

  1. Sign in to the QuickSight console.
  2. Choose your user name, then choose Manage QuickSight.
  3. Choose Security and permissions.
  4. For QuickSight access to AWS services, choose Manage.
  5. For Amazon S3, choose Select S3 buckets, and choose the S3 bucket named cleanrooms-query-execution-results-XX-XXXX-XXXXXXXXXXXX (XXXXX represents the AWS Region and account number where the solution is deployed).
  6. Choose Save.

Create your datasets and publish visuals

Before you can analyze and visualize the data in QuickSight, you must create datasets for your Athena tables.

  1. On the QuickSight console, choose Datasets in the navigation pane.
  2. Choose New dataset.
  3. Select Athena.
  4. Enter a name for your dataset.
  5. Choose Create data source.
  6. Choose the AWS Glue database cleanrooms_patientdb and select the table PatientImmunization.
  7. Select Directly query your data.
  8. Choose Visualize.

  9. On the Analysis tab, choose the visual type of your choice and add visuals.

Clean up

Complete the following steps to clean up your resources when you no longer need this solution:

  1. Manually delete the S3 buckets and the data stored in the bucket.
  2. Delete the CloudFormation templates.
  3. Delete the QuickSight analysis.
  4. Delete the data source.

Conclusion

In this post, we demonstrated how to automate running AWS Clean Rooms queries using an API call from Step Functions. We also showed how to update the query results information on the existing AWS Glue table, query the information using Athena, and create visuals using QuickSight.

The automated workflow solution delivers real-time insights from AWS Clean Rooms collaborations to decision makers through automated checks for new outputs, processing, and Amazon QuickSight dashboard refreshes. This eliminates manual handling tasks, enabling faster data-driven decisions based on the latest analyses. Additionally, automation frees up staff resources to focus on more strategic initiatives rather than repetitive updates.

Contact the public sector team directly to learn more about how to set up this solution, or reach out to your AWS account team to engage on a proof of concept of this solution for your organization.

About AWS Clean Rooms

AWS Clean Rooms helps companies and their partners more easily and securely analyze and collaborate on their collective datasets—without sharing or copying one another’s underlying data. With AWS Clean Rooms, you can create a secure data clean room in minutes, and collaborate with any other company on the AWS Cloud to generate unique insights about advertising campaigns, investment decisions, and research and development.

The AWS Clean Rooms team is continually building new features to help you collaborate. Watch this video to learn more about privacy-enhanced collaboration with AWS Clean Rooms.

Check out more AWS Partners or contact an AWS Representative to learn how we can help accelerate your business.


About the Authors

Venkata Kampana is a Senior Solutions Architect in the AWS Health and Human Services team and is based in Sacramento, CA. In that role, he helps public sector customers achieve their mission objectives with well-architected solutions on AWS.

Jim Daniel is the Public Health lead at Amazon Web Services. Previously, he held positions with the United States Department of Health and Human Services for nearly a decade, including Director of Public Health Innovation and Public Health Coordinator. Before his government service, Jim served as the Chief Information Officer for the Massachusetts Department of Public Health.

Identify Java nested dependencies with Amazon Inspector SBOM Generator

Post Syndicated from Chi Tran original https://aws.amazon.com/blogs/security/identify-java-nested-dependencies-with-amazon-inspector-sbom-generator/

Amazon Inspector is an automated vulnerability management service that continually scans Amazon Web Services (AWS) workloads for software vulnerabilities and unintended network exposure. Amazon Inspector currently supports vulnerability reporting for Amazon Elastic Compute Cloud (Amazon EC2) instances, container images stored in Amazon Elastic Container Registry (Amazon ECR), and AWS Lambda.

Java archive files (JAR, WAR, and EAR) are widely used for packaging Java applications and libraries. These files can contain various dependencies that are required for the proper functioning of the application. In some cases, a JAR file might include other JAR files within its structure, leading to nested dependencies. To help maintain the security and stability of Java applications, you must identify and manage nested dependencies.

In this post, I will show you how to navigate the challenges of uncovering nested Java dependencies, guiding you through the process of analyzing JAR files and uncovering these dependencies. We will focus on the vulnerabilities that Amazon Inspector identifies using the Amazon Inspector SBOM Generator.

The challenge of uncovering nested Java dependencies

Nested dependencies in Java applications can be outdated or contain known vulnerabilities linked to Common Vulnerabilities and Exposures (CVEs). A crucial issue that customers face is the tendency to overlook nested dependencies during analysis and triage. This oversight can lead to the misclassification of vulnerabilities as false positives, posing a security risk.

This challenge arises from several factors:

  • Volume of vulnerabilities — When customers encounter a high volume of vulnerabilities, the sheer number can be overwhelming, making it challenging to dedicate sufficient time and resources to thoroughly analyze each one.
  • Lack of tools or insufficient tooling — There is often a gap in the available tools to effectively identify nested dependencies (for example, mvn dependency:tree, OWASP Dependency-Check). Without the right tools, customers can miss critical dependencies hidden deep within their applications.
  • Understanding the complexity — Understanding the intricate web of nested dependencies requires a specific skill set and knowledge base. Deficits in this area can hinder effective analysis and risk mitigation.

Overview of nested dependencies

Nested dependencies occur when a library or module that is required by your application relies on additional libraries or modules. This is a common scenario in modern software development because developers often use third-party libraries to build upon existing solutions and to benefit from the collective knowledge of the open-source community.

In the context of JAR files, nested dependencies can arise when a JAR file includes other JAR files as part of its structure. These nested files can have their own dependencies, which might further depend on other libraries, creating a chain of dependencies. Nested dependencies help to modularize code and promote code reuse, but they can introduce complexity and increase the potential for security vulnerabilities if they aren’t managed properly.
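
To see that chain for yourself, you can walk an archive with a short script. The following Python sketch (the file name is a placeholder) recursively lists entries inside a JAR that are themselves JAR files.

import io
import zipfile

def list_nested_jars(jar_bytes, prefix=""):
    # Open the archive from memory and print any entry that is itself a JAR
    with zipfile.ZipFile(io.BytesIO(jar_bytes)) as jar:
        for name in jar.namelist():
            if name.endswith(".jar"):
                print(prefix + name)
                # Recurse into the embedded JAR to reveal deeper nesting
                list_nested_jars(jar.read(name), prefix + name + " -> ")

with open("MyWebApp-0.0.1-SNAPSHOT.jar", "rb") as f:   # placeholder application archive
    list_nested_jars(f.read())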

Why it’s important to know what dependencies are consumed in a JAR file

Consider the following examples, which depict a typical file structure of a Java application to illustrate how nested dependencies are organized:

Example 1 — Log4J dependency

MyWebApp/
|-- mywebapp-1.0-SNAPSHOT.jar
|   |-- spring-boot-3.0.2.jar
|   |   |-- spring-boot-autoconfigure-3.0.2.jar
|   |   |   |-- ...
|   |   |   |   |-- log4j-to-slf4j.jar

This structure includes the following files and dependencies:

  • mywebapp-1.0-SNAPSHOT.jar is the main application JAR file.
  • Within mywebapp-1.0-SNAPSHOT.jar, there’s spring-boot-3.0.2.jar, which is a dependency of the main application.
  • Nested within spring-boot-3.0.2.jar, there’s spring-boot-autoconfigure-3.0.2.jar, a transitive dependency.
  • Within spring-boot-autoconfigure-3.0.2.jar, there’s log4j-to-slf4j.jar, which is our nested Log4J dependency.

This structure illustrates how a Java application might include nested dependencies, with Log4J nested within other libraries. The actual nesting and dependencies will vary based on the specific libraries and versions that you use in your project.

Example 2 — Jackson dependency

MyFinanceApp/
|-- myfinanceapp-2.5.jar
|   |-- jackson-databind-2.9.10.jar
|   |   |-- jackson-core-2.9.10.jar
|   |   |   |-- ...
|   |   |-- jackson-annotations-2.9.10.jar
|   |   |   |-- ...

This structure includes the following files and dependencies:

  • myfinanceapp-2.5.jar is the primary application JAR file.
  • Within myfinanceapp-2.5.jar, there is jackson-databind-2.9.10.jar, which is a library that the main application relies on for JSON processing.
  • Nested within jackson-databind-2.9.10.jar, there are other Jackson components such as jackson-core-2.9.10.jar and jackson-annotations-2.9.10.jar. These are dependencies that jackson-databind itself requires to function.

This structure is an example for Java applications that use Jackson for JSON operations. Because Jackson libraries are frequently updated to address various issues, including performance optimizations and security fixes, developers need to be aware of these nested dependencies to keep their applications up-to-date and secure. If you have detailed knowledge of where these components are nested within your application, it will be easier to maintain and upgrade them.

Example 3 — Hibernate dependency

MyERPSystem/
|-- myerpsystem-3.1.jar
|   |-- hibernate-core-5.4.18.Final.jar
|   |   |-- hibernate-validator-6.1.5.Final.jar
|   |   |   |-- ...
|   |   |-- hibernate-entitymanager-5.4.18.Final.jar
|   |   |   |-- ...

This structure includes the following files and dependencies:

  • myerpsystem-3.1.jar as the primary JAR file of the application.
  • Within myerpsystem-3.1.jar, hibernate-core-5.4.18.Final.jar serves as a dependency for object-relational mapping (ORM) capabilities.
  • Nested dependencies such as hibernate-validator-6.1.5.Final.jar and hibernate-entitymanager-5.4.18.Final.jar are crucial for the validation and entity management functionalities that Hibernate provides.

In instances where MyERPSystem encounters operational issues due to a mismatch between the Hibernate versions and another library (that is, a newer version of Spring expecting a different version of Hibernate), developers can use the detailed insights that Amazon Inspector SBOM Generator provides. This tool helps quickly pinpoint the exact versions of Hibernate and its nested dependencies, facilitating a faster resolution to compatibility problems.

Here are some reasons why it’s important to understand the dependencies that are consumed within a JAR file:

  • Security — Nested dependencies can introduce vulnerabilities if they are outdated or have known security issues. A prime example is the Log4J vulnerability discovered in late 2021 (CVE-2021-44228). This vulnerability was critical because Log4J is a widely used logging framework, and threat actors could have exploited the flaw remotely, leading to serious consequences. What exacerbated the issue was the fact that Log4J often existed as a nested dependency in various Java applications (see Example 1), making it difficult for organizations to identify and patch each instance.
  • Compliance — Many organizations must adhere to strict policies about third-party libraries for licensing, regulatory, or security reasons. Not knowing the dependencies, especially nested ones such as in the Log4J case, can lead to non-compliance with these policies.
  • Maintainability — It’s crucial that you stay informed about the dependencies within your project for timely updates or replacements. Consider the Jackson library (Example 2), which is often updated to introduce new features or to patch security vulnerabilities. Managing these updates can be complex, especially when the library is a nested dependency.
  • Troubleshooting — Identifying dependencies plays a critical role in resolving operational issues swiftly. An example of this is addressing compatibility issues between Hibernate and other Java libraries or frameworks within your application due to version mismatches (Example 3). Such problems often manifest as unexpected exceptions or degraded performance, so you need to have a precise understanding of the libraries involved.

These examples underscore that you need to have deep visibility into JAR file contents to help protect against immediate threats and help ensure long-term application health and compliance.

Existing tooling limitations

When analyzing Java applications for nested dependencies, one of the main challenges is that existing tools can’t efficiently narrow down the exact location of these dependencies. This issue is particularly evident with tools such as mvn dependency:tree, OWASP Dependency-Check, and similar dependency analysis solutions.

Although tools are available to analyze Java applications for nested dependencies, they often fall short in several key areas. The following points highlight common limitations of these tools:

  • Inadequate depth in dependency trees — Although other tools provide a hierarchical view of project dependencies, they often fail to delve deep enough to reveal nested dependencies, particularly those that are embedded within other JAR files as nested dependencies. Nested dependencies are repackaged within a library and aren’t immediately visible in the standard dependency tree.
  • Lack of specific location details — These tools typically don’t offer the granularity needed to pinpoint the exact location of a nested dependency within a JAR file. For large and complex Java applications, it may be challenging to identify and address specific dependencies, especially when they are deeply embedded.
  • Complexity in large projects — In projects with a vast and intricate network of dependencies, these tools can struggle to provide clear and actionable insights. The output can be complicated and difficult to navigate, leaving customers without a clear path to identifying critical dependencies.

Address tooling limitations with Amazon Inspector SBOM Generator

The Amazon Inspector SBOM Generator (Sbomgen) introduces a significant advancement in the identification of nested dependencies in Java applications. Although the concept of monitoring dependencies is well-established in software development, AWS has tailored this tool to enhance visibility into the complexities of software compositions. By generating a software bill of materials (SBOM) for a container image, Sbomgen provides a detailed inventory of the software installed on a system, including hidden nested dependencies that traditional tools can overlook. This capability enriches the existing toolkit, offering a more granular and actionable understanding of the dependency structure of your applications.

Sbomgen works by scanning for files that contain information about installed packages. Upon finding such files, it extracts essential data such as package names, versions, and other metadata. Then it transforms this metadata into a CycloneDX SBOM, providing a structured and detailed view of the dependencies.

For information about how to install Sbomgen, see Installing Amazon Inspector SBOM Generator (Sbomgen).

A key feature of Sbomgen is its ability to provide explicit paths to each dependency.

For example, given a compiled JAR application MyWebApp-0.0.1-SNAPSHOT.jar, users can run the following CLI command with Sbomgen:

./inspector-sbomgen localhost --path /path/to/MyWebApp-0.0.1-SNAPSHOT.jar --scanners java-jar

The output should look similar to the following:

{
  "bom-ref": "comp-11",
  "type": "library",
  "name": "org.apache.logging.log4j/log4j-to-slf4j",
  "version": "2.19.0",
  "hashes": [
    {
      "alg": "SHA-1",
      "content": "30f4812e43172ecca5041da2cb6b965cc4777c19"
    }
  ],
  "purl": "pkg:maven/org.apache.logging.log4j/[email protected]",
  "properties": [
...
    {
      "name": "amazon:inspector:sbom_generator:source_path",
      "value": "/tmp/MyWebApp-0.0.1-SNAPSHOT.jar/BOOT-INF/lib/spring-boot-3.0.2.jar/BOOT-INF/lib/spring-boot-autoconfigure-3.0.2.jar/BOOT-INF/lib/logback-classic-1.4.5.jar/BOOT-INF/lib/logback-core-1.4.5.jar/BOOT-INF/lib/log4j-to-slf4j-2.19.0.jar/META-INF/maven/org.apache.logging.log4j/log4j-to-slf4j/pom.properties"
    }
  ]
}

In this output, the amazon:inspector:sbom_generator:source_path property is particularly significant. It provides a clear and complete path to the location of the specific dependency (in this case, log4j-to-slf4j) within the application’s structure. This level of detail is crucial for several reasons:

  • Precise location identification — It helps you quickly and accurately identify the exact location of each dependency, which is especially useful for nested dependencies that are otherwise hard to locate.
  • Effective risk management — When you know the exact path of dependencies, you can more efficiently assess and address security risks associated with these dependencies.
  • Time and resource efficiency — It reduces the time and resources needed to manually trace and analyze dependencies, streamlining the vulnerability management process.
  • Enhanced visibility and transparency — It provides a clearer understanding of the application’s dependency structure, contributing to better overall management and maintenance.
  • Comprehensive package information — The detailed package information that Sbomgen provides, including name, version, hashes, and package URL, equips you with a thorough understanding of each dependency’s specifics, aiding in precise vulnerability tracking and software integrity verification.
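
To put those paths to work at scale, you can post-process the generated SBOM. The following Python sketch (the output file name is a placeholder) reads a CycloneDX JSON document produced by Sbomgen and prints each component together with its source path property.

import json

with open("sbom.json") as f:          # placeholder: the CycloneDX output saved from Sbomgen
    sbom = json.load(f)

for component in sbom.get("components", []):
    name = component.get("name", "unknown")
    version = component.get("version", "unknown")
    # The source_path property holds the full nested location inside the scanned artifact
    for prop in component.get("properties", []):
        if prop.get("name") == "amazon:inspector:sbom_generator:source_path":
            print(f"{name}@{version}: {prop['value']}")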

Mitigate vulnerable dependencies

After you identify the nested dependencies in your Java JAR files, you should verify whether these dependencies are outdated or vulnerable. Amazon Inspector can help you achieve this by doing the following:

  • Comparing the discovered dependencies against a database of known vulnerabilities.
  • Providing a list of potentially vulnerable dependencies, along with detailed information about the associated CVEs.
  • Offering recommendations on how to mitigate the risks, such as updating the dependencies to a newer, more secure version.

By integrating Amazon Inspector into your software development lifecycle, you can continuously monitor your Java applications for vulnerable nested dependencies and take the necessary steps to help ensure that your application remains secure and compliant.

Conclusion

To help secure your Java applications, you must manage nested dependencies. Amazon Inspector provides an automated and efficient way to discover and mitigate potentially vulnerable dependencies in JAR files. By using the capabilities of Amazon Inspector, you can help improve the security posture of your Java applications and help ensure that they adhere to best practices.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Chi Tran

Chi Tran

Chi is a Security Researcher who helps ensure that AWS services, applications, and websites are designed and implemented to the highest security standards. He’s an SME for Amazon Inspector and enthusiastically assists customers with advanced issues and use cases. Chi is passionate about information security — API security, penetration testing (he’s OSCP, OSCE, OSWE, GPEN certified), application security, and cloud security.

How to enforce creation of roles in a specific path: Use IAM role naming in hierarchy models

Post Syndicated from Varun Sharma original https://aws.amazon.com/blogs/security/how-to-enforce-creation-of-roles-in-a-specific-path-use-iam-role-naming-in-hierarchy-models/

An AWS Identity and Access Management (IAM) role is an IAM identity that you create in your AWS account that has specific permissions. An IAM role is similar to an IAM user because it’s an AWS identity with permission policies that determine what the identity can and cannot do on AWS. However, as outlined in security best practices in IAM, AWS recommends that you use IAM roles instead of IAM users. An IAM user is uniquely associated with one person, while a role is intended to be assumable by anyone who needs it. An IAM role doesn’t have standard long-term credentials such as a password or access keys associated with it. Instead, when you assume a role, it provides you with temporary security credentials for your role session that are only valid for a certain period of time.

This blog post explores the effective implementation of security controls within IAM roles, placing a specific focus on the IAM role’s path feature. By organizing IAM roles hierarchically using paths, you can address key challenges and achieve practical solutions to enhance IAM role management.

Benefits of using IAM paths

A fundamental benefit of using paths is the establishment of a clear and organized organizational structure. By using paths, you can handle diverse use cases while creating a well-defined framework for organizing roles on AWS. This organizational clarity can help you navigate complex IAM setups and establish a cohesive structure that’s aligned with your organizational needs.

Furthermore, by enforcing a specific structure, you can gain precise control over the scope of permissions assigned to roles, helping to reduce the risk of accidental assignment of overly permissive policies. By helping to prevent inadvertent policy misconfigurations and to align permissions with the planned organizational structure, this proactive approach improves security. This approach is highly effective when you consistently apply established naming conventions to paths, role names, and policies. Enforcing a uniform approach to role naming enhances the standardization and efficiency of IAM role management. This practice fosters smooth collaboration and reduces the risk of naming conflicts.

Path example

In IAM, a role path is a way to organize and group IAM roles within your AWS account. You specify the role path as part of the role’s Amazon Resource Name (ARN).

As an example, imagine that you have a group of IAM roles related to development teams, and you want to organize them under a path. You might structure it like this:

Role name: DevApp1admin
Role path: /D1/app1/admin/
Full ARN: arn:aws:iam::123456789012:role/D1/app1/admin/DevApp1admin

Role name: DevApp2admin
Role path: /D2/app2/admin/
Full ARN: arn:aws:iam::123456789012:role/D2/app2/admin/DevApp2admin

In this example, the IAM roles DevApp1admin and DevApp2admin are organized under two different development team paths: D1/app1/admin and D2/app2/admin, respectively. The role path provides a way to group roles logically, making it simpler to manage and understand their purpose within the context of your organization.

Solution overview

Figure 1: Sample architecture

Figure 1: Sample architecture

The sample architecture in Figure 1 shows how you can separate and categorize the enterprise roles and development team roles into a hierarchy model by using a path in an IAM role. Using this hierarchy model, you can enable several security controls at the level of the service control policy (SCP), IAM policy, permissions boundary, or the pipeline. I recommend that you avoid incorporating business unit names in paths because they could change over time.

Here is what the IAM role path looks like as an ARN:

arn:aws:iam::123456789012:role/EnT/iam/adm/IAMAdmin

In this example, in the resource name, /EnT/iam/adm/ is the role path, and IAMAdmin is the role name.

You can now use the role path as part of a policy, such as the following:

arn:aws:iam::123456789012:role/EnT/iam/adm/*

In this example, in the resource name, /EnT/iam/adm/ is the role path, and * indicates any IAM role inside this path.

Walkthrough of examples for preventative controls

Now let’s walk through some example use cases and SCPs for a preventative control that you can use based on the path of an IAM role.

PassRole preventative control example

The following SCP denies passing a role for enterprise roles, except for roles that are part of the IAM admin hierarchy within the overall enterprise hierarchy.

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "DenyEnTPassRole",
			"Effect": "Deny",
			"Action": "iam:PassRole",
			"Resource": "arn:aws:iam::*:role/EnT/*",
			"Condition": {
				"ArnNotLike": {
					"aws:PrincipalArn": "arn:aws:iam::*:role/EnT/fed/iam/*"
				}
			}
		}
	]
}

With just a couple of statements in the SCP, this preventative control helps protect your high-privilege enterprise roles, regardless of a role’s name or current status.

This example uses the following paths:

  • /EnT/ — enterprise roles (roles owned by the central teams, such as cloud center of excellence, central security, and networking teams)
  • /fed/ — federated roles, which have interactive access
  • /iam/ — roles that are allowed to perform IAM actions, such as CreateRole, AttachPolicy, or DeleteRole

IAM actions preventative control example

The following SCP restricts IAM actions, including CreateRole, DeleteRole, AttachRolePolicy, and DetachRolePolicy, on the enterprise path.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyIAMActionsonEnTRoles",
            "Effect": "Deny",
            "Action": [
                "iam:CreateRole",
                "iam:DeleteRole",
                "iam:DetachRolePolicy",
                "iam:AttachRolePolicy"
            ],
            "Resource": "arn:aws:iam::*:role/EnT/*",
            "Condition": {
                "ArnNotLike": {
                    "aws:PrincipalArn": "arn:aws:iam::*:role/EnT/fed/iam/*"
                }
            }
        }
    ]
}

This preventative control denies any IAM role outside of the IAM admin hierarchy from performing the actions CreateRole, DeleteRole, DetachRolePolicy, and AttachRolePolicy on roles in the enterprise hierarchy. Every IAM role is denied those API actions except roles under the path arn:aws:iam::*:role/EnT/fed/iam/*.

The example uses the following paths:

  • /EnT/ — enterprise roles (roles owned by the central teams, such as cloud center of excellence, central security, or network automation teams)
  • /fed/ — federated roles, which have interactive access
  • /iam/ — roles that are allowed to perform IAM actions (in this case, CreateRole, DeleteRole, DetachRolePolicy, and AttachRolePolicy)

IAM policies preventative control example

The following SCP denies attaching certain high-privilege AWS managed policies such as AdministratorAccess outside of certain IAM admin roles. This is especially important in an environment where business units have self-service capabilities.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "RolePolicyAttachment",
            "Effect": "Deny",
            "Action": "iam:AttachRolePolicy",
            "Resource": "arn:aws:iam::*:role/EnT/fed/iam/*",
            "Condition": {
                "ArnNotLike": {
                    "iam:PolicyARN": "arn:aws:iam::aws:policy/AdministratorAccess"
                }
            }
        }
    ]
}

AssumeRole preventative control example

The following SCP doesn’t allow non-production roles to assume a role in production accounts. Make sure to replace <Your production OU ID> and <your org ID> with your own information.

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "DenyAssumeRole",
			"Effect": "Deny",
			"Action": "sts:AssumeRole",
			"Resource": "*",
			"Condition": {
				"StringLike": {
					"aws:PrincipalArn": "arn:aws:iam::*:role/np/*"
				},
				"ForAnyValue:StringLike": {
					"aws:ResourceOrgPaths": "<your org ID>/r-xxxx/<Your production OU ID>/*"
				}
			}
		}
	]
}

This example uses the /np/ path, which specifies non-production roles. The SCP prevents non-production IAM roles from assuming a role in the production organizational unit (OU) (in our example, this is represented by <your org ID>/r-xxxx/<Your production OU ID>/*). Depending on the structure of your organization, the ResourceOrgPaths will have one of the following formats:

  • “o-a1b2c3d4e5/*”
  • “o-a1b2c3d4e5/r-ab12/ou-ab12-11111111/*”
  • “o-a1b2c3d4e5/r-ab12/ou-ab12-11111111/ou-ab12-22222222/”

Walkthrough of examples for monitoring IAM roles (detective control)

Now let’s walk through two examples of detective controls.

AssumeRole in CloudTrail Lake

The following is an example of a detective control to monitor IAM roles in AWS CloudTrail Lake.

SELECT
    userIdentity.arn as "Username", eventTime, eventSource, eventName, sourceIPAddress, errorCode, errorMessage
FROM
    <Event data store ID>
WHERE
    userIdentity.arn IS NOT NULL
    AND eventName = 'AssumeRole'
    AND userIdentity.arn LIKE '%/np/%'
    AND errorCode = 'AccessDenied'
    AND eventTime > '2023-07-01 14:00:00'
    AND eventTime < '2023-11-08 18:00:00';

This query lists AssumeRole events for non-production roles in the organization that resulted in AccessDenied errors. The output is stored in an Amazon Simple Storage Service (Amazon S3) bucket by CloudTrail Lake, from which the CSV file can be downloaded. The following shows some example output:

Username,eventTime,eventSource,eventName,sourceIPAddress,errorCode,errorMessage
arn:aws:sts::123456789012:assumed-role/np/test,2023-12-09 10:35:45.000,iam.amazonaws.com,AssumeRole,11.11.113.113,AccessDenied,User: arn:aws:sts::123456789012:assumed-role/np/test is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::123456789012:role/hello because no identity-based policy allows the sts:AssumeRole action

You can modify the query to audit production roles as well.
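
If you want to run this audit on a schedule rather than from the console, the CloudTrail Lake query APIs can help. The following Python sketch submits the query with boto3 and prints the result rows; the event data store ID remains a placeholder, and the time filter from the query above is omitted for brevity.

import time
import boto3

cloudtrail = boto3.client("cloudtrail")

# The SQL mirrors the detective-control query above; replace <Event data store ID> with your own
query_statement = """
SELECT userIdentity.arn, eventTime, eventName, errorCode
FROM <Event data store ID>
WHERE eventName = 'AssumeRole'
  AND userIdentity.arn LIKE '%/np/%'
  AND errorCode = 'AccessDenied'
"""

query_id = cloudtrail.start_query(QueryStatement=query_statement)["QueryId"]

# Wait for the query to finish before fetching rows
status = cloudtrail.describe_query(QueryId=query_id)["QueryStatus"]
while status in ("QUEUED", "RUNNING"):
    time.sleep(5)
    status = cloudtrail.describe_query(QueryId=query_id)["QueryStatus"]

if status == "FINISHED":
    for row in cloudtrail.get_query_results(QueryId=query_id)["QueryResultRows"]:
        print(row)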

CreateRole in CloudTrail Lake

Another example of a CloudTrail Lake query for a detective control is as follows:

SELECT
    userIdentity.arn as "Username", eventTime, eventSource, eventName, sourceIPAddress, errorCode, errorMessage
FROM
    <Event data store ID>
WHERE
    userIdentity.arn IS NOT NULL
    AND eventName = 'CreateRole'
    AND userIdentity.arn LIKE '%/EnT/fed/iam/%'
    AND eventTime > '2023-07-01 14:00:00'
    AND eventTime < '2023-11-08 18:00:00';

This query lists CreateRole events for roles in the /EnT/fed/iam/ hierarchy. The following are some example outputs:

Username,eventTime,eventSource,eventName,sourceIPAddress,errorCode,errorMessage

arn:aws:sts::123456789012:assumed-role/EnT/fed/iam/security/test,2023-12-09 16:31:11.000,iam.amazonaws.com,CreateRole,10.10.10.10,AccessDenied,User: arn:aws:sts::123456789012:assumed-role/EnT/fed/iam/security/test is not authorized to perform: iam:CreateRole on resource: arn:aws:iam::123456789012:role/EnT/fed/iam/security because no identity-based policy allows the iam:CreateRole action

arn:aws:sts::123456789012:assumed-role/EnT/fed/iam/security/test,2023-12-09 16:33:10.000,iam.amazonaws.com,CreateRole,10.10.10.10,AccessDenied,User: arn:aws:sts::123456789012:assumed-role/EnT/fed/iam/security/test is not authorized to perform: iam:CreateRole on resource: arn:aws:iam::123456789012:role/EnT/fed/iam/security because no identity-based policy allows the iam:CreateRole action

Because these roles can create additional enterprise roles, you should audit roles created in this hierarchy.

Important considerations

When you implement specific paths for IAM roles, make sure to consider the following:

  • The path of an IAM role is part of the ARN. After you define the ARN, you can’t change it later. Therefore, just like the name of the role, consider what the path should be during the early discussions of design.
  • IAM roles can’t have the same name, even on different paths.
  • When you switch roles through the console, you need to include the path because it’s part of the role’s ARN.
  • The path of an IAM role can’t exceed 512 characters. For more information, see IAM and AWS STS quotas.
  • The role name can’t exceed 64 characters. If you intend to use a role with the Switch Role feature in the AWS Management Console, then the combined path and role name can’t exceed 64 characters.
  • When you create a role through the console, you can’t set an IAM role path. To set a path for the role, you need to use automation, such as AWS Command Line Interface (AWS CLI) commands or SDKs. For example, you might use an AWS CloudFormation template or a script that interacts with AWS APIs to create the role with the desired path.
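
As a minimal example of that automation, the following boto3 sketch creates a role under the /EnT/iam/adm/ path; the role name, description, and trust policy are placeholders you would replace with your own.

import json
import boto3

iam = boto3.client("iam")

# Placeholder trust policy; scope it to the principals that should assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "ec2.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

response = iam.create_role(
    RoleName="IAMAdmin",
    Path="/EnT/iam/adm/",  # the path becomes part of the role ARN
    AssumeRolePolicyDocument=json.dumps(trust_policy),
    Description="Example role created under the enterprise IAM admin path",
)
print(response["Role"]["Arn"])  # arn:aws:iam::<account-id>:role/EnT/iam/adm/IAMAdmin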

Conclusion

By adopting the path strategy, you can structure IAM roles within a hierarchical model, facilitating the implementation of security controls on a scalable level. You can make these controls effective for IAM roles by applying them to a path rather than specific roles, which sets this approach apart.

This strategy can help you elevate your overall security posture within IAM, offering a forward-looking solution for enterprises. By establishing a scalable IAM hierarchy, you can help your organization navigate dynamic changes through a robust identity management structure. A well-crafted hierarchy reduces operational overhead by providing a versatile framework that makes it simpler to add or modify roles and policies. This scalability can help streamline the administration of IAM and help your organization manage access control in evolving environments.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Security, Identity, & Compliance re:Post or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Varun Sharma

Varun Sharma

Varun is an AWS Cloud Security Engineer who wears his security cape proudly. With a knack for unravelling the mysteries of Amazon Cognito and IAM, Varun is a go-to subject matter expert for these services. When he’s not busy securing the cloud, you’ll find him in the world of security penetration testing. And when the pixels are at rest, Varun switches gears to capture the beauty of nature through the lens of his camera.

Use multiple bookmark keys in AWS Glue JDBC jobs

Post Syndicated from Durga Prasad original https://aws.amazon.com/blogs/big-data/use-multiple-bookmark-keys-in-aws-glue-jdbc-jobs/

AWS Glue is a serverless data integration service that you can use to catalog data and prepare it for analytics. With AWS Glue, you can discover your data, develop scripts to transform sources into targets, and schedule and run extract, transform, and load (ETL) jobs in a serverless environment. AWS Glue jobs are responsible for running the data processing logic.

One important feature of AWS Glue jobs is the ability to use bookmark keys to process data incrementally. When an AWS Glue job is run, it reads data from a data source and processes it. One or more columns from the source table can be specified as bookmark keys. The column should have sequentially increasing or decreasing values without gaps. These values are used to mark the last processed record in a batch. The next run of the job resumes from that point. This allows you to process large amounts of data incrementally. Without job bookmark keys, AWS Glue jobs would have to reprocess all the data during every run. This can be time-consuming and costly. By using bookmark keys, AWS Glue jobs can resume processing from where they left off, saving time and reducing costs.

This post explains how to use multiple columns as job bookmark keys in an AWS Glue job with a JDBC connection to the source data store. It also demonstrates how to parameterize the bookmark key columns and table names in the AWS Glue job connection options.

This post is aimed at architects and data engineers who design and build ETL pipelines on AWS. You are expected to have a basic understanding of the AWS Management Console, AWS Glue, Amazon Relational Database Service (Amazon RDS), and Amazon CloudWatch Logs.

Solution overview

To implement this solution, we complete the following steps:

  1. Create an Amazon RDS for PostgreSQL instance.
  2. Create two tables and insert sample data.
  3. Create and run an AWS Glue job to extract data from the RDS for PostgreSQL DB instance using multiple job bookmark keys.
  4. Create and run a parameterized AWS Glue job to extract data from different tables with separate bookmark keys.

The following diagram illustrates the components of this solution.

Deploy the solution

For this solution, we provide an AWS CloudFormation template that sets up the services included in the architecture, to enable repeatable deployments. This template creates the following resources:

  • An RDS for PostgreSQL instance
  • An Amazon Simple Storage Service (Amazon S3) bucket to store the data extracted from the RDS for PostgreSQL instance
  • An AWS Identity and Access Management (IAM) role for AWS Glue
  • Two AWS Glue jobs with job bookmarks enabled to incrementally extract data from the RDS for PostgreSQL instance

To deploy the solution, complete the following steps:

  1. Choose Launch Stack to launch the CloudFormation stack:
  2. Enter a stack name.
  3. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  4. Choose Create stack.
  5. Wait until the creation of the stack is complete, as shown on the AWS CloudFormation console.
  6. When the stack is complete, copy the AWS Glue scripts to the S3 bucket job-bookmark-keys-demo-<accountid>.
  7. Open AWS CloudShell.
  8. Run the following commands and replace <accountid> with your AWS account ID:
aws s3 cp s3://aws-blogs-artifacts-public/artifacts/BDB-2907/glue/scenario_1_job.py s3://job-bookmark-keys-demo-<accountid>/scenario_1_job.py
aws s3 cp s3://aws-blogs-artifacts-public/artifacts/BDB-2907/glue/scenario_2_job.py s3://job-bookmark-keys-demo-<accountid>/scenario_2_job.py

Add sample data and run AWS Glue jobs

In this section, we connect to the RDS for PostgreSQL instance via AWS Lambda and create two tables. We also insert sample data into both the tables.

  1. On the Lambda console, choose Functions in the navigation pane.
  2. Choose the function LambdaRDSDDLExecute.
  3. Choose Test and choose Invoke for the Lambda function to insert the data.


The two tables product and address will be created with sample data, as shown in the following screenshot.

Run the multiple_job_bookmark_keys AWS Glue job

We run the multiple_job_bookmark_keys AWS Glue job twice to extract data from the product table of the RDS for PostgreSQL instance. In the first run, all the existing records will be extracted. Then we insert new records and run the job again. The job should extract only the newly inserted records in the second run.

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Choose the job multiple_job_bookmark_keys.
  3. Choose Run to run the job and choose the Runs tab to monitor the job progress.
  4. Choose the Output logs hyperlink under CloudWatch logs after the job is complete.
  5. Choose the log stream in the next window to see the output logs printed.

    The AWS Glue job extracted all records from the source table product. It keeps track of the last combination of values in the columns product_id and version. Next, we run another Lambda function to insert a new record. The product_id 45 already exists, but the inserted record will have a new version of 2, making the combination sequentially increasing.
  6. Run the LambdaRDSDDLExecute_incremental Lambda function to insert the new record in the product table.
  7. Run the AWS Glue job multiple_job_bookmark_keys again after you insert the record and wait for it to succeed.
  8. Choose the Output logs hyperlink under CloudWatch logs.
  9. Choose the log stream in the next window to see only the newly inserted record printed.

The job extracts only those records that have a combination greater than the previously extracted records.
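
Inside the job script, this behavior comes from the bookmark options passed when reading from the JDBC source. The following PySpark sketch shows the core of such a job; the connection name and S3 output path are assumptions, and the provided scenario_1_job.py may differ in its details.

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read from the JDBC source with two bookmark key columns; AWS Glue tracks the
# highest (product_id, version) combination processed so far for this context.
product_dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="postgresql",
    connection_options={
        "useConnectionProperties": "true",
        "connectionName": "rds-postgresql-connection",   # assumed AWS Glue connection name
        "dbtable": "product",
        "jobBookmarkKeys": ["product_id", "version"],
        "jobBookmarkKeysSortOrder": "asc",
    },
    transformation_ctx="read_product",                    # bookmark state is keyed to this context
)

glue_context.write_dynamic_frame.from_options(
    frame=product_dyf,
    connection_type="s3",
    connection_options={"path": "s3://job-bookmark-keys-demo-<accountid>/output/product/"},
    format="json",
    transformation_ctx="write_product",
)

job.commit()  # commits the bookmark so the next run resumes after the last combination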

Run the parameterised_job_bookmark_keys AWS Glue job

We now run the parameterized AWS Glue job that takes the table name and bookmark key column as parameters. We run this job to extract data from different tables maintaining separate bookmarks.

The first run will be for the address table with bookmarkkey as address_id. These values are already populated in the job parameters.

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Choose the job parameterised_job_bookmark_keys.
  3. Choose Run to run the job and choose the Runs tab to monitor the job progress.
  4. Choose the Output logs hyperlink under CloudWatch logs after the job is complete.
  5. Choose the log stream in the next window to see all records from the address table printed.
  6. On the Actions menu, choose Run with parameters.
  7. Expand the Job parameters section.
  8. Change the job parameter values as follows:
    • Key --bookmarkkey with value product_id
    • Key --table_name with value product
    • The S3 bucket name is unchanged (job-bookmark-keys-demo-<accountnumber>)
  9. Choose Run job to run the job and choose the Runs tab to monitor the job progress.
  10. Choose the Output logs hyperlink under CloudWatch logs after the job is complete.
  11. Choose the log stream to see all the records from the product table printed.

The job maintains separate bookmarks for each of the tables when extracting the data from the source data store. This is achieved by adding the table name to the job name and transformation contexts in the AWS Glue job script.
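
A sketch of that technique follows; the parameter names match the job parameters shown above, the connection name is an assumption, and the provided scenario_2_job.py may differ in its details.

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Resolve the job parameters passed at run time (--table_name and --bookmarkkey)
args = getResolvedOptions(sys.argv, ["JOB_NAME", "table_name", "bookmarkkey"])
table_name = args["table_name"]
bookmark_key = args["bookmarkkey"]

glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
# Folding the table name into the job name and transformation context keeps a
# separate bookmark per table, as described above.
job.init(f"{args['JOB_NAME']}_{table_name}", args)

dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="postgresql",
    connection_options={
        "useConnectionProperties": "true",
        "connectionName": "rds-postgresql-connection",   # assumed AWS Glue connection name
        "dbtable": table_name,
        "jobBookmarkKeys": [bookmark_key],
        "jobBookmarkKeysSortOrder": "asc",
    },
    transformation_ctx=f"read_{table_name}",
)
dyf.toDF().show()   # prints the newly extracted records to the job's CloudWatch logs

job.commit()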

Clean up

To avoid incurring future charges, complete the following steps:

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Select the bucket with job-bookmark-keys in its name.
  3. Choose Empty to delete all the files and folders in it.
  4. On the CloudFormation console, choose Stacks in the navigation pane.
  5. Select the stack you created to deploy the solution and choose Delete.

Conclusion

This post demonstrated passing more than one column of a table as jobBookmarkKeys in a JDBC connection to an AWS Glue job. It also explained how you can use a parameterized AWS Glue job to extract data from multiple tables while maintaining their respective bookmarks. As a next step, you can test the incremental data extract by changing data in the source tables.


About the Authors

Durga Prasad is a Sr Lead Consultant enabling customers build their Data Analytics solutions on AWS. He is a coffee lover and enjoys playing badminton.

Murali Reddy is a Lead Consultant at Amazon Web Services (AWS), helping customers build and implement data analytics solutions. When he’s not working, Murali is an avid bike rider and loves exploring new places.

Combine transactional, streaming, and third-party data on Amazon Redshift for financial services

Post Syndicated from Satesh Sonti original https://aws.amazon.com/blogs/big-data/combine-transactional-streaming-and-third-party-data-on-amazon-redshift-for-financial-services/

Financial services customers are using data from different sources that originate at different frequencies, which include real-time, batch, and archived datasets. Additionally, they need streaming architectures to handle growing trade volumes, market volatility, and regulatory demands. The following are some of the key business use cases that highlight this need:

  • Trade reporting – Since the global financial crisis of 2007–2008, regulators have increased their demands and scrutiny on regulatory reporting. Regulators have placed an increased focus on both protecting the consumer through transaction reporting (typically T+1, meaning 1 business day after the trade date) and increasing transparency into markets via near-real-time trade reporting requirements.
  • Risk management – As capital markets become more complex and regulators launch new risk frameworks, such as Fundamental Review of the Trading Book (FRTB) and Basel III, financial institutions are looking to increase the frequency of calculations for overall market risk, liquidity risk, counter-party risk, and other risk measurements, and want to get as close to real-time calculations as possible.
  • Trade quality and optimization – In order to monitor and optimize trade quality, you need to continually evaluate market characteristics such as volume, direction, market depth, fill rate, and other benchmarks related to the completion of trades. Trade quality is not only related to broker performance, but is also a requirement from regulators, starting with MiFID II.

The challenge is to come up with a solution that can handle these disparate sources, varied frequencies, and low-latency consumption requirements. The solution should be scalable, cost-efficient, and straightforward to adopt and operate. Amazon Redshift features like streaming ingestion, Amazon Aurora zero-ETL integration, and data sharing with AWS Data Exchange enable near-real-time processing for trade reporting, risk management, and trade optimization.

In this post, we provide a solution architecture that describes how you can process data from three different types of sources—streaming, transactional, and third-party reference data—and aggregate them in Amazon Redshift for business intelligence (BI) reporting.

Solution overview

This solution architecture is created prioritizing a low-code/no-code approach with the following guiding principles:

  • Ease of use – It should be less complex to implement and operate with intuitive user interfaces
  • Scalable – You should be able to seamlessly increase and decrease capacity on demand
  • Native integration – Components should integrate without additional connectors or software
  • Cost-efficient – It should deliver balanced price/performance
  • Low maintenance – It should require less management and operational overhead

The following diagram illustrates the solution architecture and how these guiding principles were applied to the ingestion, aggregation, and reporting components.

Deploy the solution

You can use the following AWS CloudFormation template to deploy the solution.

Launch Cloudformation Stack

This stack creates the following resources and necessary permissions to integrate the services:

Ingestion

To ingest data, you use Amazon Redshift Streaming Ingestion to load streaming data from the Kinesis data stream. For transactional data, you use the Redshift zero-ETL integration with Amazon Aurora MySQL. For third-party reference data, you take advantage of AWS Data Exchange data shares. These capabilities allow you to quickly build scalable data pipelines because you can increase the capacity of Kinesis Data Streams shards, compute for zero-ETL sources and targets, and Redshift compute for data shares when your data grows. Redshift streaming ingestion and zero-ETL integration are low-code/no-code solutions that you can build with simple SQLs without investing significant time and money into developing complex custom code.

For the data used to create this solution, we partnered with FactSet, a leading financial data, analytics, and open technology provider. FactSet has several datasets available in the AWS Data Exchange marketplace, which we used for reference data. We also used FactSet’s market data solutions for historical and streaming market quotes and trades.

Processing

Data is processed in Amazon Redshift adhering to an extract, load, and transform (ELT) methodology. With virtually unlimited scale and workload isolation, ELT is more suited for cloud data warehouse solutions.

You use Redshift streaming ingestion for real-time ingestion of streaming quotes (bid/ask) from the Kinesis data stream directly into a streaming materialized view and process the data in the next step using PartiQL for parsing the data stream inputs. Note that streaming materialized views differ from regular materialized views in terms of how auto refresh works and the data management SQL commands used. Refer to Streaming ingestion considerations for details.

You use the zero-ETL Aurora integration for ingesting transactional data (trades) from OLTP sources. Refer to Working with zero-ETL integrations for currently supported sources. You can combine data from all these sources using views, and use stored procedures to implement business transformation rules like calculating weighted averages across sectors and exchanges.

Historical trade and quote data volumes are huge and often not queried frequently. You can use Amazon Redshift Spectrum to access this data in place without loading it into Amazon Redshift. You create external tables pointing to data in Amazon Simple Storage Service (Amazon S3) and query similarly to how you query any other local table in Amazon Redshift. Multiple Redshift data warehouses can concurrently query the same datasets in Amazon S3 without the need to make copies of the data for each data warehouse. This feature simplifies accessing external data without writing complex ETL processes and enhances the ease of use of the overall solution.

Let’s review a few sample queries used for analyzing quotes and trades. We use the following tables in the sample queries:

  • dt_hist_quotes – Historical quotes data containing bid price and volume, ask price and volume, and exchanges and sectors. You should use relevant datasets in your organization that contain these data attributes.
  • dt_hist_trades – Historical trades data containing traded price, volume, sector, and exchange details. You should use relevant datasets in your organization that contain these data attributes.
  • factset_sector_map – Mapping between sectors and exchanges. You can obtain this from the FactSet Fundamentals ADX dataset.

Sample query for analyzing historical quotes

You can use the following query to find weighted average spreads on quotes:

select
    date_dt::date,
    case
        when exchange_name like 'Cboe%' then 'CBOE'
        when exchange_name like 'NYSE%' then 'NYSE'
        when exchange_name like 'New York Stock Exchange' then 'NYSE'
        when exchange_name like 'Nasdaq%' then 'NASDAQ'
    end as parent_exchange_name,
    sector_name,
    sum(spread * weight) / sum(weight)::decimal(30,5) as weighted_average_spread
from
    (
        select
            date_dt,
            exchange_name,
            factset_sector_desc as sector_name,
            ((bid_price * bid_volume) + (ask_price * ask_volume)) as weight,
            ((ask_price - bid_price) / ask_price) as spread
        from dt_hist_quotes a
        join fds_adx_fundamentals_db.ref_v2.factset_sector_map b
            on (a.sector_code = b.factset_sector_code)
        where ask_price <> 0 and bid_price <> 0
    ) q
group by 1, 2, 3

Sample query for analyzing historical trades

You can use the following query to find the dollar volume of trades by detailed exchange, by sector, and by major exchange (NYSE and Nasdaq):

select
    cast(date_dt as date) as date_dt,
    case
        when exchange_name like 'Cboe%' then 'CBOE'
        when exchange_name like 'NYSE%' then 'NYSE'
        when exchange_name like 'New York Stock Exchange' then 'NYSE'
        when exchange_name like 'Nasdaq%' then 'NASDAQ'
    end as parent_exchange_name,
    factset_sector_desc as sector_name,
    sum((price * volume)::decimal(30,4)) as total_transaction_amt
from dt_hist_trades a
join fds_adx_fundamentals_db.ref_v2.factset_sector_map b
    on (a.sector_code = b.factset_sector_code)
group by 1, 2, 3

Reporting

You can use Amazon QuickSight and Amazon Managed Grafana for BI and real-time reporting, respectively. These services natively integrate with Amazon Redshift without the need to use additional connectors or software in between.

You can run a direct query from QuickSight for BI reporting and dashboards. With QuickSight, you can also store data locally in the SPICE cache with auto refresh for low latency. Refer to Authorizing connections from Amazon QuickSight to Amazon Redshift clusters for comprehensive details on how to integrate QuickSight with Amazon Redshift.

You can use Amazon Managed Grafana for near-real-time trade dashboards that are refreshed every few seconds. The real-time dashboards for monitoring the trade ingestion latencies are created using Grafana and the data is sourced from system views in Amazon Redshift. Refer to Using the Amazon Redshift data source to learn about how to configure Amazon Redshift as a data source for Grafana.

The users who interact with regulatory reporting systems include analysts, risk managers, operators, and other personas that support business and technology operations. Apart from generating regulatory reports, these teams require visibility into the health of the reporting systems.

Historical quotes analysis

In this section, we explore some examples of historical quotes analysis from the Amazon QuickSight dashboard.

Weighted average spread by sectors

The following chart shows the daily aggregation by sector of the weighted average bid-ask spreads of all the individual trades on NASDAQ and NYSE for 3 months. To calculate the average daily spread, each spread is weighted by the sum of the bid and the ask dollar volume. The query to generate this chart processes 103 billion data points in total, joins each trade with the sector reference table, and runs in less than 10 seconds.

Weighted average spread by exchanges

The following chart shows the daily aggregation of the weighted average bid-ask spreads of all the individual trades on NASDAQ and NYSE for 3 months. The calculation methodology and query performance metrics are similar to those of the preceding chart.

Historical trades analysis

In this section, we explore some examples of historical trades analysis from the Amazon QuickSight dashboard.

Trade volumes by sector

The following chart shows the daily aggregation by sector of all the individual trades on NASDAQ and NYSE for 3 months. The query to generate this chart processes 3.6 billion trades in total, joins each trade with the sector reference table, and runs in under 5 seconds.

Trade volumes for major exchanges

The following chart shows the daily aggregation by exchange group of all the individual trades for 3 months. The query to generate this chart has similar performance metrics as the preceding chart.

Real-time dashboards

Monitoring and observability are important requirements for any critical business application such as trade reporting, risk management, and trade management systems. Apart from system-level metrics, it’s also important to monitor key performance indicators in real time so that operators can be alerted and respond as soon as possible to business-impacting events. For this demonstration, we have built dashboards in Grafana that monitor the delay of quote and trade data from the Kinesis data stream and Aurora, respectively.

The quote ingestion delay dashboard shows the amount of time it takes for each quote record to be ingested from the data stream and be available for querying in Amazon Redshift.
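
As one possible approach, if the streaming materialized view retains the Kinesis metadata columns (as in the earlier hypothetical mv_quotes sketch), a query along the following lines can approximate that delay per refresh interval; the post’s actual dashboards are built on Amazon Redshift system views, so treat this only as an illustrative alternative:

-- Approximate quote ingestion latency: stream arrival time vs. materialized view refresh time
SELECT DATE_TRUNC('minute', refresh_time) AS refresh_minute,
       AVG(DATEDIFF(millisecond, approximate_arrival_timestamp, refresh_time)) AS avg_latency_ms,
       MAX(DATEDIFF(millisecond, approximate_arrival_timestamp, refresh_time)) AS max_latency_ms
FROM mv_quotes
WHERE refresh_time > DATEADD(minute, -15, GETDATE())
GROUP BY 1
ORDER BY 1;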

The trade ingestion delay dashboard shows the amount of time it takes for a transaction in Aurora to become available in Amazon Redshift for querying.

Clean up

To clean up your resources, delete the stack you deployed using AWS CloudFormation. For instructions, refer to Deleting a stack on the AWS CloudFormation console.

Conclusion

Increasing volumes of trading activity, more complex risk management, and enhanced regulatory requirements are leading capital markets firms to embrace real-time and near-real-time data processing, even in mid- and back-office platforms where end of day and overnight processing was the standard. In this post, we demonstrated how you can use Amazon Redshift capabilities for ease of use, low maintenance, and cost-efficiency. We also discussed cross-service integrations to ingest streaming market data, process updates from OLTP databases, and use third-party reference data without having to perform complex and expensive ETL or ELT processing before making the data available for analysis and reporting.

Please reach out to us if you need any guidance in implementing this solution. Refer to Real-time analytics with Amazon Redshift streaming ingestion, Getting started guide for near-real time operational analytics using Amazon Aurora zero-ETL integration with Amazon Redshift, and Working with AWS Data Exchange data shares as a producer for more information.


About the Authors

Satesh Sonti is a Sr. Analytics Specialist Solutions Architect based out of Atlanta, specialized in building enterprise data platforms, data warehousing, and analytics solutions. He has over 18 years of experience in building data assets and leading complex data platform programs for banking and insurance clients across the globe.

Alket Memushaj works as a Principal Architect in the Financial Services Market Development team at AWS. Alket is responsible for technical strategy for capital markets, working with partners and customers to deploy applications across the trade lifecycle to the AWS Cloud, including market connectivity, trading systems, and pre- and post-trade analytics and research platforms.

Ruben Falk is a Capital Markets Specialist focused on AI and data & analytics. Ruben consults with capital markets participants on modern data architecture and systematic investment processes. He joined AWS from S&P Global Market Intelligence where he was Global Head of Investment Management Solutions.

Jeff Wilson is a World-wide Go-to-market Specialist with 15 years of experience working with analytic platforms. His current focus is sharing the benefits of using Amazon Redshift, Amazon’s native cloud data warehouse. Jeff is based in Florida and has been with AWS since 2019.

Export a Software Bill of Materials using Amazon Inspector

Post Syndicated from Varun Sharma original https://aws.amazon.com/blogs/security/export-a-software-bill-of-materials-using-amazon-inspector/

Amazon Inspector is an automated vulnerability management service that continually scans Amazon Web Services (AWS) workloads for software vulnerabilities and unintended network exposure. Amazon Inspector has expanded its capabilities to allow customers to export a consolidated Software Bill of Materials (SBOM) for supported Amazon Inspector monitored resources, excluding Windows EC2 instances.

Customers have asked us to provide additional software application inventory collected from Amazon Inspector monitored resources. An SBOM makes it possible to precisely track the software supply chain and relate it to current Amazon Inspector findings. Generating an SBOM gives you critical security information and visibility into your software supply chain, including the packages you use most frequently and the related vulnerabilities that might affect your whole company.

This blog post includes steps that you can follow to export a consolidated SBOM for the resources monitored by Amazon Inspector across your organization in industry standard formats, including CycloneDx and SPDX. It also shares insights and approaches for analyzing SBOM artifacts using Amazon Athena.

Overview

An SBOM is defined as a nested inventory with a list of ingredients that make up software components. Security teams can export a consolidated SBOM to Amazon Simple Storage Service (Amazon S3) for an entire organization from the resource coverage page in the AWS Management Console for Amazon Inspector.

Using CycloneDx and SPDX industry standard formats, you can use insights gained from an SBOM to make decisions such as which software packages need to be updated across your organization or deprecated, if there’s no other option. Individual application or security engineers can also export an SBOM for a single resource or group of resources by applying filters for a specific account, resource type, resource ID, tags, or a combination of these as a part of the SBOM export workflow in the console or application programming interfaces.

Exporting SBOMs

To export Amazon Inspector SBOM reports to an S3 bucket, you must create and configure a bucket in the AWS Region where the SBOM reports are to be exported. You must configure your bucket permissions to allow only Amazon Inspector to put new objects into the bucket. This prevents other AWS services and users from adding objects to the bucket.

Each SBOM report is stored in the S3 bucket and is named according to the export format that you specify, Cyclonedx_1_4 (Json) or Spdx_2_3-compatible (Json). You can also use S3 event notifications to alert different operational teams that new SBOM reports have been exported.

Amazon Inspector requires that you use an AWS Key Management Service (AWS KMS) key to encrypt the SBOM report. The key must be a customer managed, symmetric KMS encryption key and must be in the same Region as the S3 bucket that you configured to store the SBOM report. The new KMS key for the SBOM report requires a key policy to be configured to grant permissions for Amazon Inspector to use the key. (Shown in Figure 1.)

Figure 1: Amazon Inspector SBOM export

Deploy prerequisites

The AWS CloudFormation template provided creates an S3 bucket with an associated bucket policy to enable Amazon Inspector to export SBOM report objects into the bucket. The template also creates a new KMS key to be used for SBOM report exports and grants the Amazon Inspector service permissions to use the key.

The export can be initiated from the Amazon Inspector delegated administrator account or the Amazon Inspector administrator account itself. This way, the S3 bucket contains reports for the Amazon Inspector member accounts. To export the SBOM reports, make sure the CloudFormation template is deployed in the same AWS account and Region as Amazon Inspector. If you enabled Amazon Inspector in multiple Regions, the CloudFormation stack must be deployed in each Region where Amazon Inspector is enabled.

To deploy the CloudFormation template

  1. Choose the following Launch Stack button to launch a CloudFormation stack in your account.

    Launch Stack

  2. Review the stack name and the parameters (MyKMSKeyName and MyS3BucketName) for the template. Note that the S3 bucket name must be unique.
  3. Choose Next and confirm the stack options.
  4. Go to the next page and choose Submit. The deployment of the CloudFormation stack will take 1–2 minutes.

After the CloudFormation stack has deployed successfully, you can use the S3 bucket and KMS key created by the stack to export SBOM reports.

Export SBOM reports

After setup is complete, you can export SBOM reports to an S3 bucket.

To export SBOM reports from the console

  1. Navigate to the Amazon Inspector console in the same Region where the S3 bucket and KMS key were created.
  2. Select Export SBOMs from the navigation pane.
  3. Add filters to create reports for specific subsets of resources. The SBOMs for all active, supported resources are exported if you don’t supply a filter.
  4. Select the export file type you want. Options are Cyclonedx_1_4 (Json) or Spdx_2_3-compatible (Json).
  5. Enter the S3 bucket URI from the output section of the CloudFormation template and enter the KMS key that was created.
  6. Choose Export. It can take 3–5 minutes to complete depending on the number of artifacts to be exported.

Figure 2: SBOM export configuration

When complete, all SBOM artifacts will be in the S3 bucket. This gives you the flexibility to download the SBOM artifacts from the S3 bucket, or you can use Amazon S3 Select to retrieve a subset of data from an object using standard SQL queries.
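
For example, a minimal S3 Select sketch such as the following could pull the primary component name out of a single exported report object, assuming the report is a JSON document whose metadata.component.name path matches the schema used in the Athena queries later in this post:

SELECT s.metadata.component.name
FROM S3Object s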

Figure 3: Amazon S3 Select

You can also run advanced queries using Amazon Athena or create dashboards using Amazon QuickSight to gain insights and map trends.

Querying and visualization

With Athena, you can run SQL queries on raw data that’s stored in S3 buckets. The Amazon Inspector reports are exported to an S3 bucket, and you can query the data and create tables by following the Adding an AWS Glue crawler tutorial.

To enable AWS Glue to crawl the S3 data, you must add the crawler role described in the AWS Glue crawler tutorial to the AWS KMS key policy so that AWS Glue can decrypt the S3 data.

The following is an example key policy statement (JSON) that you can update for your use case. Make sure to replace the AWS account ID <111122223333> and the AWS Glue service role name with your own information.

{
    "Sid": "Allow the AWS Glue crawler usage of the KMS key",
    "Effect": "Allow",
    "Principal": {
        "AWS": "arn:aws:iam::<111122223333>:role/service-role/AWSGlueServiceRole-S3InspectorSBOMReports"
    },
    "Action": [
        "kms:Decrypt",
        "kms:GenerateDataKey*"
    ],
    "Resource": "*"
},

Note: The role created for AWS Glue also needs permission to read the S3 bucket where the reports are exported in order to create the crawlers. The AWS Glue AWS Identity and Access Management (IAM) role allows the crawler to run and access your Amazon S3 data stores.

After an AWS Glue Data Catalog has been built, you can run the crawler on a scheduled basis to help ensure that it’s kept up to date with the latest Amazon Inspector SBOM manifests as they’re exported into the S3 bucket.

You can then navigate to the table added by the crawler and view the data in Athena. Using Athena, you can run queries against the Amazon Inspector reports to generate output data relevant to your environment. The schema of the generated SBOM report differs depending on the specific resources (Amazon Elastic Compute Cloud (Amazon EC2), AWS Lambda, Amazon Elastic Container Registry (Amazon ECR)) in the reports. Depending on the schema, you can create an Athena SQL query to fetch information from the reports.

The following is an example Athena query that identifies the top 10 vulnerabilities for resources in an SBOM report. You can use the Common Vulnerabilities and Exposures (CVE) IDs from the report to list the individual components affected by the CVEs.

SELECT
   account,
   vuln.id AS vuln_id,
   count(*) AS vuln_count
FROM
   <Insert_table_name>,
   UNNEST(<Insert_table_name>.vulnerabilities) AS t(vuln)
GROUP BY
   account,
   vuln.id
ORDER BY
   vuln_count DESC
LIMIT 10;

The following Athena example query can be used to identify the top 10 operating systems (OS) along with the resource types and their count.

SELECT
   resource,
   metadata.component.name as os_name,
   count(*) as os_count 
FROM
   <Insert_table_name>
WHERE
   resource = 'AWS_LAMBDA_FUNCTION'
GROUP BY
   resource,
   metadata.component.name 
ORDER BY
   os_count DESC 
LIMIT 10;

If you have a package that has a critical vulnerability and you need to know if the package is used as a primary package or adds a dependency, you can use the following Athena sample query to check for the package in your application. In this example, I’m searching for a Log4j package. The result returns account ID, resource type, package_name, and package_count.

SELECT
   account,
   resource,
   comp.name AS package_name,
   count(*) AS package_count
FROM
   <Insert_table_name>,
   UNNEST(<Insert_table_name>.components) AS t(comp)
WHERE
   comp.name = 'Log4j'
GROUP BY
   account,
   comp.name,
   resource
ORDER BY
   package_count DESC
LIMIT 10;

Note: The sample Athena queries must be customized depending on the schema of the SBOM export report.

To further extend this solution, you can use Amazon QuickSight to produce dashboards to visualize the data by connecting to the AWS Glue table.

Conclusion

The new SBOM generation capabilities in Amazon Inspector improve visibility into the software supply chain by providing a comprehensive list of software packages across multiple levels of dependencies. You can also use SBOMs to monitor the licensing information for each of the software packages and identify potential licensing violations in your organization, helping you avoid potential legal risks.

The most important benefit of SBOM export is to help you comply with industry regulations and standards. Because the reports use industry-standard formats (SPDX and CycloneDX) and integrate easily with other tools, systems, or services (such as Nexus IQ and WhiteSource), you can streamline incident response processes, improve the accuracy and speed of security assessments, and maintain compliance with regulatory requirements.

In addition to these benefits, the SBOM export feature provides a comprehensive and accurate understanding of the OS packages and software libraries found in your resources, further enhancing your ability to adhere to industry regulations and standards.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about the information shared in this post, start a new thread on AWS re:Post or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Varun Sharma

Varun is an AWS Cloud Security Engineer who wears his security cape proudly. With a knack for unravelling the mysteries of Amazon Cognito and IAM, Varun is a go-to subject matter expert for these services. When he’s not busy securing the cloud, you’ll find him in the world of security penetration testing. And when the pixels are at rest, Varun switches gears to capture the beauty of nature through the lens of his camera.