Tag Archives: Technical How-to

How to send web push notifications using Amazon Pinpoint

Post Syndicated from arrohan original https://aws.amazon.com/blogs/messaging-and-targeting/how-to-send-web-push-notifications-using-amazon-pinpoint/

How to send push notifications on any website using AWS messaging tools

Web Push Notifications (also known as browser push notifications) are messages from a website you receive in your browser. These messages are intended to be rich, contextual, timely, personalized and best used to engage, re-engage, and retain website visitors. For instance, as a website owner you could use web push notifications to notify users about sales, important updates or new content on your website.

How are web push notifications different from native app push notifications?

Push notifications are short messages that are displayed directly on the user’s screen sent via mobile applications, providing timely information and messages like order status, promotions, or relevant news in the application.

Web push notifications are simply push notifications sent via web browsers (the browser application on the device), and they work across platforms – Desktop, mobile and tablet.  They are a newer channel than push notifications, and have now become a part of the modern marketing strategy alongside native app push notifications, emails and SMS.

In the case of mobile apps, the user must install the application to receive push notifications. In the case of web push, there is no need to download any software—it just takes one click on your website.

Why are Web Push Notifications useful?

Let’s consider a real-world example. Suppose you are an e-commerce website where customers can purchase products. Once, the purchase has been made, customers would be interested in getting real time updates of where the package is in transit, when is it likely to be delivered, a confirmation that the shipment has been delivered and so on. Web push notifications can be an excellent way of providing such updates. Accessing email on mobile devices is often unwieldy, SMS messages cannot support images and are constrained in length (also they typically they cost more money to send!). Push notifications are perfect for such a use case. Till now, the major constraint was that it would require users to install your app on their device. Web push notifications gives website owners and customers the power of push notifications without any need for driving app installs.

Marketers in a variety of sectors like travel, publishing, restaurant & delivery, finance and insurance can use push notifications to improve their down-to-funnel conversions.  From new content alerts to limited-time promotions to upcoming events, push messages are short, crisp and drive engagement, conversion, and retention. A short search on the AWS blogs website gives us a number of examples of businesses who have created value for their customers with the help of push notifications. Some of the key advantages of web push notifications are:

  • Easy opt in model: Unlike other marketing channels like email or SMS, web push notifications offer users a seamless opt-in experience ― Users simply select `Allow’ on a browser permission prompt. Users do not have to worry about sharing their personal data, like their name, email, or phone number nor do they have to go to the play store/app store and install an app on their device.
  • Increased Engagement:  Push notifications appear on a user’s desktop or mobile screen and are quick to grab attention. Since push messages are real time and have high visibility – they typically enjoy higher “Click Through Rate (CTR)” as compared to other channels like SMS or email.
  • Reach users even when they are not on your website: Web Push Notifications from your website are delivered and shown to the customer even if the user is visiting some other site or on some other app. In this respect (and most others), it is quite similar to app push notifications. Even if subscribers were offline when you sent your push campaign, they will get the push message delivered to them the next time they come online.
  • No need for users to install native apps: One of the most compelling reason for installing mobile apps, is because users could stay updated with the latest and the greatest – thanks to app push notifications. The additional cost of going to the play store/app store and installing the app is something which would often discourage users. This is especially true for countries and regions where users are still on lower end phones with limited storage space. Users would often have to uninstall apps (which might include yours too) that they do not use frequently in order to make space for other stuff.
  • Makes websites richer and more memorable: If you ask a room of developers what mobile device features are missing from the web, push notifications are always high on the list. This is no longer the case since browsers are increasingly adding support for web push notifications and this has offered website owners a powerful cross platform (Desktop & Mobile, Android & iOS) alternative as against developing and maintaining different native apps for different platforms. Web push notifications even appear quite similar to native mobile push on most smartphones.
  • Lower Cost: Unlike channels like SMS, sending web push notifications is absolutely free as browsers themselves offer support for it by adhering to the web push protocol. The only costs incurred will be that of sending push notifications as per the Pinpoint pricing policy.
  • Popular browsers support web push: Google Chrome, Firefox, Opera, Edge support web push on both Mobile and Desktop. What’s more the support for web push is continuously getting better. Refer to this link for the latest support status matrix across browsers and form factors.

What is Amazon Pinpoint?

Amazon Pinpoint is an AWS service that provides scalable, targeted multichannel communications. Amazon Pinpoint enables companies to send messages to customers through SMS, push notifications, in-app notifications, email, and voice channels. To learn more about Amazon Pinpoint, visit the website and documentation.

Web Push support on Firebase Cloud Messaging (FCM):

Firebase uses cloud services for its notification services on Android, iOS & Web. Firebase Cloud Messaging or FCM run on basic principles of tokens, which is uniquely generated for each device & later used for sending messages to respective devices. There are two key advantages of using FCM for sending web push notifications:

  • Abstracts away the complexity of onboarding to the web push protocol for push messages: Sending web push notifications directly without any third party in between requires your website to add support for the web push protocol. Adherence to the web push protocol requires website owners to perform some steps specific to wpn like adding VAPID headers and payload encryption of push messages. This would be additional work for website owners, especially for those businesses which are already onboarded to FCM for sending native app push notifications. FCM server side apis for sending web push notifications work pretty much the same way as they work for native apps. They abstract away the additional complexity of sending web push messages.
  • Send push notifications from Amazon Pinpoint via FCM: Amazon Pinpoint already supports integration with FCM, refer to documentation. Similar to how we add a FCM project in Pinpoint to send push messages to native android apps, in this blog post we will see how a similar integration can be leveraged to send web push notifications.

Advantages of sending Web Push Notifications with Amazon Pinpoint:

Now at this point, you might be thinking, Web push notifications can go a long way towards delighting customers and FCM already abstracts the complexities of sending web push. So why do I need Amazon Pinpoint?

Well, integration with Amazon Pinpoint offers a number of advantages. Here are a few:

  • Map FCM tokens to actual users and web app ‘installs’: FCM would give you tokens for each user on your website who subscribes for web push. Roughly speaking, an FCM token for each web app install with permissions to send push messages. To be able to send messages to these users we would need to store the FCM tokens for each user/web app install/browser instance. Amazon Pinpoint treats each browser instance as an endpoint and enables you to save the push tokens in the same way in which we would store native push tokens/mobile numbers/email addresses, i.e., as a primary identifier for that endpoint. This enables us to send messages to Pinpoint endpoints without caring about the underlying complexity of storing and managing push tokens.
  • Intelligently send web push, map user attributes to push tokens: Along with the push token, each pinpoint endpoint can also store other attributes like device characteristics, user Id and user attributes. This helps us to create dynamic and complex segments which can be used to send targeted web push notifications.
  • It is essentially the same as sending android native push: Create an FCM project, create FCM tokens, Create Pinpoint endpoints with the tokens, send push campaigns to those endpoints. Swap out native android code with service workers, JavaScript on the client and you get web push. It really is that simple.
  • Web push, native push, SMS or emails. One stop shop for reaching out to users on all channels: Pinpoint becomes your single backend for reaching out to users across multiple channels. For app users, send them app push, for users who prefer the web, you have web push.
  • Leverage Pinpoint features like Campaign Management, Events, Analytics and Segments: Read up about Amazon Pinpoint. It has a lot of great features which can help you better engage your users.

In this blog we will see how to send web push notifications using Amazon Pinpoint on a website built using AWS Amplify.

 Overview of solution

Enable web push by using FCM as an intermediary service and Pinpoint as an app server (map FCM tokens to actual users) and a push campaign management tool. Integrate web push protocol, FCM and Amazon Pinpoint.

Overview of how to setup Web Push - registering the customer

Overview of how to setup Web Push - sending push notifications

Walkthrough

In this blog post, we will create a simple demo website using Amplify which can be used to create web push subscriptions and also receive web push messages. We will integrate this website with FCM js sdk and Amazon Pinpoint to store the FCM push tokens on Pinpoint. Later we will see how to send web push notifications using Amazon Pinpoint with FCM acting as an intermediary.

The above can be broken down into the below simple and independent steps:

  • Create a project on FCM.
  • Generate web push notifications server keys on FCM.
  • Create a simple web app (website) using Amplify
  • Create an Amazon Pinpoint project. This is a one-line command which will be done as part of Amplify web app setup.
  • Make your amplify website web push capable. In this step we will also integrate with the FCM sdk for web push.
  • Configure the Pinpoint project and integrate it with FCM. It just involves adding the FCM server key to Pinpoint.
  • Go to the Amazon Pinpoint console and send a test web push message from your website. And we are done!

You can see checkout my demo website here.

The source code for this demo website (and the blog) is available here.

Prerequisites – Essentials

For this walkthrough, you should have the following prerequisites:

Prerequisites – Recommended

In addition to the necessary prerequisites mentioned above, I would highly recommend readers to go through the below in order to derive maximum value from this blog post.

  • Web Push fundamentals: Some basic reading up on web push notifications and going through a couple of relevant code samples. It is not compulsory to implement and understand everything, but it would be beneficial to have an elementary understanding of service workers, permissions, push subscriptions and notifications apis.
    • Introduction: Some of the sections are a bit detailed and complex, you need not go through all the sections completely at once. However, at least go through the overview and the how push works sections carefully.
    • Simple Code demo with explanations to help you get started.
  • FCM client-side code : You need not go through the send Message sections since we will not directly use FCM apis or the console. Instead, we will use the Pinpoint console to manage our push campaigns.
  • Building web apps with amplify: By the end of the tutorial, you should get clarity on how to build and host web apps using amplify. It will also help you become familiar with the amplify cli tool.
  • Read up on Amazon Pinpoint.

Setting up the demo web app

Let’s deploy the demo web app using AWS Amplify to see how all the parts come together.

Clone the code for the sample web app

git clone ssh://git.amazon.com/pkg/ArrohanWebPushPoc (branch: PinpointBlog) <github_link>

Create an FCM account and a project on the FCM developer console, on the FCM project add web push as a channel

It is possible that Firebase may change the UI of the console in the future so the given screenshots may not be exactly reflective of the UI, but the broad steps would remain the same.

  • Under “Engage”  click on ‘Cloud Messaging Tab’.  The page url should typically be of the form: https://console.firebase.google.com/u/0/project/<name_of_your_project>/notification.

Setting up web push - Getting the firebase push config

  • Under the option “Add an app to get started”, Click on the “web/javascript” (the one with the </> symbol) app.
  • Once you have created the project,  go to project settings. Click on General Tab. Replace the values in firebaseConfig main.js with the actual values for your project.

Setting up web push - Copying the firebase push config

Setting up web push - Code pointer for the firebase push config

Generate a public-private key pair for the FCM cloud messaging project

  • Under project settings, switch to the cloud messaging tab. Click generate key pair under web push certificates to generate a public-private key pair.

Setting up web push - Generate a public-private vapid key pair

  • Replace the <YOUR PUBLIC KEY> in the file main.js in the source code with the vapid public key you generated in the previous step.

Setting up web push - Copying the FCM server key

  • Note the Server key, you will need it during pinpoint project setup.

Setup an Amplify web app and integrate with pinpoint

  • Clone the code in the given repo, replace your FCM config and keys. Run npm install.
    • In case you face build errors due to package versions getting outdated (firebase, especially gets updated often, sometimes with breaking changes), please update the dependencies to the latest version. This post offers an easy way to identify outdated dependencies and update them.
  • Setup an Amplify web app. Note, for the purpose of this demo, you just need to setup a simple static website. Simply run amplify init. Enter the required details, the default config should work fine.

Setting up the example app for web push using Amazon Pinpoint

  • Create a pinpoint project and integrate with our web app through amplify cli:
    • Create a pinpoint project: Run amplify add analytics. Choose Amazon Pinpoint as the analytics provider and accept all defaults.
    • Please note when you add “analytics” to the project you will get a prompt which says something like – “Apps need authorization to send analytics events. Do you want to allow guests and unauthenticated users to send analytics events? (We recommend you allow this when getting started)” – Please accept and answer Yes when you get this.
    • Push to AWS: Run amplify push. A configuration file (aws-exports.js) will be added to the source directory. Notice, we are calling this file from our main.js file.

Setting up the example app for web push using Amazon Pinpoint

Pinpoint Project setup

  • Get the Server key of the FCM project you created earlier.

Setting up web push - Add FCM key to Pinpoint console

Run the web app and subscribe for push notifications

  • Run npm start, our web app will be running on http://localhost:8080/index.html
  • Click on enable push messages and click allow/accept on the browser permission prompt which follows. Once it is enabled, you will see a FCM token on the page, copy the token.

How the example app looks

Send a web push notification from the Amazon Pinpoint console

  • Open the pinpoint project you created in the previous step on the Pinpoint console. Click on your project and then go to test messaging. The process is exactly the same as the one for native apps described here. Under “destination type” select “Device Tokens” and paste the FCM token you copied in the previous step.

Sending a web push from the Pinpoint console

  • Fill in title, body and optionally URL (“Go to a URL” under “Actions”). Click on Send Message, you should get a push message on your browser.

How an example web push notification looks like on Desktop

Next steps

  • Host your app. Simply run amplify add hosting followed by amplify publish. Remember that for web push (service workers) to work, your site should be https.

Deploying the example app for web push using AWS Amplify

  • Create segments, campaigns and journeys on pinpoint and try sending web push messages through them.

Code Walkthrough

  • Gitfarmlink: https://code.amazon.com/packages/ArrohanWebPushPoc/trees/heads/PinpointBlog
  • File wise description:
    • package.json: Simple npm config file. It includes the list of dependencies and their versions used by our web app. For our use case, all we need is webpack and AWS Amplify.
    • package-lock.json: Auto generated config file generated by npm after resolving modules and package.json.
    • aws-exports.js: Auto generated configuration file created by Amplify cli. This file contains the configuration and endpoint metadata used to link your front end to your backend services. It will be structured similar to the sample config file.
    • webpack.config.js: Simple webpack configuration file
    • src: The folder which contains the source code for our web app. It contains:
      • service-worker.js: The service worker that we register for our website which is used to display push notifications. In the service worker we parse the notification payload sent through pinpoint and call the notification apis to display push notification with the appropriate fields.
      • index.html: The website html.
      • main.js: The heart of the web app. It does permission handling, push subscription management and communicates with FCM and Amazon Pinpoint.
      • images/icon-192×192.png: static icon that we display on our push messages. This would essentially be your website logo.

Conclusion

This small demo shows how we can send web push notifications using Amazon Pinpoint. As next steps to come up with an actual production ready solution, you can look into the following:

  • Develop deeper understanding and expertise on web push
  • Richer and smarter push notification: Add big images, action buttons, replace notifications using tags (for example, sports score updates) and explore other features in the show notifications api.
  • Smart push notifications: add custom business logic in the payload. Hint: use the “body” (“pinpoint.notification.body“) field on the pinpoint console to send a custom json string.
  • Driving more subscriptions: Leverage Amplify Analytics to track how users interact with the push subscribe UI. Think of where and how you might ask users to subscribe to drive maximum engagement.
  • Easy unsubscribe: Allow users an easy option to disable push notifications without having to block you from browser settings. Also, make sure that you are disabling that endpoint on pinpoint. Hint: use the updateEndpoint api and pass optOut from ‘ALL’ as the argument.
  • Targeted and personalised push notifications: Leverage Pinpoint segments to send users push notifications according to their interests and requirements. Hint: add user data to endpoints and use it to filter and create targeted segments.
  • Campaign management: Leverage pinpoint features like segments, analytics, campaigns, journeys and more!

Project Cleanup

In this section we will quickly go over the steps to delete the resources we created for this demo to make sure that we do not incur any charges.

  • Cleaning up all AWS resources including the Amazon Pinpoint project, S3 buckets for hosting (and any other resources you may have added): Simply run amplify delete from the project directory on your command line.
  • Cleaning up the FCM Project: Refer to the FCM support page for the steps to delete a project – https://support.google.com/firebase/answer/9137886 .
    • Open the project settings page: The URL will be of the form https://console.firebase.google.com/u/0/project/<your_project_identifier>/settings/general
    • Click on the delete project button at the bottom of the page.

Access Amazon Athena in your applications using the WebSocket API

Post Syndicated from Abhi Sodhani original https://aws.amazon.com/blogs/big-data/access-amazon-athena-in-your-applications-using-the-websocket-api/

Modern applications are built with modular independent components or microservices that rely on an API framework to communicate with services. Many organizations are building data lakes to store and analyze large volumes of structured, semi-structured, and unstructured data. In addition, many teams are moving towards a data mesh architecture, which requires them to expose their data sets as easily consumable data products. To accomplish this on AWS, organizations use Amazon Simple Storage Service (Amazon S3) to provide cheap and reliable object storage to house their datasets. To enable interactive querying and analyzing their data in place using familiar SQL syntax, many teams are turning to Amazon Athena. Athena is an interactive query service that is used by modern applications to query large volumes of data on an S3 data lake using standard SQL.

When working with SQL databases, application developers and business analysts are most familiar with simple permissions management and synchronous query-response protocols—if a user has permissions to submit a query, they do so and receive the results from the server when the query is complete. Directly accessing Athena APIs, for example when integrating with a custom web application, requires an AWS Identity and Access Management (IAM) role for the applications, and requires you to build a custom process to poll for query completion asynchronously. The IAM role needs access to run Athena API calls, as well as S3 permissions to retrieve the Athena output stored on Amazon S3. Polling for Athena query completion when performed at several intervals could result in increased latency from the client perspective.

In this post, we present a solution that can integrate with your front-end application to query data from Amazon S3 using an Athena synchronous API invocation. With this solution, you can add a layer of abstraction to your application on direct Athena API calls and promote the access using the WebSocket API developed with Amazon API Gateway. The query results are returned back to the application as Amazon S3 presigned URLs.

Overview of solution

For illustration purposes, this post builds a COVID-19 data lake with a WebSocket API to handle Athena queries. Your application can invoke the WebSocket API to pull the data from Amazon S3 using an Athena SQL query, and the WebSocket API returns the JSON response with the presigned Amazon S3 URL. The application needs to parse the JSON message to read the presigned URL, download the data to local, and report the data back to the front end.

We use AWS Step Functions to poll the Athena query run. When the query is complete, Step Functions invokes an AWS Lambda function to generate the presigned URL and send the request back to the application.

The application doesn’t require direct access to Athena, just access to invoke the API. When using this solution, you should secure the API following AWS guidelines. For more information, refer to Controlling and managing access to a WebSocket API in API Gateway.

The following diagram summarizes the architecture, key components, and interactions in the solution.

Architecture diagram for the Athena WebSocket API. The user connects to the API through API Gateway. API Gateway uses Lambda and DynamoDB to store session data. SQL queries are routed to Amazon Athena and a Step Function polls for query status and returns the results back to the user.

The application is composed of the WebSocket API in API Gateway, which handles the connectivity between the client and Athena. A client application using the framework can submit the Athena SQL query and get back the presigned URL containing the query results data. The workflow includes the following steps:

  1. The application invokes the WebSocket API connection.
  2. A Lambda function is invoked to initiate the connection. The connection ID is stored in an Amazon DynamoDB
  3. When the client application is connected, it can invoke the runquery action, which invokes the RunQuery Lambda function.
  4. The function first runs the Athena query.
  5. When the query is started, the function checks the status and uses Step Functions to track the query progress.
  6. Step Functions invokes the third Lambda function to read the processed Athena results and get the presigned S3 URL. Failed messages are routed to an Amazon Simple Notification Service (Amazon SNS) topic, which you can subscribe to.
  7. The presigned URL is returned to the client application.
  8. The connection is closed using the OnDisconnect function.

The RunQuery Lambda function runs the Athena query using the start_query_execution request:

def run_query(client, query):
    """This function executes and sends the query request to Athena."""
    response = client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={
            'Database': params['Database']
        },
        ResultConfiguration={
            'OutputLocation': f's3://{params["BucketName"]}/{params["OutputDir"]}/'
        },
        WorkGroup=params["WorkGroup"]
    )
    return response

The Amazon S3 presigned URL is generated by invoking the generate_presigned_url request with the bucket and key information that hosts the Athena results. The code hard codes the presigner expiration to 120 seconds, which is configurable in the function input parameter PreSignerExpireSeconds. See the following code:

def signed_get_url(event):
    s3 = boto3.client('s3', region_name=params['Region'], config=Config(signature_version='s3v4'))
    # User provided body with object info
    bodyData = json.loads(event['body'])
    try:
        url = s3.generate_presigned_url(
            ClientMethod='get_object',
            Params={
                'Bucket': params['BucketName'],
                'Key': bodyData["ObjectName"]
            },
            ExpiresIn=int(params['PreSignerExpireSeconds'])
        )
        body = {'PreSignedUrl': url, 'ExpiresIn': params['PreSignerExpireSeconds']}
        response = {
            'statusCode': 200,
			'body': json.dumps(body),
            'headers': cors.global_returns["Allow Origin Header"]
        }
        logger.info(f"[MESSAGE] Response for PreSignedURL: {response}")
    except Exception as e:
        logger.exception(f"[MESSAGE] Unable to generate URL: {str(e)}")
        response = {
            'statusCode': 502,
            'body': 'Unable to generate PreSignedUrl',
            'headers': cors.global_returns["Allow Origin Header"]
        }
    return response

Prerequisites

This post assumes you have the following:

  • Access to an AWS account
  • Permissions to create an AWS CloudFormation stack
  • Permissions to create the following resources:
    • AWS Glue catalog databases and tables
    • API Gateway
    • Lambda function
    • IAM roles
    • Step Functions state machine
    • SNS topic
    • DynamoDB table

Enable the WebSocket API

To enable the WebSocket API of API Gateway, complete the following steps:

  1. Configure the Athena dataset.

To make the data from the AWS COVID-19 data lake available in the Data Catalog in your AWS account, create a CloudFormation stack using the following template. If you’re signed in to your AWS account, the following page fills out most of the stack creation form for you. All you need to do is choose Create stack. For instructions on creating a CloudFormation stack, see Getting started with AWS CloudFormation.

You can also use an existing Athena database to query, in which case you need to update the stack parameters.

  1. Sign in to the Athena console.

If this is the first time you’re using Athena, you must specify a query result location on Amazon S3. For more information about querying and accessing the data from Athena, see A public data lake for analysis of COVID-19 data.

  1. Configure the WebSocket framework using the following page, which deploys the API infrastructure using AWS Serverless Application Model (AWS SAM).
  2. Update the parameters pBucketName with the S3 bucket (in the us-east-2 region) that stores the Athena results and also update the database if you want to query an existing database.
  3. Select the check box to acknowledge creation of IAM roles and choose Deploy.

At a high level, these are the primary resources deployed by the application template:

  • An API Gateway with routes to the connect, disconnect, and query Lambda functions. Note that the API Gateway deployed with this sample doesn’t implement authentication and authorization. We recommend that you implement authentication and authorization before deploying into a production environment. Refer to Controlling and managing access to a WebSocket API in API Gateway to understand how to implement these security controls.
  • A DynamoDB table for tracking client connections.
  • Lambda functions to manage connection states using DynamoDB.
  • A Lambda function to run the query and start the step function. The function includes an associated IAM role and policies with permissions to Step Functions, the AWS Glue Data Catalog, Athena, AWS Key Management Service (AWS KMS), and Amazon S3. Note that the Lambda execution role gives read access to the Data Catalog and S3 bucket that you specify in the deployment parameters. We recommend that you don’t include a catalog that contains sensitive data without first understanding the impacts and implementing additional security controls.
  • A Lambda function with associated permissions to poll for the query results and return the presigned URL to the client.
  • A Step Functions state machine with associated permissions to run the polling Lambda function and send API notifications using Amazon SNS.

Test the setup

To test the WebSocket API, you can use wscat, an open-source command line tool.

  1. Install NPM.
  2. Install wscat:
$ npm install -g wscat
  1. On the console, connect to your published API endpoint by running the following command. The full URI to use can be found on the AWS CloudFormation console by finding the WebSocketURI output in the serverlessrepo-aws-app-athena-websocket-integration stack that was deployed by the AWS SAM application you deployed previously.
$ wscat -c wss://{YOUR-API-ID}.execute-api.{YOUR-REGION}.amazonaws.com/{STAGE}
  1. To test the runquery function, send a JSON message like the following example. This triggers the state machine to run your SQL query using Athena and, using Lambda, return an S3 presigned URL to your client, which you can access to download the query results. Note that the API accepts any valid Athena query. Additional query validation could be added to the internal Lambda function if desired.
$ wscat -c wss://{YOUR-API-ID}.execute-api.{YOUR-REGION}.amazonaws.com/{STAGE}
Connected (press CTRL+C to quit)
> {"action":"runquery", "data":"SELECT * FROM \"covid-19\".country_codes limit 5"}
< {"pre-signed-url": "https://xxx-s3.amazonaws.com/athena_api_access_results/xxxxx.csv?"}
  1. Copy the value for pre-signed-url and enter it into your browser window to access the results.

The presigned URL provides you temporary credentials to download the query results. For more information, refer to Using presigned URLs. This process can be integrated into a front-end web application to automatically download and display the results of the query.

Clean up

To avoid incurring ongoing charges, delete the resources you provisioned by deleting the CloudFormation stacks CovidLakeStacks and serverlessrepo-AthenaWebSocketIntegration via the AWS CloudFormation console. For detailed instructions, refer to the cleanup sections in the starter kit README files in the GitHub repo.

Conclusion

In this post, we showed how to integrate your application with Athena using the WebSocket API. We have included a GitHub repo for you to understand the code and modify it per your application requirements, to get the full benefits of the solution. We encourage you to further explore the features of the API Gateway WebSocket API to add in security using authorizers, view live invocations using dashboards, and expand the framework for more routes on action request.

Let’s stay in touch via the GitHub repo.


About the Authors

Abhi Sodhani is a Sr. AI/ML Solutions Architect at AWS. He helps customers with a wide range of solutions, including machine leaning, artificial intelligence, data lakes, data warehousing, and data visualization. Outside of work, he is passionate about books, yoga, and travel.

Robin Zimmerman's HeadshotRobin Zimmerman is a Data and ML Engineer with AWS Professional Services. He works with AWS enterprise customers to develop systems to extract value from large volumes of data using AWS data, analytics, and machine learning services. When he’s not working, you’ll probably find him in the mountains—rock climbing, skiing, mountain biking, or out on whatever other adventure he can dream up.

Use Apache Iceberg in a data lake to support incremental data processing

Post Syndicated from Flora Wu original https://aws.amazon.com/blogs/big-data/use-apache-iceberg-in-a-data-lake-to-support-incremental-data-processing/

Apache Iceberg is an open table format for very large analytic datasets, which captures metadata information on the state of datasets as they evolve and change over time. It adds tables to compute engines including Spark, Trino, PrestoDB, Flink, and Hive using a high-performance table format that works just like a SQL table. Iceberg has become very popular for its support for ACID transactions in data lakes and features like schema and partition evolution, time travel, and rollback.

Apache Iceberg integration is supported by AWS analytics services including Amazon EMR, Amazon Athena, and AWS Glue. Amazon EMR can provision clusters with Spark, Hive, Trino, and Flink that can run Iceberg. Starting with Amazon EMR version 6.5.0, you can use Iceberg with your EMR cluster without requiring a bootstrap action. In early 2022, AWS announced general availability of Athena ACID transactions, powered by Apache Iceberg. The recently released Athena query engine version 3 provides better integration with the Iceberg table format. AWS Glue 3.0 and later supports the Apache Iceberg framework for data lakes.

In this post, we discuss what customers want in modern data lakes and how Apache Iceberg helps address customer needs. Then we walk through a solution to build a high-performance and evolving Iceberg data lake on Amazon Simple Storage Service (Amazon S3) and process incremental data by running insert, update, and delete SQL statements. Finally, we show you how to performance tune the process to improve read and write performance.

How Apache Iceberg addresses what customers want in modern data lakes

More and more customers are building data lakes, with structured and unstructured data, to support many users, applications, and analytics tools. There is an increased need for data lakes to support database like features such as ACID transactions, record-level updates and deletes, time travel, and rollback. Apache Iceberg is designed to support these features on cost-effective petabyte-scale data lakes on Amazon S3.

Apache Iceberg addresses customer needs by capturing rich metadata information about the dataset at the time the individual data files are created. There are three layers in the architecture of an Iceberg table: the Iceberg catalog, the metadata layer, and the data layer, as depicted in the following figure (source).

The Iceberg catalog stores the metadata pointer to the current table metadata file. When a select query is reading an Iceberg table, the query engine first goes to the Iceberg catalog, then retrieves the location of the current metadata file. Whenever there is an update to the Iceberg table, a new snapshot of the table is created, and the metadata pointer points to the current table metadata file.

The following is an example Iceberg catalog with AWS Glue implementation. You can see the database name, the location (S3 path) of the Iceberg table, and the metadata location.

The metadata layer has three types of files: the metadata file, manifest list, and manifest file in a hierarchy. At the top of the hierarchy is the metadata file, which stores information about the table’s schema, partition information, and snapshots. The snapshot points to the manifest list. The manifest list has the information about each manifest file that makes up the snapshot, such as location of the manifest file, the partitions it belongs to, and the lower and upper bounds for partition columns for the data files it tracks. The manifest file tracks data files as well as additional details about each file, such as the file format. All three files work in a hierarchy to track the snapshots, schema, partitioning, properties, and data files in an Iceberg table.

The data layer has the individual data files of the Iceberg table. Iceberg supports a wide range of file formats including Parquet, ORC, and Avro. Because the Iceberg table tracks the individual data files instead of only pointing to the partition location with data files, it isolates the writing operations from reading operations. You can write the data files at any time, but only commit the change explicitly, which creates a new version of the snapshot and metadata files.

Solution overview

In this post, we walk you through a solution to build a high-performing Apache Iceberg data lake on Amazon S3; process incremental data with insert, update, and delete SQL statements; and tune the Iceberg table to improve read and write performance. The following diagram illustrates the solution architecture.

To demonstrate this solution, we use the Amazon Customer Reviews dataset in an S3 bucket (s3://amazon-reviews-pds/parquet/). In real use case, it would be raw data stored in your S3 bucket. We can check the data size with the following code in the AWS Command Line Interface (AWS CLI):

//Run this AWS CLI command to check the data size
aws s3 ls --summarize --human-readable --recursive s3://amazon-reviews-pds/parquet

The total object count is 430, and total size is 47.4 GiB.

To set up and test this solution, we complete the following high-level steps:

  1. Set up an S3 bucket in the curated zone to store converted data in Iceberg table format.
  2. Launch an EMR cluster with appropriate configurations for Apache Iceberg.
  3. Create a notebook in EMR Studio.
  4. Configure the Spark session for Apache Iceberg.
  5. Convert data to Iceberg table format and move data to the curated zone.
  6. Run insert, update, and delete queries in Athena to process incremental data.
  7. Carry out performance tuning.

Prerequisites

To follow along with this walkthrough, you must have an AWS account with an AWS Identity and Access Management (IAM) role that has sufficient access to provision the required resources.

Set up the S3 bucket for Iceberg data in the curated zone in your data lake

Choose the Region in which you want to create the S3 bucket and provide a unique name:

s3://iceberg-curated-blog-data

Launch an EMR cluster to run Iceberg jobs using Spark

You can create an EMR cluster from the AWS Management Console, Amazon EMR CLI, or AWS Cloud Development Kit (AWS CDK). For this post, we walk you through how to create an EMR cluster from the console.

  1. On the Amazon EMR console, choose Create cluster.
  2. Choose Advanced options.
  3. For Software Configuration, choose the latest Amazon EMR release. As of January 2023, the latest release is 6.9.0. Iceberg requires release 6.5.0 and above.
  4. Select JupyterEnterpriseGateway and Spark as the software to install.
  5. For Edit software settings, select Enter configuration and enter [{"classification":"iceberg-defaults","properties":{"iceberg.enabled":true}}].
  6. Leave other settings at their default and choose Next.
  7. For Hardware, use the default setting.
  8. Choose Next.
  9. For Cluster name, enter a name. We use iceberg-blog-cluster.
  10. Leave the remaining settings unchanged and choose Next.
  11. Choose Create cluster.

Create a notebook in EMR Studio

We now walk you through how to create a notebook in EMR Studio from the console.

  1. On the IAM console, create an EMR Studio service role.
  2. On the Amazon EMR console, choose EMR Studio.
  3. Choose Get started.

The Get started page appears in a new tab.

  1. Choose Create Studio in the new tab.
  2. Enter a name. We use iceberg-studio.
  3. Choose the same VPC and subnet as those for the EMR cluster, and the default security group.
  4. Choose AWS Identity and Access Management (IAM) for authentication, and choose the EMR Studio service role you just created.
  5. Choose an S3 path for Workspaces backup.
  6. Choose Create Studio.
  7. After the Studio is created, choose the Studio access URL.
  8. On the EMR Studio dashboard, choose Create workspace.
  9. Enter a name for your Workspace. We use iceberg-workspace.
  10. Expand Advanced configuration and choose Attach Workspace to an EMR cluster.
  11. Choose the EMR cluster you created earlier.
  12. Choose Create Workspace.
  13. Choose the Workspace name to open a new tab.

In the navigation pane, there is a notebook that has the same name as the Workspace. In our case, it is iceberg-workspace.

  1. Open the notebook.
  2. When prompted to choose a kernel, choose Spark.

Configure a Spark session for Apache Iceberg

Use the following code, providing your own S3 bucket name:

%%configure -f
{
"conf": {
"spark.sql.catalog.demo": "org.apache.iceberg.spark.SparkCatalog",
"spark.sql.catalog.demo.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
"spark.sql.catalog.demo.warehouse": "s3://iceberg-curated-blog-data",
"spark.sql.extensions":"org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
"spark.sql.catalog.demo.io-impl":"org.apache.iceberg.aws.s3.S3FileIO"
}
}

This sets the following Spark session configurations:

  • spark.sql.catalog.demo – Registers a Spark catalog named demo, which uses the Iceberg Spark catalog plugin.
  • spark.sql.catalog.demo.catalog-impl – The demo Spark catalog uses AWS Glue as the physical catalog to store Iceberg database and table information.
  • spark.sql.catalog.demo.warehouse – The demo Spark catalog stores all Iceberg metadata and data files under the root path defined by this property: s3://iceberg-curated-blog-data.
  • spark.sql.extensions – Adds support to Iceberg Spark SQL extensions, which allows you to run Iceberg Spark procedures and some Iceberg-only SQL commands (you use this in a later step).
  • spark.sql.catalog.demo.io-impl – Iceberg allows users to write data to Amazon S3 through S3FileIO. The AWS Glue Data Catalog by default uses this FileIO, and other catalogs can load this FileIO using the io-impl catalog property.

Convert data to Iceberg table format

You can use either Spark on Amazon EMR or Athena to load the Iceberg table. In the EMR Studio Workspace notebook Spark session, run the following commands to load the data:

// create a database in AWS Glue named reviews if not exist
spark.sql("CREATE DATABASE IF NOT EXISTS demo.reviews")

// load reviews - this load all the parquet files
val reviews_all_location = "s3://amazon-reviews-pds/parquet/"
val reviews_all = spark.read.parquet(reviews_all_location)

// write reviews data to an Iceberg v2 table
reviews_all.writeTo("demo.reviews.all_reviews").tableProperty("format-version", "2").createOrReplace()

After you run the code, you should find two prefixes created in your data warehouse S3 path (s3://iceberg-curated-blog-data/reviews.db/all_reviews): data and metadata.

Process incremental data using insert, update, and delete SQL statements in Athena

Athena is a serverless query engine that you can use to perform read, write, update, and optimization tasks against Iceberg tables. To demonstrate how the Apache Iceberg data lake format supports incremental data ingestion, we run insert, update, and delete SQL statements on the data lake.

Navigate to the Athena console and choose Query editor. If this is your first time using the Athena query editor, you need to configure the query result location to be the S3 bucket you created earlier. You should be able to see that the table reviews.all_reviews is available for querying. Run the following query to verify that you have loaded the Iceberg table successfully:

select * from reviews.all_reviews limit 5;

Process incremental data by running insert, update, and delete SQL statements:

//Example update statement
update reviews.all_reviews set star_rating=5 where product_category = 'Watches' and star_rating=4

//Example delete statement
delete from reviews.all_reviews where product_category = 'Watches' and star_rating=1

Performance tuning

In this section, we walk through different ways to improve Apache Iceberg read and write performance.

Configure Apache Iceberg table properties

Apache Iceberg is a table format, and it supports table properties to configure table behavior such as read, write, and catalog. You can improve the read and write performance on Iceberg tables by adjusting the table properties.

For example, if you notice that you write too many small files for an Iceberg table, you can config the write file size to write fewer but bigger size files, to help improve query performance.

Property Default Description
write.target-file-size-bytes 536870912 (512 MB) Controls the size of files generated to target about this many bytes

Use the following code to alter the table format:

//Example code to alter table format in EMR Studio Workspace notebook
spark.sql("ALTER TABLE demo.reviews.all_reviews 
SET TBLPROPERTIES ('write_target_data_file_size_bytes'='536870912')")

Partitioning and sorting

To make a query run fast, the less data read the better. Iceberg takes advantage of the rich metadata it captures at write time and facilitates techniques such as scan planning, partitioning, pruning, and column-level stats such as min/max values to skip data files that don’t have match records. We walk you through how query scan planning and partitioning work in Iceberg and how we use them to improve query performance.

Query scan planning

For a given query, the first step in a query engine is scan planning, which is the process to find the files in a table needed for a query. Planning in an Iceberg table is very efficient, because Iceberg’s rich metadata can be used to prune metadata files that aren’t needed, in addition to filtering data files that don’t contain matching data. In our tests, we observed Athena scanned 50% or less data for a given query on an Iceberg table compared to original data before conversion to Iceberg format.

There are two types of filtering:

  • Metadata filtering – Iceberg uses two levels of metadata to track the files in a snapshot: the manifest list and manifest files. It first uses the manifest list, which acts as an index of the manifest files. During planning, Iceberg filters manifests using the partition value range in the manifest list without reading all the manifest files. Then it uses selected manifest files to get data files.
  • Data filtering – After selecting the list of manifest files, Iceberg uses the partition data and column-level stats for each data file stored in manifest files to filter data files. During planning, query predicates are converted to predicates on the partition data and applied first to filter data files. Then, the column stats like column-level value counts, null counts, lower bounds, and upper bounds are used to filter out data files that can’t match the query predicate. By using upper and lower bounds to filter data files at planning time, Iceberg greatly improves query performance.

Partitioning and sorting

Partitioning is a way to group records with the same key column values together in writing. The benefit of partitioning is faster queries that access only part of the data, as explained earlier in query scan planning: data filtering. Iceberg makes partitioning simple by supporting hidden partitioning, in the way that Iceberg produces partition values by taking a column value and optionally transforming it.

In our use case, we first run the following query on the Iceberg table not partitioned. Then we partition the Iceberg table by the category of the reviews, which will be used in the query WHERE condition to filter out records. With partitioning, the query could scan much less data. See the following code:

//Example code in EMR Studio Workspace notebook to create an Iceberg table all_reviews_partitioned partitioned by product_category
reviews_all.writeTo("demo.reviews.all_reviews_partitioned").tableProperty("format-version", "2").partitionedBy($"product_category").createOrReplace()

Run the following select statement on the non-partitioned all_reviews table vs. the partitioned table to see the performance difference:

//Run this query on all_reviews table and the partitioned table for performance testing
select marketplace,customer_id, review_id,product_id,product_title,star_rating from reviews.all_reviews where product_category = 'Watches' and review_date between date('2005-01-01') and date('2005-03-31')

//Run the same select query on partitioned dataset
select marketplace,customer_id, review_id,product_id,product_title,star_rating from reviews.all_reviews_partitioned where product_category = 'Watches' and review_date between date('2005-01-01') and date('2005-03-31')

The following table shows the performance improvement of data partitioning, with about 50% performance improvement and 70% less data scanned.

Dataset Name Non-Partitioned Dataset Partitioned Dataset
Runtime (seconds) 8.20 4.25
Data Scanned (MB) 131.55 33.79

Note that the runtime is the average runtime with multiple runs in our test.

We saw good performance improvement after partitioning. However, this can be further improved by using column-level stats from Iceberg manifest files. In order to use the column-level stats effectively, you want to further sort your records based on the query patterns. Sorting the whole dataset using the columns that are often used in queries will reorder the data in such a way that each data file ends up with a unique range of values for the specific columns. If these columns are used in the query condition, it allows query engines to further skip data files, thereby enabling even faster queries.

Copy-on-write vs. read-on-merge

When implementing update and delete on Iceberg tables in the data lake, there are two approaches defined by the Iceberg table properties:

  • Copy-on-write – With this approach, when there are changes to the Iceberg table, either updates or deletes, the data files associated with the impacted records will be duplicated and updated. The records will be either updated or deleted from the duplicated data files. A new snapshot of the Iceberg table will be created and pointing to the newer version of data files. This makes the overall writes slower. There might be situations that concurrent writes are needed with conflicts so retry has to happen, which increases the write time even more. On the other hand, when reading the data, there is no extra process needed. The query will retrieve data from the latest version of data files.
  • Merge-on-read – With this approach, when there are updates or deletes on the Iceberg table, the existing data files will not be rewritten; instead new delete files will be created to track the changes. For deletes, a new delete file will be created with the deleted records. When reading the Iceberg table, the delete file will be applied to the retrieved data to filter out the delete records. For updates, a new delete file will be created to mark the updated records as deleted. Then a new file will be created for those records but with updated values. When reading the Iceberg table, both the delete and new files will be applied to the retrieved data to reflect the latest changes and produce the correct results. So, for any subsequent queries, an extra step to merge the data files with the delete and new files will happen, which will usually increase the query time. On the other hand, the writes might be faster because there is no need to rewrite the existing data files.

To test the impact of the two approaches, you can run the following code to set the Iceberg table properties:

//Run code to alter Iceberg table property to set copy-on-write and merge-on-read in EMR Studio Workspace notebook
spark.sql(“ALTER TABLE demo.reviews.all_reviews 
SET TBLPROPERTIES (‘write.delete.mode’=’copy-on-write’,’write.update.mode’=’copy-on-write’)”)

Run the update, delete, and select SQL statements in Athena to show the runtime difference for copy-on-write vs. merge-on-read:

//Example update statement
update reviews.all_reviews set star_rating=5 where product_category = ‘Watches’ and star_rating=4

//Example delete statement
delete from reviews.all_reviews where product_category = ‘Watches’ and star_rating=1

//Example select statement
select marketplace,customer_id, review_id,product_id,product_title,star_rating from reviews.all_reviews where product_category = ‘Watches’ and review_date between date(‘2005-01-01’) and date(‘2005-03-31’)

The following table summarizes the query runtimes.

Query Copy-on-Write Merge-on-Read
UPDATE DELETE SELECT UPDATE DELETE SELECT
Runtime (seconds) 66.251 116.174 97.75 10.788 54.941 113.44
Data scanned (MB) 494.06 3.07 137.16 494.06 3.07 137.16

Note that the runtime is the average runtime with multiple runs in our test.

As our test results show, there are always trade-offs in the two approaches. Which approach to use depends on your use cases. In summary, the considerations come down to latency on the read vs. write. You can reference the following table and make the right choice.

. Copy-on-Write Merge-on-Read
Pros Faster reads Faster writes
Cons Expensive writes Higher latency on reads
When to use Good for frequent reads, infrequent updates and deletes or large batch updates Good for tables with frequent updates and deletes

Data compaction

If your data file size is small, you might end up with thousands or millions of files in an Iceberg table. This dramatically increases the I/O operation and slows down the queries. Furthermore, Iceberg tracks each data file in a dataset. More data files lead to more metadata. This in turn increases the overhead and I/O operation on reading metadata files. In order to improve the query performance, it’s recommended to compact small data files to larger data files.

When updating and deleting records in Iceberg table, if the read-on-merge approach is used, you might end up with many small deletes or new data files. Running compaction will combine all these files and create a newer version of the data file. This eliminates the need to reconcile them during reads. It’s recommended to have regular compaction jobs to impact reads as little as possible while still maintaining faster write speed.

Run the following data compaction command, then run the select query from Athena:

//Data compaction 
optimize reviews.all_reviews REWRITE DATA USING BIN_PACK

//Run this query before and after data compaction
select marketplace,customer_id, review_id,product_id,product_title,star_rating from reviews.all_reviews where product_category = 'Watches' and review_date between date('2005-01-01') and date('2005-03-31')

The following table compares the runtime before vs. after data compaction. You can see about 40% performance improvement.

Query Before Data Compaction After Data Compaction
Runtime (seconds) 97.75 32.676 seconds
Data scanned (MB) 137.16 M 189.19 M

Note that the select queries ran on the all_reviews table after update and delete operations, before and after data compaction. The runtime is the average runtime with multiple runs in our test.

Clean up

After you follow the solution walkthrough to perform the use cases, complete the following steps to clean up your resources and avoid further costs:

  1. Drop the AWS Glue tables and database from Athena or run the following code in your notebook:
// DROP the table 
spark.sql("DROP TABLE demo.reviews.all_reviews") 
spark.sql("DROP TABLE demo.reviews.all_reviews_partitioned") 

// DROP the database 
spark.sql("DROP DATABASE demo.reviews")
  1. On the EMR Studio console, choose Workspaces in the navigation pane.
  2. Select the Workspace you created and choose Delete.
  3. On the EMR console, navigate to the Studios page.
  4. Select the Studio you created and choose Delete.
  5. On the EMR console, choose Clusters in the navigation pane.
  6. Select the cluster and choose Terminate.
  7. Delete the S3 bucket and any other resources that you created as part of the prerequisites for this post.

Conclusion

In this post, we introduced the Apache Iceberg framework and how it helps resolve some of the challenges we have in a modern data lake. Then we walked you though a solution to process incremental data in a data lake using Apache Iceberg. Finally, we had a deep dive into performance tuning to improve read and write performance for our use cases.

We hope this post provides some useful information for you to decide whether you want to adopt Apache Iceberg in your data lake solution.


About the Authors

Flora Wu is a Sr. Resident Architect at AWS Data Lab. She helps enterprise customers create data analytics strategies and build solutions to accelerate their businesses outcomes. In her spare time, she enjoys playing tennis, dancing salsa, and traveling.

Daniel Li is a Sr. Solutions Architect at Amazon Web Services. He focuses on helping customers develop, adopt, and implement cloud services and strategy. When not working, he likes spending time outdoors with his family.

Three ways to boost your email security and brand reputation with AWS

Post Syndicated from Michael Davie original https://aws.amazon.com/blogs/security/three-ways-to-boost-your-email-security-and-brand-reputation-with-aws/

If you own a domain that you use for email, you want to maintain the reputation and goodwill of your domain’s brand. Several industry-standard mechanisms can help prevent your domain from being used as part of a phishing attack. In this post, we’ll show you how to deploy three of these mechanisms, which visually authenticate emails sent from your domain to users and verify that emails are encrypted in transit. It can take as little as 15 minutes to deploy these mechanisms on Amazon Web Services (AWS), and the result can help to provide immediate and long-term improvements to your organization’s email security.

Phishing through email remains one of the most common ways that bad actors try to compromise computer systems. Incidents of phishing and related crimes far outnumber the incidents of other categories of internet crime, according to the most recent FBI Internet Crime Report. Phishing has consistently led to large annual financial losses in the US and globally.

Overview of BIMI, MTA-STS, and TLS reporting

An earlier post has covered how you can use Amazon Simple Email Service (Amazon SES) to send emails that align with best practices, including the IETF internet standards: Sender Policy Framework (SPF), DomainKeys Identified Mail (DKIM), and Domain-based Message Authentication, Reporting, and Conformance (DMARC). This post will show you how to build on this foundation and configure your domains to align with additional email security standards, including the following:

  • Brand Indicators for Message Identification (BIMI) – This standard allows you to associate a logo with your email domain, which some email clients will display to users in their inbox. Visit the BIMI Group’s Where is my BIMI Logo Displayed? webpage to see how logos are displayed in the user interfaces of BIMI-supporting mailbox providers; Figure 1 shows a mock-up of a typical layout that contains a logo.
  • Mail Transfer Agent Strict Transport Security (MTA-STS) – This standard helps ensure that email servers always use TLS encryption and certificate-based authentication when they send messages to your domain, to protect the confidentiality and integrity of email in transit.
  • SMTP TLS reporting – This reporting allows you to receive reports and monitor your domain’s TLS security posture, identify problems, and learn about attacks that might be occurring.
Figure 1: A mock-up of how BIMI enables branded logos to be displayed in email user interfaces

Figure 1: A mock-up of how BIMI enables branded logos to be displayed in email user interfaces

These three standards require your Domain Name System (DNS) to publish specific records, for example by using Amazon Route 53, that point to web pages that have additional information. You can host this information without having to maintain a web server by storing it in Amazon Simple Storage Service (Amazon S3) and delivering it through Amazon CloudFront, secured with a certificate provisioned from AWS Certificate Manager (ACM).

Note: This AWS solution works for DKIM, BIMI, and DMARC, regardless of what you use to serve the actual email for your domains, which services you use to send email, and where you host DNS. For purposes of clarity, this post assumes that you are using Route 53 for DNS. If you use a different DNS hosting provider, you will manually configure DNS records in your existing hosting provider.

Solution architecture

The architecture for this solution is depicted in Figure 2.

Figure 2: The architecture diagram showing how the solution components interact

Figure 2: The architecture diagram showing how the solution components interact

The interaction points are as follows:

  1. The web content is stored in an S3 bucket, and CloudFront has access to this bucket through an origin access identity, a mechanism of AWS Identity and Access Management (IAM).
  2. As described in more detail in the BIMI section of this blog post, the Verified Mark Certificate is obtained from a BIMI-qualified certificate authority and stored in the S3 bucket.
  3. When an external email system receives a message claiming to be from your domain, it looks up BIMI records for your domain in DNS. As depicted in the diagram, a DNS request is sent to Route 53.
  4. To retrieve the BIMI logo image and Verified Mark Certificate, the external email system will make HTTPS requests to a URL published in the BIMI DNS record. In this solution, the URL points to the CloudFront distribution, which has a TLS certificate provisioned with ACM.

A few important warnings

Email is a complex system of interoperating technologies. It is also brittle: a typo or a missing DNS record can make the difference between whether an email is delivered or not. Pay close attention to your email server and the users of your email systems when implementing the solution in this blog post. The main indicator that something is wrong is the absence of email. Instead of seeing an error in your email server’s log, users will tell you that they’re expecting to receive an email from somewhere and it’s not arriving. Or they will tell you that they sent an email, and their recipient can’t find it.

The DNS uses a lot of caching and time-out values to improve its efficiency. That makes DNS records slow and a little unpredictable as they propagate across the internet. So keep in mind that as you monitor your systems, it can be hours or even more than a day before the DNS record changes have an effect that you can detect.

This solution uses AWS Cloud Development Kit (CDK) custom resources, which are supported by AWS Lambda functions that will be created as part of the deployment. These functions are configured to use CDK-selected runtimes, which will eventually pass out of support and require you to update them.

Prerequisites

You will need permission in an AWS account to create and configure the following resources:

  • An Amazon S3 bucket to store the files and access logs
  • A CloudFront distribution to publicly deliver the files from the S3 bucket
  • A TLS certificate in ACM
  • An origin access identity in IAM that CloudFront will use to access files in Amazon S3
  • Lambda functions, IAM roles, and IAM policies created by CDK custom resources

You might also want to enable these optional services:

  • Amazon Route 53 for setting the necessary DNS records. If your domain is hosted by another DNS provider, you will set these DNS records manually.
  • Amazon SES or an Amazon WorkMail organization with a single mailbox. You can configure either service with a subdomain (for example, [email protected]) such that the existing domain is not disrupted, or you can create new email addresses by using your existing email mailbox provider.

BIMI has some additional requirements:

  • BIMI requires an email domain to have implemented a strong DMARC policy so that recipients can be confident in the authenticity of the branded logos. Your email domain must have a DMARC policy of p=quarantine or p=reject. Additionally, the domain’s policy cannot have sp=none or pct<100.

    Note: Do not adjust the DMARC policy of your domain without careful testing, because this can disrupt mail delivery.

  • You must have your brand’s logo in Scaled Vector Graphics (SVG) format that conforms to the BIMI standard. For more information, see Creating BIMI SVG Logo Files on the BIMI Group website.
  • Purchase a Verified Mark Certificate (VMC) issued by a third-party certificate authority. This certificate attests that the logo, organization, and domain are associated with each other, based on a legal trademark registration. Many email hosting providers require this additional certificate before they will show your branded logo to their users. Others do not currently support BIMI, and others might have alternative mechanisms to determine whether to show your logo. For more information about purchasing a Verified Mark Certificate, see the BIMI Group website.

    Note: If you are not ready to purchase a VMC, you can deploy this solution and validate that BIMI is correctly configured for your domain, but your branded logo will not display to recipients at major email providers.

What gets deployed in this solution?

This solution deploys the DNS records and supporting files that are required to implement BIMI, MTA-STS, and SMTP TLS reporting for an email domain. We’ll look at the deployment in more detail in the following sections.

BIMI

BIMI is described by the Internet Engineering Task Force (IETF) as follows:

Brand Indicators for Message Identification (BIMI) permits Domain Owners to coordinate with Mail User Agents (MUAs) to display brand-specific Indicators next to properly authenticated messages. There are two aspects of BIMI coordination: a scalable mechanism for Domain Owners to publish their desired Indicators, and a mechanism for Mail Transfer Agents (MTAs) to verify the authenticity of the Indicator. This document specifies how Domain Owners communicate their desired Indicators through the BIMI Assertion Record in DNS and how that record is to be interpreted by MTAs and MUAs. MUAs and mail-receiving organizations are free to define their own policies for making use of BIMI data and for Indicator display as they see fit.

If your organization has a trademark-protected logo, you can set up BIMI to have that logo displayed to recipients in their email inboxes. This can have a positive impact on your brand and indicates to end users that your email is more trustworthy. The BIMI Group shows examples of how brand logos are displayed in user inboxes, as well as a list of known email service providers that support the display of BIMI logos.

As a domain owner, you can implement BIMI by publishing the relevant DNS records and hosting the relevant files. To have your logo displayed by most email hosting providers, you will need to purchase a Verified Mark Certificate from a BIMI-qualified certificate authority.

This solution will deploy a valid BIMI record in Route 53 (or tell you what to publish in the DNS if you’re not using Route 53) and will store your provided SVG logo and Verified Mark Certificate files in Amazon S3, to be delivered through CloudFront with a valid TLS certificate from ACM.

To support BIMI, the solution makes the following changes to your resources:

  1. A DNS record of type TXT is published at the following host:
    default._bimi.<your-domain>. The value of this record is: v=BIMI1; l=<url-of-your-logo> a=<url-of-verified-mark-certificate>. The value of <your-domain> refers to the domain that is used in the From header of messages that your organization sends.
  2. The logo and optional Verified Mark Certificate are hosted publicly at the HTTPS locations defined by <url-of-your-logo> and <url-of-verified-mark-certificate>, respectively.

MTA-STS

MTA-STS is described by the IETF in RFC 8461 as follows:

SMTP (Simple Mail Transport Protocol) MTA Strict Transport Security (MTA-STS) is a mechanism enabling mail service providers to declare their ability to receive Transport Layer Security (TLS) secure SMTP connections and to specify whether sending SMTP servers should refuse to deliver to MX hosts that do not offer TLS with a trusted server certificate.

Put simply, MTA-STS helps ensure that email servers always use encryption and certificate-based authentication when sending email to your domains, so that message integrity and confidentiality are preserved while in transit across the internet. MTA-STS also helps to ensure that messages are only sent to authorized servers.

This solution will deploy a valid MTA-STS policy record in Route 53 (or tell you what value to publish in the DNS if you’re not using Route 53) and will create an MTA-STS policy document to be hosted on S3 and delivered through CloudFront with a valid TLS certificate from ACM.

To support MTA-STS, the solution makes the following changes to your resources:

  1. A DNS record of type TXT is published at the following host: _mta-sts.<your-domain>. The value of this record is: v=STSv1; id=<unique value used for cache invalidation>.
  2. The MTA-STS policy document is hosted at and obtained from the following location: https://mta-sts.<your-domain>/.well-known/mta-sts.txt.
  3. The value of <your-domain> in both cases is the domain that is used for routing inbound mail to your organization and is typically the same domain that is used in the From header of messages that your organization sends externally. Depending on the complexity of your organization, you might receive inbound mail for multiple domains, and you might choose to publish MTA-STS policies for each domain.

Is it ever bad to encrypt everything?

In the example MTA-STS policy file provided in the GitHub repository and explained later in this post, the MTA-STS policy mode is set to testing. This means that your email server is advertising its willingness to negotiate encrypted email connections, but it does not require TLS. Servers that want to send mail to you are allowed to connect and deliver mail even if there are problems in the TLS connection, as long as you’re in testing mode. You should expect reports when servers try to connect through TLS to your mail server and fail to do so.

Be fully prepared before you change the MTA-STS policy to enforce. After this policy is set to enforce, servers that follow the MTA-STS policy and that experience an enforceable TLS-related error when they try to connect to your mail server will not deliver mail to your mail server. This is a difficult situation to detect. You will simply stop receiving email from servers that comply with the policy. You might receive reports from them indicating what errors they encountered, but it is not guaranteed. Be sure that the email address you provide in SMTP TLS reporting (in the following section) is functional and monitored by people who can take action to fix issues. If you miss TLS failure reports, you probably won’t receive email. If the TLS certificate that you use on your email server expires, and your MTA-STS policy is set to enforce, this will become an urgent issue and will disrupt the flow of email until it is fixed.

SMTP TLS reporting

SMTP TLS reporting is described by the IETF in RFC 8460 as follows:

A number of protocols exist for establishing encrypted channels between SMTP Mail Transfer Agents (MTAs), including STARTTLS, DNS-Based Authentication of Named Entities (DANE) TLSA, and MTA Strict Transport Security (MTA-STS). These protocols can fail due to misconfiguration or active attack, leading to undelivered messages or delivery over unencrypted or unauthenticated channels. This document describes a reporting mechanism and format by which sending systems can share statistics and specific information about potential failures with recipient domains. Recipient domains can then use this information to both detect potential attacks and diagnose unintentional misconfigurations.

As you gain the security benefits of MTA-STS, SMTP TLS reporting will allow you to receive reports from other internet email providers. These reports contain information that is valuable when monitoring your TLS security posture, identifying problems, and learning about attacks that might be occurring.

This solution will deploy a valid SMTP TLS reporting record on Route 53 (or provide you with the value to publish in the DNS if you are not using Route 53).

To support SMTP TLS reporting, the solution makes the following changes to your resources:

  1. A DNS record of type TXT is published at the following host: _smtp._tls.<your-domain>. The value of this record is: v=TLSRPTv1; rua=mailto:<report-receiver-email-address>
  2. The value of <report-receiver-email-address> might be an address in your domain or in a third-party provider. Automated systems that process these reports must be capable of processing GZIP compressed files and parsing JSON.

Deploy the solution with the AWS CDK

In this section, you’ll learn how to deploy the solution to create the previously described AWS resources in your account.

  1. Clone the following GitHub repository:

    git clone https://github.com/aws-samples/serverless-mail
    cd serverless-mail/email-security-records

  2. Edit CONFIG.py to reflect your desired settings, as follows:
    1. If no Verified Mark Certificate is provided, set VMC_FILENAME = None.
    2. If your DNS zone is not hosted on Route 53, or if you do not want this app to manage Route 53 DNS records, set ROUTE_53_HOSTED = False. In this case, you will need to set TLS_CERTIFICATE_ARN to the Amazon Resource Name (ARN) of a certificate hosted on ACM in us-east-1. This certificate is used by CloudFront and must support two subdomains: mta-sts and your configured BIMI_ASSET_SUBDOMAIN.
  3. Finalize the preparation, as follows:
    1. Place your BIMI logo and Verified Mark Certificate files in the assets folder.
    2. Create an MTA-STS policy file at assets/.well-known/mta-sts.txt to reflect your mail exchange (MX) servers and policy requirements. An example file is provided at assets/.well-known/mta-sts.txt.example
  4. Deploy the solution, as follows:
    1. Open a terminal in the email-security-records folder.
    2. (Recommended) Create and activate a virtual environment by running the following commands.
      python3 -m venv .venv
      source .venv/bin/activate
    3. Install the Python requirements in your environment with the following command.
      pip install -r requirements.txt
    4. Assume a role in the target account that has the permissions outlined in the Prerequisites section of this post.

      Using AWS CDK version 2.17.0 or later, deploy the bootstrap in the target account by running the following command. To learn more, see Bootstrapping in the AWS CDK Developer Guide.
      cdk bootstrap

    5. Run the following command to synthesize the CloudFormation template. Review the output of this command to verify what will be deployed.
      cdk synth
    6. Run the following command to deploy the CloudFormation template. You will be prompted to accept the IAM changes that will be applied to your account.
      cdk deploy

      Note: If you use Route53, these records are created and activated in your DNS zones as soon as the CDK finishes deploying. As the records propagate through the DNS, they will gradually start affecting the email in the affected domains.

    7. If you’re not using Route53 and instead are using a third-party DNS provider, create the CNAME and TXT records as indicated. In this case, your email is not affected by this solution until you create the records in DNS.

Testing and troubleshooting

After you have deployed the CDK solution, you can test it to confirm that the DNS records and web resources are published correctly.

BIMI

  1. Query the BIMI DNS TXT record for your domain by using the dig or nslookup command in your terminal.

    dig +short TXT default._bimi.<your-domain.example>

    Verify the response. For example:

    "v=BIMI1; l=https://bimi-assets.<your-domain.example>/logo.svg"

  2. In your web browser, open the URL from that response (for example, https://bimi-assets.<your-domain.example>/logo.svg) to verify that the logo is available and that the HTTPS certificate is valid.
  3. The BIMI group provides a tool to validate your BIMI configuration. This tool will also validate your VMC if you have purchased one.

MTA-STS

  1. Query the MTA-STS DNS TXT record for your domain.

    dig +short TXT _mta-sts.<your-domain.example>

    The value of this record is as follows:

    v=STSv1; id=<unique value used for cache invalidation>

  2. You can load the MTA-STS policy document using your web browser. For example, https://mta-sts.<your-domain.example>/.well-known/mta-sts.txt
  3. You can also use third party tools to examine your MTA-STS configuration, such as MX Toolbox.

TLS reporting

  1. Query the TLS reporting DNS TXT record for your domain.

    dig +short TXT _smtp._tls.<your-domain.example>

    Verify the response. For example:

    "v=TLSRPTv1; rua=mailto:<your email address>"

  2. You can also use third party tools to examine your TLS reporting configuration, such as Easy DMARC.

Depending on which domains you communicate with on the internet, you will begin to see TLS reports arriving at the email address that you have defined in the TLS reporting DNS record. We recommend that you closely examine the TLS reports, and use automated analytical techniques over an extended period of time before changing the default testing value of your domain’s MTA-STS policy. Not every email provider will send TLS reports, but examining the reports in aggregate will give you a good perspective for making changes to your MTA-STS policy.

Cleanup

To remove the resources created by this solution:

  1. Open a terminal in the cdk-email-security-records folder.
  2. Assume a role in the target account with permission to delete resources.
  3. Run cdk destroy.

Note: The asset and log buckets are automatically emptied and deleted by the cdk destroy command.

Conclusion

When external systems send email to or receive email from your domains they will now query your new DNS records and will look up your domain’s BIMI, MTA-STS, and TLS reporting information from your new CloudFront distribution. By adopting the email domain security mechanisms outlined in this post, you can improve the overall security posture of your email environment, as well as the perception of your brand.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Michael Davie

Michael Davie

Michael is a Senior Industry Specialist with AWS Security Assurance. He works with our customers, their regulators, and AWS teams to help raise the bar on secure cloud adoption and usage. Michael has over 20 years of experience working in the defence, intelligence, and technology sectors in Canada and is a licensed professional engineer.

Jesse Thompson

Jesse Thompson

Jesse is an Email Deliverability Manager with the Amazon Simple Email Service team. His background is in enterprise IT development and operations, with a focus on email abuse mitigation and encouragement of authenticity practices with open standard protocols. Jesse’s favorite activity outside of technology is recreational curling.

Build a semantic search engine for tabular columns with Transformers and Amazon OpenSearch Service

Post Syndicated from Kachi Odoemene original https://aws.amazon.com/blogs/big-data/build-a-semantic-search-engine-for-tabular-columns-with-transformers-and-amazon-opensearch-service/

Finding similar columns in a data lake has important applications in data cleaning and annotation, schema matching, data discovery, and analytics across multiple data sources. The inability to accurately find and analyze data from disparate sources represents a potential efficiency killer for everyone from data scientists, medical researchers, academics, to financial and government analysts.

Conventional solutions involve lexical keyword search or regular expression matching, which are susceptible to data quality issues such as absent column names or different column naming conventions across diverse datasets (for example, zip_code, zcode, postalcode).

In this post, we demonstrate a solution for searching for similar columns based on column name, column content, or both. The solution uses approximate nearest neighbors algorithms available in Amazon OpenSearch Service to search for semantically similar columns. To facilitate the search, we create features representations (embeddings) for individual columns in the data lake using pre-trained Transformer models from the sentence-transformers library in Amazon SageMaker. Finally, to interact with and visualize results from our solution, we build an interactive Streamlit web application running on AWS Fargate.

We include a code tutorial for you to deploy the resources to run the solution on sample data or your own data.

Solution overview

The following architecture diagram illustrates the two-stage workflow for finding semantically similar columns. The first stage runs an AWS Step Functions workflow that creates embeddings from tabular columns and builds the OpenSearch Service search index. The second stage, or the online inference stage, runs a Streamlit application through Fargate. The web application collects input search queries and retrieves from the OpenSearch Service index the approximate k-most-similar columns to the query.

Solution architecture

Figure 1. Solution architecture

The automated workflow proceeds in the following steps:

  1. The user uploads tabular datasets into an Amazon Simple Storage Service (Amazon S3) bucket, which invokes an AWS Lambda function that initiates the Step Functions workflow.
  2. The workflow begins with an AWS Glue job that converts the CSV files into Apache Parquet data format.
  3. A SageMaker Processing job creates embeddings for each column using pre-trained models or custom column embedding models. The SageMaker Processing job saves the column embeddings for each table in Amazon S3.
  4. A Lambda function creates the OpenSearch Service domain and cluster to index the column embeddings produced in the previous step.
  5. Finally, an interactive Streamlit web application is deployed with Fargate. The web application provides an interface for the user to input queries to search the OpenSearch Service domain for similar columns.

You can download the code tutorial from GitHub to try this solution on sample data or your own data. Instructions on the how to deploy the required resources for this tutorial are available on Github.

Prerequistes

To implement this solution, you need the following:

  • An AWS account.
  • Basic familiarity with AWS services such as the AWS Cloud Development Kit (AWS CDK), Lambda, OpenSearch Service, and SageMaker Processing.
  • A tabular dataset to create the search index. You can bring your own tabular data or download the sample datasets on GitHub.

Build a search index

The first stage builds the column search engine index. The following figure illustrates the Step Functions workflow that runs this stage.

Step functions workflow

Figure 2 – Step functions workflow – multiple embedding models

Datasets

In this post, we build a search index to include over 400 columns from over 25 tabular datasets. The datasets originate from the following public sources:

For the the full list of the tables included in the index, see the code tutorial on GitHub.

You can bring your own tabular dataset to augment the sample data or build your own search index. We include two Lambda functions that initiate the Step Functions workflow to build the search index for individual CSV files or a batch of CSV files, respectively.

Transform CSV to Parquet

Raw CSV files are converted to Parquet data format with AWS Glue. Parquet is a column-oriented format file format preferred in big data analytics that provides efficient compression and encoding. In our experiments, the Parquet data format offered significant reduction in storage size compared to raw CSV files. We also used Parquet as a common data format to convert other data formats (for example JSON and NDJSON) because it supports advanced nested data structures.

Create tabular column embeddings

To extract embeddings for individual table columns in the sample tabular datasets in this post, we use the following pre-trained models from the sentence-transformers library. For additional models, see Pretrained Models.

Model name Dimension Size (MB)
all-MiniLM-L6-v2 384 80
all-distilroberta-v1 768 290
average_word_embeddings_glove.6B.300d 300 420

The SageMaker Processing job runs create_embeddings.py(code) for a single model. For extracting embeddings from multiple models, the workflow runs parallel SageMaker Processing jobs as shown in the Step Functions workflow. We use the model to create two sets of embeddings:

  • column_name_embeddings – Embeddings of column names (headers)
  • column_content_embeddings – Average embedding of all the rows in the column

For more information about the column embedding process, see the code tutorial on GitHub.

An alternative to the SageMaker Processing step is to create a SageMaker batch transform to get column embeddings on large datasets. This would require deploying the model to a SageMaker endpoint. For more information, see Use Batch Transform.

Index embeddings with OpenSearch Service

In the final step of this stage, a Lambda function adds the column embeddings to a OpenSearch Service approximate k-Nearest-Neighbor (kNN) search index. Each model is assigned its own search index. For more information about the approximate kNN search index parameters, see k-NN.

Online inference and semantic search with a web app

The second stage of the workflow runs a Streamlit web application where you can provide inputs and search for semantically similar columns indexed in OpenSearch Service. The application layer uses an Application Load Balancer, Fargate, and Lambda. The application infrastructure is automatically deployed as part of the solution.

The application allows you to provide an input and search for semantically similar column names, column content, or both. Additionally, you can select the embedding model and number of nearest neighbors to return from the search. The application receives inputs, embeds the input with the specified model, and uses kNN search in OpenSearch Service to search indexed column embeddings and find the most similar columns to the given input. The search results displayed include the table names, column names, and similarity scores for the columns identified, as well as the locations of the data in Amazon S3 for further exploration.

The following figure shows an example of the web application. In this example, we searched for columns in our data lake that have similar Column Names (payload type) to district (payload). The application used all-MiniLM-L6-v2 as the embedding model and returned 10 (k) nearest neighbors from our OpenSearch Service index.

The application returned transit_district, city, borough, and location as the four most similar columns based on the data indexed in OpenSearch Service. This example demonstrates the ability of the search approach to identify semantically similar columns across datasets.

Web application user interface

Figure 3: Web application user interface

Clean up

To delete the resources created by the AWS CDK in this tutorial, run the following command:

cdk destroy --all

Conclusion

In this post, we presented an end-to-end workflow for building a semantic search engine for tabular columns.

Get started today on your own data with our code tutorial available on GitHub. If you’d like help accelerating your use of ML in your products and processes, please contact the Amazon Machine Learning Solutions Lab.


About the Authors

Kachi Odoemene is an Applied Scientist at AWS AI. He builds AI/ML solutions to solve business problems for AWS customers.

Taylor McNally is a Deep Learning Architect at Amazon Machine Learning Solutions Lab. He helps customers from various industries build solutions leveraging AI/ML on AWS. He enjoys a good cup of coffee, the outdoors, and time with his family and energetic dog.

Austin Welch is a Data Scientist in the Amazon ML Solutions Lab. He develops custom deep learning models to help AWS public sector customers accelerate their AI and cloud adoption. In his spare time, he enjoys reading, traveling, and jiu-jitsu.

Enhance operational insights for Amazon MSK using Amazon Managed Service for Prometheus and Amazon Managed Grafana

Post Syndicated from Anand Mandilwar original https://aws.amazon.com/blogs/big-data/enhance-operational-insights-for-amazon-msk-using-amazon-managed-service-for-prometheus-and-amazon-managed-grafana/

Amazon Managed Streaming for Apache Kafka (Amazon MSK) is an event streaming platform that you can use to build asynchronous applications by decoupling producers and consumers. Monitoring of different Amazon MSK metrics is critical for efficient operations of production workloads. Amazon MSK gathers Apache Kafka metrics and sends them to Amazon CloudWatch, where you can view them. You can also monitor Amazon MSK with Prometheus, an open-source monitoring application. Many of our customers use such open-source monitoring tools like Prometheus and Grafana, but doing it in self-managed environment comes with its own challenges regarding manageability, availability, and security.

In this post, we show how you can build an AWS Cloud native monitoring platform for Amazon MSK using the fully managed, highly available, scalable, and secure services Amazon Managed service for Prometheus and Amazon Managed Grafana for better operational insights.

Why is Kafka monitoring critical?

As a critical component of the IT infrastructure, it is necessary to track Amazon MSK clusters’ operations and their efficiencies. Amazon MSK metrics helps monitor critical tasks while operating applications. You can not only troubleshoot problems that have already occurred, but also discover anomalous behavior patterns and prevent problems from occurring in the first place.

Some customers currently use various third-party monitoring solutions like lenses.io, AppDynamics, Splunk, and others to monitor Amazon MSK operational metrics. In the context of cloud computing, customers are looking for an AWS Cloud native service that offers equivalent or better capabilities but with the added advantage of being highly scalable, available, secure, and fully managed.

Amazon MSK clusters emit a very large number of metrics via JMX, many of which can be useful for tuning the performance of your cluster, producers, and consumers. However, that large volume brings complexity with monitoring. By default, Amazon MSK clusters come with CloudWatch monitoring of your essential metrics. You can extend your monitoring capabilities by using open-source monitoring with Prometheus. This feature enables you to scrape a Prometheus friendly API to gather all the JMX metrics and work with the data in Prometheus.

This solution provides a simple and easy observability platform for Amazon MSK along with much needed insights into various critical operational metrics that yields the following organizational benefits for your IT operations or application teams:

  • You can quickly drill down to various Amazon MSK components (broker level, topic level, or cluster level) and identify issues that need investigation
  • You can investigate Amazon MSK issues after the event using the historical data in Amazon Managed Service for Prometheus
  • You can shorten or eliminate long calls that waste time questioning business users on Amazon MSK issues

In this post, we set up Amazon Managed Service for Prometheus, Amazon Managed Grafana, and a Prometheus server running as container on Amazon Elastic Compute Cloud (Amazon EC2) to provide a fully managed monitoring solution for Amazon MSK.

The solution provides an easy-to-configure dashboard in Amazon Managed Grafana for various critical operation metrics, as demonstrated in the following video.

Solution overview

Amazon Managed Service for Prometheus reduces the heavy lifting required to get started with monitoring applications across Amazon MSK, Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Elastic Container Service (Amazon ECS), and AWS Fargate, as well as self-managed Kubernetes clusters. The service also seamlessly integrates with Amazon Managed Grafana to simplify data visualization, team management authentication, and authorization.

Grafana empowers you to create dashboards and alerts from multiple sources such as an Amazon Managed Prometheus workspace, CloudWatch, AWS X-Ray, Amazon OpenSearch Service, Amazon Redshift, and Amazon Timestream.

The following diagram demonstrates the solution architecture. This solution deploys a Prometheus server running as a container within Amazon EC2, which constantly scrapes metrics from the MSK brokers and remote write metrics to an Amazon Managed Service for Prometheus workspace. As of this writing, Amazon Managed Service for Prometheus is not able to scrape the metrics directly, therefore a Prometheus server is necessary to do so. We use Amazon Managed Grafana to query and visualize the operational metrics for the Amazon MSK platform.

Solution Architecture

The following are the high-level steps to deploy the solution:

  1. Create an EC2 key pair.
  2. Configure your Amazon MSK cluster and associated resources. We demonstrate how to configure an existing Amazon MSK cluster or create a new one.
    1. Option A:- Modify an existing Amazon MSK cluster
    2. Option B:- Create a new Amazon MSK cluster
  3. Enable AWS IAM Identity Center (successor to AWS Single Sign-On), if not enabled.
  4. Configure Amazon Managed Grafana and Amazon Managed Service for Prometheus.
  5. Configure Prometheus and start the service.
  6. Configure the data sources in Amazon Managed Grafana.
  7. Import the Grafana dashboard.

Prerequisites

git clone https://github.com/aws-samples/amazonmsk-managed-observability
  • You download three CloudFormation template files along with the Prometheus configuration file (prometheus.yml), targets.json file (you need this to update the MSK broker DNS later on), and three JSON files for creating a dashboard within Amazon Managed Grafana.
  • Make sure internet connection is allowed to download docker image of Prometheus from within Prometheus server

1. Create an EC2 key pair

To create your EC2 key pair, complete the following steps:

  1. On the Amazon EC2 console, under Network & Security in the navigation pane, choose Key Pairs.
  2. Choose Create key pair.
  3. For Name, enter DemoMSKKeyPair.
  4. For Key pair type¸ select RSA.
  5. For Private key file format, choose the format in which to save the private key:
    1. To save the private key in a format that can be used with OpenSSH, select .pem.
    2. To save the private key in a format that can be used with PuTTY, select .ppk.

Create EC2 Key Pair - DemoMSKKeyPair

The private key file is automatically downloaded by your browser. The base file name is the name that you specified as the name of your key pair, and the file name extension is determined by the file format that you chose.

  1. Save the private key file in a safe place.

2. Configure your Amazon MSK cluster and associated resources.

Using the following options to configure an existing Amazon MSK cluster or create a new one.

2.a Modify an existing Amazon MSK cluster

If you want to create a new Amazon MSK cluster for this solution, skip to the section – 2.b.Create a new Amazon MSK cluster, otherwise complete the steps in this section to modify an existing cluster.

Validate cluster monitoring settings

We must enable enhanced partition-level monitoring (available at an additional cost) and open monitoring with Prometheus. Note that open monitoring with Prometheus is only available for provisioned mode clusters.

  1. Sign in to the account where the Amazon MSK cluster is that you want to monitor.
  2. Open your Amazon MSK cluster.
  3. On the Properties tab, navigate to Monitoring metrics.
  4. Check the monitoring level for Amazon CloudWatch metrics for this cluster, and choose Edit to edit the cluster.
  5. Select Enhance partition-level monitoring.

Editing monitoring attributes of Amazon MSK Cluster

  1. Check the monitoring label for Open monitoring with Prometheus, and choose Edit to edit the cluster.
  2. Select Enable open monitoring for Prometheus.
  3. Under Prometheus exporters, select JMX Exporter and Note Exporter.

enable Open monitoring with Prometheus

  1. Under Broker log delivery, select Deliver to Amazon CloudWatch Logs.
  2. For Log group, enter your log group for Amazon MSK.
  3. Choose Save changes.

Deploy CloudFormation stack

Now we deploy the CloudFormation stack Prometheus_Cloudformation.yml that we downloaded earlier.

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  2. Choose Create stack.
  3. For Prepare template, select Template is ready.
  4. For Template source, select Upload a template.
  5. Upload the Prometheus_Cloudformation.yml file, then choose Next.

CloudFormation - Create Stack for Prometheus_Cloudformation.yml

  1. For Stack name, enter Prometheus.
  2. VPCID – Provide the VPC ID where your Amazon MSK cluster is deployed (mandatory)
  3. VPCCIdr – Provide the VPC CIDR where your Amazon MSK Cluster is deployed (mandatory)
  4. SubnetID – Provide any one of the subnets ID where your existing Amazon MSK cluster is deployed (mandatory)
  5. MSKClusterName – Provide the name your existing Amazon MSK Cluster
  6. Leave Cloud9InstanceType, KeyName, and LatestAmild as default.
  7. Choose Next.

CloudFormation - Specify stack details for Prometheus Stack

  1. On the Review page, select I acknowledge that AWS CloudFormation might create IAM resources.
  2. Choose Create stack.

You’re redirected to the AWS CloudFormation console, and can see the status as CREATE_IN_PROGRESSS. Wait until the status changes to COMPLETE.

CloudFormation stack creation status for Prometheus Stack

  1. On the stack’s Outputs tab, note the values for the following keys (if you don’t see anything under Outputs tab, click on refresh icon):
    1. PrometheusInstancePrivateIP
    2. PrometheusSecurityGroupId

getting private IP and Security Group ID for Prometheus Stack

Update the Amazon MSK cluster security group

Complete the following steps to update the security group of the existing Amazon MSK cluster to allow communication from the Kafka client and Prometheus server:

  1. On the Amazon MSK console, navigate to your Amazon MSK cluster.
  2. On the Properties tab, under Network settings, open the security group.
  3. Choose Edit inbound rules.
  4. Choose Add rule and create your rule with the following parameters:
    1. Type – Custom TCP
    2. Port range – 11001–11002
    3. Source – The Prometheus server security group ID

Set up your AWS Cloud9 environment

To configure your AWS Cloud9 environment, complete the following steps:

  1. On the AWS Cloud9 console, choose Environments in the navigation pane.
  2. Select Cloud9EC2Bastion and choose Open in Cloud9.

AWS Cloud9 home page

  1. Close the Welcome tab and open a new terminal tab
  2. Create an SSH key file with the contents from the private key file DemoMSKKeyPair using the following command:
    touch /home/ec2-user/environment/EC2KeyMSKDemo

  3. Run the following command to list the newly created key file
    ls -ltr

  4. Open the file, enter the contents of the private key file DemoMSKKeyPair, then save the file.

Updating the key file with the private file content

  1. Change the permissions of the file using the following command:
    chmod 600 /home/ec2-user/environment/EC2KeyMSKDemo

  2. Log in to the Prometheus server using this key file and the private IP noted earlier:
    ssh -i /home/ec2-user/environment/EC2KeyMSKDemo ec2-user@<Private-IP-of-Prometheus-instance>

  3. Once you’re logged in, check if the Docker service is up and running using the following command:
    systemctl status docker

checking docker service status

  1. To exit the server, enter exit and press Enter.

2.b Create a new Amazon MSK cluster

If you don’t have an Amazon MSK cluster running in your environment, or you don’t want to use an existing cluster for this solution, complete the steps in this section.

As part of these steps, your cluster will have the following properties:

Deploy CloudFormation stack

Complete the following steps to deploy the CloudFormation stack MSKResource_Cloudformation.yml:

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  2. Choose Create stack.
  3. For Prepare template, select Template is ready.
  4. For Template source, select Upload a template.
  5. Upload the MSKResource_Cloudformation.yml file, then choose Next.
  6. For Stack name, enter MSKDemo.
  7. Network Configuration – Generic (mandatory)
    1. Stack to be deployed in NEW VPC? (true/false) – if false, you MUST provide VPCCidr and other details under Existing VPC section (Default is true)
    2. VPCCidr – Default is 10.0.0.0/16 for a new VPC. You can have any valid values as per your environment. If deploying in an existing VPC, provide the CIDR for the same
  8. Network Configuration – For New VPC
    1. PrivateSubnetMSKOneCidr (Default is 10.0.1.0/24)
    2. PrivateSubnetMSKTwoCidr (Default is 10.0.2.0/24)
    3. PrivateSubnetMSKThreeCidr (Default is 10.0.3.0/24)
    4. PublicOneCidr (Default is 10.0.0.0/24)
  9. Network Configuration – For Existing VPC (You need at least 4 subnets)
    1. VpcId – Provide the value if you are using any existing VPC to deploy the resources else leave it blank(default)
    2. SubnetID1 – Any one of the existing subnets from the given VPCID
    3. SubnetID2 – Any one of the existing subnets from the given VPCID
    4. SubnetID3 – Any one of the existing subnets from the given VPCID
    5. PublicSubnetID – Any one of the existing subnets from the given VPCID
  10. Leave the remaining parameters as default and choose Next.

Specifying the parameter details for MSKDemo stack

  1. On the Review page, select I acknowledge that AWS CloudFormation might create IAM resources.
  2. Choose Create stack.

You’re redirected to the AWS CloudFormation console, and can see the status as CREATE_IN_PROGRESSS. Wait until the status changes to COMPLETE.

CloudFormation stack creation status

  1. On the stack’s Outputs tab, note the values for the following (if you don’t see anything under Outputs tab, click on refresh icon):
    1. KafkaClientPrivateIP
    2. PrometheusInstancePrivateIP

Set up your AWS Cloud9 environment

Follow the steps as outlined in the previous section to configure your AWS Cloud9 environment.

Retrieve the cluster broker list

To get your MSK cluster broker list, complete the following steps:

  1. On the Amazon MSK console, navigate to your cluster.
  2. In the Cluster summary section, choose View client information.
  3. In the Bootstrap servers section, copy the private endpoint.

You need this value to perform some operations later, such as creating an MSK topic, producing sample messages, and consuming those sample messages.

  1. Choose Done.
  2. On the Properties tab, in the Brokers details section, note the endpoints listed.

These need to be updated in the targets.json file (used for Prometheus configuration in a later step).

3. Enable IAM Identity Center

Before you deploy the CloudFormation stack for Amazon Managed Service for Prometheus and Amazon Managed Grafana, make sure to enable IAM Identity Center.

If you don’t use IAM Identity Center, alternatively, you can set up user authentication via SAML. For more information, refer to Using SAML with your Amazon Managed Grafana workspace.

If IAM Identity Center is currently enabled/configured in another region, you don’t need to enable in your current region.

Complete the following steps to enable IAM Identity Center:

  1. On the IAM Identity Center console, under Enable IAM Identity Center, choose Enable.

Enabling IAM Identity Center

  1. Choose Create AWS organization.

4. Configure Amazon Managed Grafana and Amazon Managed Service for Prometheus

Complete the steps in this section to set up Amazon Managed Service for Prometheus and Amazon Managed Grafana.

Deploy CloudFormation template

Complete the following steps to deploy the CloudFormation stack AMG_AMP_Cloudformation:

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  2. Choose Create stack.
  3. For Prepare template, select Template is ready.
  4. For Template source, select Upload a template.
  5. Upload the AMG_AMP_Cloudformation.yml file, then choose Next.
  6. For Stack name, enter ManagedPrometheusAndGrafanaStack, then choose Next.
  7. On the Review page, select I acknowledge that AWS CloudFormation might create IAM resources.
  8. Choose Create stack.

You’re redirected to the AWS CloudFormation console, and can see the status as CREATE_IN_PROGRESSS. Wait until the status changes to COMPLETE.

  1. On the stack’s Outputs tab, note the values for the following (if you don’t see anything under Outputs tab, click on refresh icon):
    1. GrafanaWorkspaceURL – This is Amazon Managed Grafana URL
    2. PrometheusEndpointWriteURL – This is the Amazon Managed Service for Prometheus write endpoint URL

Outputs information for ManagedPromethusAndGrafana Stack

Create a user for Amazon Managed Grafana

Complete the following steps to create a user for Amazon Managed Grafana:

  1. On the IAM Identity Center console, choose Users in the navigation pane.
  2. Choose Add user.
  3. For Username, enter grafana-admin.
  4. Enter and confirm your email address to receive a confirmation email.

IAM Identity Center - Specifying user information

  1. Skip the optional steps, then choose Add user.

A success message appears at the top of the console.

  1. In the confirmation email, choose Accept invitation and set your user password.

Screenshot showing Invitation link for the new user

  1. On the Amazon Managed Grafana console, choose Workspaces in the navigation pane.
  2. Open the workspace Amazon-Managed-Grafana.
  3. Make a note of the Grafana workspace URL.

You use this URL to log in to view your Grafana dashboards.

  1. On the Authentication tab, choose Assign new user or group.

Amazon Managed Grafana - Workspace information - Assigning user

  1. Select the user you created earlier and choose Assign users and groups.

Amazon Managed Grafana - Assigning user to workspace

  1. On the Action menu, choose what kind of user to make it: admin, editor, or viewer.

Note that your Grafana workspace needs as least one admin user.

  1. Navigate to the Grafana URL you copied earlier in your browser.
  2. Choose Sign in with AWS IAM Identity Center.

Amazon Managed Grafana URL landing page

  1. Log in with your IAM Identity Center credentials.

5. Configure Prometheus and start the service

When you cloned the GitHub repo, you downloaded two configuration files: prometheus.yml and targets.json. In this section, we configure these two files.

  1. Use any IDE (Visual Studio Code or Notepad++) to open prometheus.yml.
  2. In the remote_write section, update the remote write URL and Region.

prometheus.yml file

  1. Use any IDE to open targets.json.
  2. Update the targets with the broker endpoints you obtained earlier.

targets.json file

  1. In your AWS Cloud9 environment, choose File, then Upload Local Files.
  2. Choose Select Files and upload targets.json and prometheus.yml from your local machine.

Upload file dialog page within AWS Cloud9 Environment

  1. In the AWS Cloud9 environment, run the following command using the key file you created earlier:
    ssh -i /home/ec2-user/environment/EC2KeyMSKDemo ec2-user@<Private-IP-of-Prometheus instance> mkdir -p /home/ec2-user/Prometheus

  2. copy targets.json to the Prometheus server:
    scp -i /home/ec2-user/environment/EC2KeyMSKDemo targets.json ec2-user@<Private-IP-of-Prometheus-instance>:/home/ec2-user/prometheus/

  3. copy prometheus.yml to the Prometheus server:
    scp -i /home/ec2-user/environment/EC2KeyMSKDemo prometheus.yml ec2-user@<Private-IP-of-Prometheus-instance>:/home/ec2-user/prometheus/

  4. SSH into the Prometheus server and start the container service for Prometheus
    ssh -i /home/ec2-user/environment/EC2KeyMSKDemo ec2-user@<Private-IP-of-Prometheus instance>

  5. start the prometheus container
    sudo docker run -d -p 9090:9090 --name=prometheus -v /home/ec2-user/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml -v /home/ec2-user/prometheus/targets.json:/etc/prometheus/targets.json prom/prometheus --config.file=/etc/prometheus/prometheus.yml

  6. Check if the Docker service is running:

Start the prometheus container inside Prometheus Server

6. Configure data sources in Amazon Managed Grafana

To configure your data sources, complete the following steps:

  1. Log in to the Amazon Managed Grafana URL.
  2. Choose AWS Data Services in the navigation pane, then choose Data Sources.

Amazon Managed Grafana - Configuring data source

  1. For Service, choose Amazon Managed Service for Prometheus.
  2. For Region, choose your Region.

The correct resource ID is populated automatically.

  1. Select your resource ID and choose Add 1 data source.

Configuring data source for Amazon managed Prometheus

  1. Choose Go to settings.

Configuring data source for Amazon managed Prometheus

  1. For Name, enter Amazon Managed Prometheus and enable Default.

The URL is automatically populated.

  1. Leave everything else as default.

Configuring data source for Amazon managed Prometheus

  1. Choose Save & Test.

If everything is correct, the message Data source is working appears.

Now we configure CloudWatch as a data source.

  1. Choose AWS Data Services, then choose Data source.
  2. For Services, choose CloudWatch.
  3. For Region, choose your correct Region.
  4. Choose Add data source.
  5. Select the CloudWatch data source and choose Go to settings.

Configuring data source with Service as CloudWatch

  1. For Name, enter AmazonMSK-CloudWatch.
  2. Choose Save & Test.

7. Import the Grafana dashboard

You can use the following preconfigured dashboards, which are available to download from the GitHub repo:

  • Kafka Metrics
  • MSK Cluster Overview
  • AWS MSK – Kafka Cluster-CloudWatch

To import your dashboard, complete the following steps:

  1. In Amazon Managed Grafana, choose the plus sign in the navigation pane.
  2. Choose Import.

Importing preconfigured dashboard

  1. Choose Upload JSON file.
  2. Choose the dashboard you downloaded.
  3. Choose Load.

Configuring dashboard by importing the json file

The following screenshot shows your loaded dashboard.

Dashboard showing cluster level metrics

Generate sample data in Amazon MSK (Optional – when you create a new Amazon MSK Cluster)

To generate sample data in Amazon MSK, complete the following steps:

  1. In your AWS Cloud9 environment, log in to the Kafka client.
    ssh -i /home/ec2-user/environment/EC2KeyMSKDemo ec2-user@<Private-IP-of-Kafka Client>

  2. Set the broker endpoint variable
    export MYBROKERS="<Bootstrap Servers Endpoint – captured earlier>"

  3. Run the following command to create a topic called TLSTestTopic60:
    /home/ec2-user/kafka/bin/kafka-topics.sh --command-config /home/ec2-user/kafka/config/client.properties --bootstrap-server $MYBROKERS --create --topic TLSTestTopic60 --partitions 5 --replication-factor 2

  4. Still logged in to the Kafka client, run the following command to start the producer service:
    /home/ec2-user/kafka/bin/kafka-console-producer.sh --producer.config /home/ec2-user/kafka/config/client.properties --broker-list $MYBROKERS --topic TLSTestTopic60

Creating topic and producing messages in Kafka Producer service

  1. Open a new terminal from within your AWS Cloud9 environment and log in to the Kafka client instance
    ssh -i /home/ec2-user/environment/EC2KeyMSKDemo ec2-user@<Private-IP-of-Kafka Client>

  2. Set the broker endpoint variable
    export MYBROKERS="<Bootstrap Servers Endpoint – captured earlier>"

  3. Now you can start the consumer service and see the incoming messages
    /home/ec2-user/kafka/bin/kafka-console-consumer.sh --consumer.config /home/ec2-user/kafka/config/client.properties --bootstrap-server $MYBROKERS --topic TLSTestTopic60 --from-beginning

Starting Kafka consumer service

  1. Press CTRL+C to stop the producer/consumer service.

Kafka metrics dashboards on Amazon Managed Grafana

You can now view your Kafka metrics dashboards on Amazon Managed Grafana:

  • Cluster overall health – Configured using Amazon Managed Service for Prometheus as the data source:
    • Critical metrics

Amazon MSK cluster overview – Configured using Amazon Managed Service for Prometheus as the data source:

  • Critical metrics
  • Cluster throughput (broker-level metrics)

  • Cluster metrics (JVM)

Kafka cluster operation metrics – Configured using CloudWatch as the data source:

  • General overall stats

  • CPU and Memory metrics

Clean up

You will continue to incur costs until you delete the infrastructure that you created for this post. Delete the CloudFormation stack you used to create the respective resources.

If you used an existing cluster, make sure to remove the inbound rules you updated in the security group (otherwise the stack deletion will fail).

  1. On the Amazon MSK console, navigate to your existing cluster.
  2. On the Properties tab, in the Networking settings section, open the security group you applied.
  3. Choose Edit inbound rules.

  1. Choose Delete to remove the rules you added.
  2. Choose Save rules.

Now you can delete your CloudFormation stacks.

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  2. Select ManagedPrometheusAndGrafana and choose Delete.
  3. If you used an existing Amazon MSK cluster, delete the stack Prometheus.
  4. If you created a new Amazon MSK cluster, delete the stack MSKDemo.

Conclusion

This post showed how you can deploy a fully managed, highly available, scalable, and secure monitoring system for Amazon MSK using Amazon Managed Service for Prometheus and Amazon Managed Grafana, and use Grafana dashboards to gain deep insights into various operational metrics. Although this post only discussed using Amazon Managed Service for Prometheus and CloudWatch as the data sources in Amazon Managed Grafana, you can enable various other data sources, such as AWS IoT SiteWise, AWS X-Ray, Redshift, and Amazon Athena, and build a dashboard on top of those metrics. You can use these managed services for monitoring any number of Amazon MSK platforms. Metrics are available to query in Amazon Managed Grafana or Amazon Managed Service for Prometheus in near-real time.

You can use this post as prescriptive guidance and deploy an observability solution for a new or an existing Amazon MSK cluster, identify the metrics that are important for your applications and then create a dashboard using Amazon Managed Grafana and Prometheus.


About the Authors

Anand Mandilwar is an Enterprise Solutions Architect at AWS. He works with enterprise customers helping customers innovate and transform their business in AWS. He is passionate about automation around Cloud operation , Infrastructure provisioning and Cloud Optimization. He also likes python programming. In his spare time, he enjoys honing his photography skill especially in Portrait and landscape area.

Ajit Puthiyavettle is a Solution Architect working with enterprise clients, architecting solutions to achieve business outcomes. He is passionate about solving customer challenges with innovative solutions. His experience is with leading DevOps and security teams for enterprise and SaaS (Software as a Service) companies. Recently he is focussed on helping customers with Security, ML and HCLS workload.

Content Repository for Unstructured Data with Multilingual Semantic Search: Part 1

Post Syndicated from Patrik Nagel original https://aws.amazon.com/blogs/architecture/content-repository-for-unstructured-data-with-multilingual-semantic-search-part-1/

Unstructured data can make up to 80 percent of data in the day-to-day business of financial organizations. For example, these organizations typically store and read PDFs and images for claim processing, underwriting, and know your customer (KYC). Organizations need to make this ingested data accessible and searchable across different entities while logically separating data access according to role requirements.

In this two-part series, we use AWS services to build an end-to-end content repository for storing and processing unstructured data with the following features:

  • Dynamic access control-based logic over unstructured data
  • Multilingual semantic search capabilities

In part 1, we build the architectural foundation for the content repository, including the resource access control logic and a web UI to upload and list documents.

Solution overview

The content repository includes four building blocks:

Frontend and interaction: For this function, we use AWS Amplify, which is a set of purpose-built tools and features to help frontend web and mobile developers quickly build full-stack applications on AWS. The React application uses the AWS Amplify authentication feature to quickly set up a complete authentication flow integrated into Amazon Cognito. Amplify also hosts the frontend application.

Authentication and authorization: Implementing dynamic resource access control with a combination of roles and attributes is fundamental to your content repository security. Amazon Cognito provides a managed, scalable user directory, user sign-up and sign-in flows, and federation capabilities through third-party identity providers. We use Amazon Cognito user pools as the source of user identity for the content repository. You can work with user pool groups to represent different types of user collection, and you can manage their permissions using a group-associated AWS Identity and Access Management (IAM) role.

Users authenticate against the Amazon Cognito user pool. The web app will exchange the user pool tokens for AWS credentials through an Amazon Cognito identity pool in the content repository. You can complement the IAM role-based authorization model by mapping your relevant attributes to principal tags that will be evaluated as part of IAM permission policies. This allows a dynamic and flexible authorization strategy. For use cases that need federation with third-party identity providers, you can base your user collection on existing user group attributes, such as Active Directory group membership.

Backend and business logic: Authenticated users are redirected to the Amazon API Gateway. API Gateway provides managed publishing for application programming interfaces (APIs) that act as the repository’s “front door.” API Gateway also interacts with the repository’s backend through RESTful APIs. This makes the business logic of the content repository extensible for future use cases, such as transcription and translation. We use AWS Lambda as a serverless, event-driven compute service to run specific business logic code, such as uploading a document to the content repository.

Content storage: Amazon Simple Storage Service (Amazon S3) provides virtually unlimited scalability and high durability. With Amazon S3, you can cost-effectively store unstructured documents in their native formats and make it accessible in a secure and scalable way. Enriching the uploaded documents with tags simplifies data governance with fine-grained access control.

Technical architecture

The technical architecture of the content repository with these four components can be found in Figure 1.

Technical architecture of the content repository

Figure 1. Technical architecture of the content repository

Let’s explore the architecture step by step.

  1. The frontend uses the Amplify JS library to add the authentication UI component to your React app, allowing authenticated users to sign in.
  2. Once the user provides their sign-in credentials, they are redirected to Amazon Cognito user pools to be authenticated.
  3. Once the authentication is successful, Amazon Cognito invokes a pre-token generation Lambda function. This function customizes the identity (ID) token with a new claim called department. This new claim is the Amazon Cognito group name from the cognito:preferred_role claim.
  4. Amazon Cognito returns the identity, access, and refresh token in JSON format to the frontend.
  5. The Amplify client library stores the tokens and handles refreshes using the refresh token while the React frontend application calls the API Gateway with the ID token. Note: Usually, you would use the access token to grant access to authorized resources. For this architecture, we use the ID token because we have enriched it with the custom claim during step 3.
  6. API Gateway uses its native integration with Amazon Cognito and validates the ID token’s signature and expiration using Amazon Cognito user pool authorizer. For more complex authorization scenarios, you can use API Gateway Lambda authorizer with the AWS JSON Web Token (JWT) Verify library for verifying JWTs signed by Amazon Cognito.
  7. After successful validation, API Gateway passes the ID token to the backend Lambda function, which can verify and authorize upon it for access control.
  8. Upon document upload action, the backend Lambda function calls the Amazon Cognito identity pool to exchange the ID token for the temporary AWS credentials associated with the cognito:preferred_role claim.
  9. The document upload Lambda function returns a pre-signed URL with the custom department claim in the Amazon S3 path prefix as well as the object tag. The Amazon S3 pre-signed URL is used for the document upload from the frontend application directly to Amazon S3.
  10. Upon document list action, similar to step 8, the backend Lambda function exchanges the ID token for the temporary AWS credentials. The Lambda function returns only the documents based on the user’s preferred group and associated custom department claim.

Prerequisites

You must have the following prerequisites for this solution:

Walkthrough

Setup

The following steps will deploy two AWS CDK stacks into your AWS account:

  • content-repo-stack (blog-content-repo-stack.ts) creates the environment detailed in Figure 1.
  • demo-data-stack (userpool-demo-data-stack.ts) deploys sample users, groups, and role mappings.

To continue setup, use the following commands:

  1. Clone the project git repository:
    git clone https://github.com/aws-samples/content-repository-with-dynamic-access-control content-repository
  2. Install the necessary dependencies:
    cd content-repository/backend-cdk
    npm install
    
  3. Configure environment variables:
    export CDK_DEFAULT_ACCOUNT=$(aws sts get-caller-identity --query 'Account' --output text)
    export CDK_DEFAULT_REGION=$(aws configure get region)
    
  4. Bootstrap your account for AWS CDK usage:
    cdk bootstrap aws://$CDK_DEFAULT_ACCOUNT/$CDK_DEFAULT_REGION
  5. Deploy the code to your AWS account:
    cdk deploy --all

Using the repository

Once you deploy the CDK stacks in your AWS account, follow these steps:

1. Access the frontend application:

a. Copy the amplifyHostedAppUrl value shown in the AWS CDK output from the content-repo-stack.

b. Use the URL with your web browser to access the frontend application.

c. A temporary page displays until the automated build and deployment of the React application completes after 4-5 minutes.

2. Application sign-in and role-based access control (RBAC):

a. The React webpage prompts you to sign in and then change the temporary password.

b. The content repository provides two demo users with credentials as part of the demo-data-stack in the AWS CDK output. In this walkthrough, we use the sales-user user, which belongs to the sales department group to validate RBAC.

3. Upload a document to the content repository:

a. Authenticate as sales-user.

b. Select upload to upload your first document to the content repository.

c. The repository provides sample documents in the assets sub-folder.

4. List your uploaded document:

a. Select list to show the uploaded sales content.

b. To verify the dynamic access control, repeat steps 2 and 3 for the marketing-user user, which belongs to the marketing department group.

c. Sign-in to the AWS Management Console and navigate to the Amazon S3 bucket with the prefix content-repo-stack-s3sourcebucket to confirm that all the uploaded content exists.

Implementation notes

Frontend deployment and cross-origin access

The content-repo-stack contains an AwsCustomResource construct. This construct uses the Amplify API to start the release job of the Amplify hosted frontend application. The preBuild step of the Amplify application build specification dynamically configures its backend for the Amazon Cognito-based authentication. The required Amazon Cognito configuration parameters are retrieved from the AWS Systems Manager Parameter Store during build time. Similarly, the Amplify application postBuild step updates the Amazon S3 cross-origin resource sharing (CORS) rule for the Amazon S3 bucket to only allow cross-origin access from the Amplify-hosted URL of the frontend application.

Application sign-in and access control

The Amazon Cognito identity pool configuration is set to Choose role from token for authenticated users, as in Figure 2. This setup permits authenticated users to pass the roles in the ID token that the Amazon Cognito user pool assigned. Backend Lambda functions use the roles that appear in the cognito:roles and cognito:preferred_role claims in the ID token for RBAC.

Amazon Cognito identity pool configuration – using tokens to assign roles to authenticated users

Figure 2. Amazon Cognito identity pool configuration – using tokens to assign roles to authenticated users

In the attributes for access control section, we configured a custom mapping from the augmented department token claim to a tag key, as in Figure 3. The backend logic uses the tag key to match the PrincipalTag condition in IAM policies to control access to AWS resources.

Amazon Cognito identity pool configuration – custom mapping from claim names to tag keys

Figure 3. Amazon Cognito identity pool configuration – custom mapping from claim names to tag keys

Document upload

The presigned_url.py Lambda function generates a pre-signed Amazon S3 URL using the department token claim as the key. This function automatically organizes the uploaded document into a logical structure in the Amazon S3 source bucket. Accordingly, the cognito:preferred_role used for the Amazon S3 client credentials in the Lambda function has a permission policy using the PrincipalTag department to dynamically limit access to the Amazon S3 key, as in Figure 4.

Permission policy using PrincipalTag to upload documents to Amazon S3

Figure 4. Permission policy using PrincipalTag to upload documents to Amazon S3

Document listing

The list functionality only shows the uploaded content belonging to the preferred group of authenticated Amazon Cognito user pool user. To only list the files that a specific user (for example, sales-user) has access to, use the PrincipalTag s3:prefix condition, as in Figure 5.

Permission policy using s3:prefix condition with session tags to list documents

Figure 5. Permission policy using s3:prefix condition with session tags to list documents

Cleanup

In the backend-cdk subdirectory, delete the deployed resources:

cdk destroy --all

Conclusion

In this blog, we demonstrated how to build a content repository with an easy-to-use web application for unstructured data that ingests documents while maintaining dynamic access control for users within departments. These steps provide a foundation to build your own content repository to store and process documents. As next steps, based on your organization’s security requirements, you can implement more complex access control use cases by balancing IAM role and principal tags. For example, you can use Amazon Cognito user pool custom attributes for additional dimensions such as document “clearance” with optional modification in the pre-token generation Lambda.

In the next part of this blog series, we will enrich the content repository with multi-lingual semantic search features while maintaining the access control fundamentals we’ve already implemented. For additional information on how you can build a solution to search for information across multiple scanned documents, PDFs, and images with compliance capabilities, please refer our Document Understanding Solution from AWS Solutions Library.

Proactive Insights with Amazon DevOps Guru for RDS

Post Syndicated from Kishore Dhamodaran original https://aws.amazon.com/blogs/devops/proactive-insights-with-amazon-devops-guru-for-rds/

Today, we are pleased to announce a new Amazon DevOps Guru for RDS capability: Proactive Insights. DevOps Guru for RDS is a fully-managed service powered by machine learning (ML), that uses the data collected by RDS Performance Insights to detect and alert customers of anomalous behaviors within Amazon Aurora databases. Since its release, DevOps Guru for RDS has empowered customers with information to quickly react to performance problems and to take corrective actions. Now, Proactive Insights adds recommendations related to operational issues that may prevent potential issues in the future.

Proactive Insights requires no additional set up for customers already using DevOps Guru for RDS, for both Amazon Aurora MySQL-Compatible Edition and Amazon Aurora PostgreSQL-Compatible Edition.

The following are example use cases of operational issues available for Proactive Insights today, with more insights coming over time:

  • Long InnoDB History for Aurora MySQL-Compatible engines – Triggered when the InnoDB history list length becomes very large.
  • Temporary tables created on disk for Aurora MySQL-Compatible engines – Triggered when the ratio of temporary tables created versus all temporary tables breaches a threshold.
  • Idle In Transaction for Aurora PostgreSQL-Compatible engines – Triggered when sessions connected to the database are not performing active work, but can keep database resources blocked.

To get started, navigate to the Amazon DevOps Guru Dashboard where you can see a summary of your system’s overall health, including ongoing proactive insights. In the following screen capture, the number three indicates that there are three ongoing proactive insights. Click on that number to see the listing of the corresponding Proactive Insights, which may include RDS or other Proactive Insights supported by Amazon DevOps Guru.

Amazon DevOps Guru Dashboard where you can see a summary of your system’s overall health, including ongoing proactive insights

Figure 1. Amazon DevOps Guru Dashboard where you can see a summary of your system’s overall health, including ongoing proactive insights.

Ongoing problems (including reactive and proactive insights) are also highlighted against your database instance on the Database list page in the Amazon RDS console.

Proactive and Reactive Insights are highlighted against your database instance on the Database list page in the Amazon RDS console

Figure 2. Proactive and Reactive Insights are highlighted against your database instance on the Database list page in the Amazon RDS console.

In the following sections, we will dive deep on these use cases of DevOps Guru for RDS Proactive Insights.

Long InnoDB History for Aurora MySQL-Compatible engines

The InnoDB history list is a global list of the undo logs for committed transactions. MySQL uses the history list to purge records and log pages when transactions no longer require the history.  If the InnoDB history list length grows too large, indicating a large number of old row versions, queries and even the database shutdown process can become slower.

DevOps Guru for RDS now detects when the history list length exceeds 1 million records and alerts users to close (either by commit or by rollback) any unnecessary long-running transactions before triggering database changes that involve a shutdown (this includes reboots and database version upgrades).

From the DevOps Guru console, navigate to Insights, choose Proactive, then choose “RDS InnoDB History List Length Anomalous” Proactive Insight with an ongoing status. You will notice that Proactive Insights provides an “Insight overview”, “Metrics” and “Recommendations”.

Insight overview provides you basic information on this insight. In our case, the history list for row changes increased significantly, which affects query and shutdown performance.

Long InnoDB History for Aurora MySQL-Compatible engines Insight overview

Figure 3. Long InnoDB History for Aurora MySQL-Compatible engines Insight overview.

The Metrics panel gives you a graphical representation of the history list length and the timeline, allowing you to correlate it with any anomalous application activity that may have occurred during this window.

Long InnoDB History for Aurora MySQL-Compatible engines Metrics panel

Figure 4. Long InnoDB History for Aurora MySQL-Compatible engines Metrics panel.

The Recommendations section suggests actions that you can take to mitigate this issue before it leads to a bigger problem. You will also notice the rationale behind the recommendation under the “Why is DevOps Guru recommending this?” column.

The Recommendations section suggests actions that you can take to mitigate this issue before it leads to a bigger problem

Figure 5. The Recommendations section suggests actions that you can take to mitigate this issue before it leads to a bigger problem.

Temporary tables created on disk for Aurora MySQL-Compatible engines

Sometimes it is necessary for the MySQL database to create an internal temporary table while processing a query. An internal temporary table can be held in memory and processed by the TempTable or MEMORY storage engine, or stored on disk by the InnoDB storage engine. An increase of temporary tables created on disk instead of in memory can impact the database performance.

DevOps Guru for RDS now monitors the rate at which the database creates temporary tables and the percentage of those temporary tables that use disk. When these values cross recommended levels over a given period of time, DevOps Guru for RDS creates an insight exposing this situation before it becomes critical.

From the DevOps Guru console, navigate to Insights, choose Proactive, then choose “RDS Temporary Tables On Disk AnomalousProactive Insight with an ongoing status. You will notice this Proactive Insight provides an “Insight overview”, “Metrics” and “Recommendations”.

Insight overview provides you basic information on this insight. In our case, more than 58% of the total temporary tables created per second were using disk, with a sustained rate of two temporary tables on disk created every second, which indicates that query performance is degrading.

Temporary tables created on disk insight overview

Figure 6. Temporary tables created on disk insight overview.

The Metrics panel shows you a graphical representation of the information specific for this insight. You will be presented with the evolution of the amount of temporary tables created on disk per second, the percentage of temporary tables on disk (out of the total number of database-created temporary tables), and of the overall rate at which the temporary tables are created (per second).

Temporary tables created on disk evolution of the amount of temporary tables created on disk per second

Figure 7. Temporary tables created on disk – evolution of the amount of temporary tables created on disk per second.

Temporary tables created on disk the percentage of temporary tables on disk (out of the total number of database-created temporary tables)

Figure 8. Temporary tables created on disk – the percentage of temporary tables on disk (out of the total number of database-created temporary tables).

Temporary tables created on disk overall rate at which the temporary tables are created (per second)

Figure 9. Temporary tables created on disk – overall rate at which the temporary tables are created (per second).

The Recommendations section suggests actions to avoid this situation when possible, such as not using BLOB and TEXT data types, tuning tmp_table_size and max_heap_table_size database parameters, data set reduction, columns indexing and more.

Temporary tables created on disk actions to avoid this situation when possible, such as not using BLOB and TEXT data types, tuning tmp_table_size and max_heap_table_size database parameters, data set reduction, columns indexing and more

Figure 10. Temporary tables created on disk – actions to avoid this situation when possible, such as not using BLOB and TEXT data types, tuning tmp_table_size and max_heap_table_size database parameters, data set reduction, columns indexing and more.

Additional explanations on this use case can be found by clicking on the “View troubleshooting doc” link.

Idle In Transaction for Aurora PostgreSQL-Compatible engines

A connection that has been idle in transaction  for too long can impact performance by holding locks, blocking other queries, or by preventing VACUUM (including autovacuum) from cleaning up dead rows.
PostgreSQL database requires periodic maintenance, which is known as vacuuming. Autovacuum in PostgreSQL automates the execution of VACUUM and ANALYZE commands. This process gathers the table statistics and deletes the dead rows. When vacuuming does not occur, this negatively impacts the database performance. It leads to an increase in table and index bloat (the disk space that was used by a table or index and is available for reuse by the database but has not been reclaimed), leads to stale statistics and can even end in transaction wraparound (when the number of unique transaction ids reaches its maximum of about two billion).

DevOps Guru for RDS monitors the time spent by sessions in an Aurora PostgreSQL database in idle in transaction state and raises initially a warning notification, followed by an alarm notification if the idle in transaction state continues (the current thresholds are 1800 seconds for the warning and 3600 seconds for the alarm).

From the DevOps Guru console, navigate to Insights, choose Proactive, then choose “RDS Idle In Transaction Max Time AnomalousProactive Insight with an ongoing status. You will notice this Proactive Insights provides an “Insight overview”, “Metrics” and “Recommendations”.

In our case, a connection has been in “idle in transaction” state for more than 1800 seconds, which could impact the database performance.

A connection has been in “idle in transaction” state for more than 1800 seconds, which could impact the database performance

Figure 11. A connection has been in “idle in transaction” state for more than 1800 seconds, which could impact the database performance.

The Metrics panel shows you a graphical representation of when the long-running “idle in transaction” connections started.

The Metrics panel shows you a graphical representation of when the long-running “idle in transaction” connections started

Figure 12. The Metrics panel shows you a graphical representation of when the long-running “idle in transaction” connections started.

As with the other insights, recommended actions are listed and a troubleshooting doc is linked for even more details on this use case.

Recommended actions are listed and a troubleshooting doc is linked for even more details on this use case

Figure 13. Recommended actions are listed and a troubleshooting doc is linked for even more details on this use case.

Conclusion

With Proactive Insights, DevOpsGuru for RDS enhances its abilities to help you monitor your databases by notifying you about potential operational issues, before they become bigger problems down the road. To get started, you need to ensure that you have enabled Performance Insights on the database instance(s) you want monitored, as well as ensure and confirm that DevOps Guru is enabled to monitor those instances (for example by enabling it at account level, by monitoring specific CloudFormation stacks or by using AWS tags for specific Aurora resources). Proactive Insights is available in all regions where DevOps Guru for RDS is supported. To learn more about Proactive Insights, join us for a free hands-on Immersion Day (available in three time zones) on March 15th or April 12th.

About the authors:

Kishore Dhamodaran

Kishore Dhamodaran is a Senior Solutions Architect at AWS.

Raluca Constantin

Raluca Constantin is a Senior Database Engineer with the Relational Database Services (RDS) team at Amazon Web Services. She has 16 years of experience in the databases world. She enjoys travels, hikes, arts and is a proud mother of a 12y old daughter and a 7y old son.

Jonathan Vogel

Jonathan is a Developer Advocate at AWS. He was a DevOps Specialist Solutions Architect at AWS for two years prior to taking on the Developer Advocate role. Prior to AWS, he practiced professional software development for over a decade. Jonathan enjoys music, birding and climbing rocks.

Considerations for the security operations center in the cloud: deployment using AWS security services

Post Syndicated from Stuart Gregg original https://aws.amazon.com/blogs/security/considerations-for-the-security-operations-center-in-the-cloud-deployment-using-aws-security-services/

Welcome back. If you’re joining this series for the first time, we recommend that you read the first blog post in this series, Considerations for security operations in the cloud, for some context on what we will discuss and deploy in this blog post. In the earlier post, we talked through the different operating models (centralized, decentralized, or hybrid) that you can deploy for a Security Operations Center (SOC) function when you operate in the cloud. We covered the advantages of each model and some of the potential drawbacks you might see when you start to scale up operations within the cloud.

This post will focus on the Amazon Web Services (AWS) native security service, AWS Security Hub, that you can use to deploy in different SOC operating models. AWS Security Hub is a cloud security posture management service that SOC teams can use to perform security best practice checks and aggregate alerts. AWS Security Hub accepts findings from multiple sources, whether native to AWS, from the pre-built integrations, or from your own sources converted into the AWS Security Finding Format (ASFF). The data collected in Security Hub facilitates response and remediation actions.

Although the models we describe here use services that are native to AWS, the reference architectures that correspond to each operating model can be applied to a variety of deployments, including multi-cloud and traditional on-premises deployments. The majority of this post will focus on the decentralized and hybrid models—the centralized model is well documented and has reference architectures already available for you today.

Each organization is different, and no one operating model will fit everyone. You should choose the model that works best for your organizational landscape, with an understanding that the landscape will change and evolve over time. Using feedback loops and being open to change is important to help you meet the continued needs of your business. Additional factors to consider include, but are not limited to: staff skills, compliance requirements, previous operating model, and budget.

The centralized model

The centralized operating model for the SOC is well documented and frequently discussed, both at AWS and in the security community. According to AWS best practices, typically you designate a central security tooling account that is dedicated to operating security services, monitoring AWS accounts, and automating security alerting and response. The security tooling account serves as the administrator account for security services that are managed in an administrator/member structure across your AWS accounts. The key objectives for establishing a security tooling account are the following:

  • Provide a dedicated enclave with controlled access for managing security guardrails, monitoring, and response.
  • Maintain the appropriate centralized security infrastructure to monitor security operations data and maintain traceability across the security lifecycle.

Figure 1 demonstrates the variety of AWS security services that you can deploy in the central security account. For example, Security Hub within the security tooling account can act as the administrator to enable Security Hub in the member accounts, as well as view findings, view insights, and set security standards across member accounts, which can help simplify security posture management across your existing and future accounts.

Figure 1: Reference architecture for the security tooling account in a centralized model

Figure 1: Reference architecture for the security tooling account in a centralized model

As mentioned earlier, you can enable Security Hub to administer and enable member accounts. This is achieved by using AWS Organizations and the delegated administrator functionality. In addition, you can use Security Hub cross-Region aggregation within the delegated administrator account to aggregate findings, finding updates, insights, control compliance statuses, and security scores from multiple Regions to a single aggregation Region. You can then manage this data from the aggregation Region. Figure 2 shows the reference architecture for this functionality.

Figure 2: Reference architecture for Security Hub in the delegated administrator model

Figure 2: Reference architecture for Security Hub in the delegated administrator model

The AWS Security Reference Architecture (AWS SRA) is a great starting point for establishing the centralized security operations model. The AWS SRA is a holistic set of guidelines for deploying the full complement of AWS security services in a multi-account environment. You can use it to help design, implement, and manage AWS security services so that they align with AWS best practices. The AWS SRA’s Security Hub Organization solution provides deployable templates and examples that automate the process of enabling Security Hub by delegating administration to an account and configuring Security Hub for the existing and future AWS Organizations accounts.

The decentralized and hybrid models

As mentioned in Considerations for security operations in the cloud, the decentralized and hybrid SOC models provide many benefits for organizations. The flexibility of these operating models allows organizational units (OUs) to control how they deal with security-related incidents while still having organization-wide visibility into security posture. This flexibility is important as organizations start to scale up activities within the cloud.

The reference architecture in Figure 3 shows how the benefits we discussed in our earlier blog post can be architected in the decentralized and hybrid operating models in the AWS Cloud.

Figure 3: Reference architecture for the decentralized and hybrid operating models in AWS

Figure 3: Reference architecture for the decentralized and hybrid operating models in AWS

The key features of this architecture are as follows:

  1. The organization root account is separate, according to AWS Organizations best practices. By using service control policies (SCPs), the root account can still achieve a level of governance across the business.
  2. Dedicated accounts have been created for each OU for the Security Hub administration. The model we will use for this deployment is the invite model. In this reference architecture and as an example, we’re using Amazon GuardDuty to flow findings into Security Hub. When you use this model, each OU can manage findings for that OU. This gives you flexibility to work from the Security Hub admin with full visibility of the OU and accounts associated with that OU, or to work in each member account and view findings for that account only.
  3. (Optional, for use with the hybrid model) Each OU’s Security Hub member accounts first send events to their Security Hub admin account. The Security Hub admin account will then send events for that OU to the local Amazon EventBridge bus. You can then set up rules to forward events to a central EventBridge bus in a dedicated AWS account. In the architecture in Figure 3, this account is named SecAnalytics. This step will follow a similar flow as the one described in this AWS Cloud Operations & Migrations blog post.
  4. (Optional, for use with the hybrid model) After the OUs have sent data to the central bus, you can use a capability similar to the one in this AWS Architecture Blog post to start organizing the findings and gain organization-wide visibility. The solution in the earlier post used Amazon QuickSight to visualize the data, but you can use another tool or pre-existing data pipeline.

Items 3 and 4 labeled with (Optional) are capabilities that enable the hybrid model; these are not required if you only want to enable the decentralized model.

Considerations for all deployments

Keep the following considerations in mind for all deployments:

  • Steady state operations should be considered for whichever model you deploy in. For the centralized model, you can use functionality within AWS Organizations to automatically enable Security Hub for accounts within the organization. In the decentralized and hybrid models, you will need to build out this capability or use a similar capability as described in this repo.
  • Alert fatigue happens when humans work on the same repetitive tasks’ day in and day out. To help reduce this, within the reference architecture and solution overview, we’ve added the capability described in this Security Blog post to automatically suppress findings based on criteria set by you. For the centralized model, you can add this capability in the delegated admin account for Security Hub. For the decentralized and hybrid models, we recommend that you put the auto-suppression capability in the Security Hub admin account, and then centralize the rules for suppression for that OU at the Security Hub admin level. This will reduce the overhead for deploying suppression rules multiple times and give a single location where rules are placed for that OU.
  • Context is key. Within the reference architecture and solution overview for decentralized and hybrid deployments, we’ve added the capability described in this Security Blog post. This capability will add additional context, such as the account name, the OU associated with the account, security contact information, and account tags. This information is pulled from AWS Organizations to enrich Security Hub findings. This additional context can also be used in the centralized model.

Deploy the decentralized and hybrid models

In this section, we’ll walk you through the deployment that reflects the reference architecture for the decentralized and hybrid models. Figure 4 shows the solution architecture, including the solution that needs to be deployed in the Security Hub admin account and in the aggregation Region for each business unit within the organization. The solution provides the capability to suppress Security Hub findings, enrich the findings, and propagate findings to central security accounts.

Figure 4: Reference architecture for the decentralized and hybrid deployment

Figure 4: Reference architecture for the decentralized and hybrid deployment

The solution architecture consists of the following:

  • An EventBridge rule to invoke a Lambda function (Suppression Lambda) as the target to suppress any findings based on specific generator IDs within specific member accounts.

    Note: The Security Hub Generator IDs and AWS Account IDs in the EventBridge rule are left as placeholders so that you can fill based on your needs.

  • An EventBridge rule to invoke a Lambda function (Enrichment Lambda) as the target to enrich the findings with AWS account and OU related metadata, along with alternate contact information to better prioritize the findings. The API calls to AWS Organizations and AWS account management services are optimized by caching the metadata in an Amazon DynamoDB table with a time-to-live (TTL) value of 24 hours.
  • An EventBridge rule to post the enriched findings that were not suppressed to a custom EventBridge event bus in the organization-level Security Tooling/SecAnalytics account.

Prerequisites

The following are the prerequisites for this deployment:

  • AWS Organizations is utilized across the business. In this scenario, AWS Organizations will be used to group AWS accounts into OUs, as well as to provide enrichment data for Security Hub findings.
  • Alternative contacts for AWS accounts have been filled out with the most up-to-date information. This is a best practice recommendation. This information will be used for enrichment of the Security Hub findings.
  • Your organization already has a pipeline in place for indexing Security Hub findings and visualizing them.
  • Security Hub is set up in the invite model. OU-level Security Hub accounts have been invited and accepted to be managed by the OU-level Security Hub admin account.
  • The grouping of findings across multiple OU-level Security Hub admin accounts uses Amazon EventBridge to forward events to a centralized bus. You should have the event bus set up ready for this deployment.

Deploy the solution

This solution deployment consists of two parts:

  1. Create an IAM role in your Organizations management account that allows BU-level Security Hub admin to access account metadata, as described in the Create the IAM role procedure that follows.
  2. Deploy the Enrichment Lambda function, the Suppression Lambda function, and the associated EventBridge event rules within the BU-level Security Hub administrator account.

Create the IAM role

Follow the instructions in Creating a role to delegate permissions to an IAM user to create an IAM role by using the IAM console, AWS Command Line Interface (AWS CLI), or AWS API. Create the role in the AWS Organizations management account with the role name as account-contact-readonly, based on the following trust and permission policy templates. You will need the account ID of your BU-level Security Hub administrator account.

The IAM trust policy allows the Security Hub administrator account to assume the role in your Organizations management account.

Note: The following trust policy shows only one BU Security admin account. You will need to add all BU Security admin accounts to the trust policy.

IAM role trust policy

{
   "Version": "2012-10-17",
   "Statement": [
     {
       "Effect": "Allow",
       "Principal": {
         "AWS": "arn:aws:iam::<BU SecHubAdmin Account ID>:root"
       },
       "Action": "sts:AssumeRole",
       "Condition": {}
     }
   ]
 }

Note: Replace <BU SecHubAdmin Account ID> with the account ID of your decentralized BU-level Security Hub administrator account. After the solution is deployed, you should update the principal in the preceding trust policy to use the new IAM role created for the solution.

IAM permission policy

{
     "Version": "2012-10-17",
     "Statement": [
         {
            "Action": "Account:GetAlternateContact",
            "Resource": [
                "arn:aws:account::<Org Management Account ID>:account/o-*/*"
            ],
            "Effect": "Allow"
        },
        {
            "Action": [
                "organizations:DescribeAccount",
                "organizations:ListTagsForResource",
                "organizations:DescribeOrganizationalUnit",
                "organizations:ListParents"
            ],
            "Resource": [
                "arn:aws:organizations::<Org Management Account ID>:account/o-*/*",
                "arn:aws:organizations::<Org Management Account ID>:ou/o-*/ou-*"
            ],
            "Effect": "Allow"
        }
     ]
 }

The IAM permission policy allows the Security Hub administrator account to look up the alternate contact information for the member accounts.

Make a note of the role Amazon Resource Name (ARN) for the IAM role, which will be similar to this format:
arn:aws:iam::<Org Management Account ID>:role/account-contact-readonly.

You will need this ARN when you deploy the solution in the next procedure.

Use AWS CloudFormation to create the IAM role

Alternatively, you can use the CloudFormation template we provide in our GitHub repository to create the role in the management account. The IAM role ARN is available in the Outputs section of the created CloudFormation stack.

Deploy the solution to your BU-level Security Hub administrator account

After you have the IAM role created, you can deploy the solution either from the AWS Management Console, or from our GitHub repository by using the AWS SAM CLI.

Note: If you’ve designated an aggregation Region within the BU-level Security Hub administrator account, you can deploy this solution only in the aggregation Region. Otherwise, you need to deploy this solution separately in each Region of the BU-level Security Hub administrator account where Security Hub is enabled.

To deploy the solution by using the AWS Management Console

  1. In your Security Hub administrator account, launch the template by choosing the following Launch Stack button, which creates the stack the in us-east-1 Region.

    Launch Stack stack

    Note: If your Security Hub aggregation Region is different than us-east-1 or you want to deploy the solution in a different AWS Region, you can deploy the solution from the GitHub repository described in the next section.

  2. On the Quick create stack page, for Stack name, enter a unique stack name for this account; for example, aws-security-hub-decentralized-deployment-stack
     
    Figure 5: Quick create CloudFormation stack for the solution

    Figure 5: Quick create CloudFormation stack for the solution

  3. For SecurityToolingAccountEventBus, provide the EventBus ARN in the security tooling account to post the Security Hub findings from the BU-level Security Hub administrator account.
  4. For OrgManagementAccountContactRole, enter the role ARN of the role you created previously in the Create IAM role procedure.
  5. Choose Create stack.
  6. After the stack is created, go to the Resources tab and take note of the name of the IAM role that was created.
  7. Update the principal element of the IAM role trust policy that you previously created in the Organizations management account in the Create the IAM role procedure, replacing the existing value with the role name you noted down.

To deploy the solution from our GitHub repository and AWS SAM CLI

  1. Install the AWS SAM CLI.
  2. Download or clone the GitHub repository by using the following commands.

    git clone https://github.com/aws-samples/aws-securityhub-decentralized-operations-solution.git
    cd aws-securityhub-decentralized-operations-solution

  3. Update the content of the profile.txt file with the profile name you want to use for the deployment.
  4. To create a new bucket for deployment artifacts, run create-bucket.sh by specifying the Region as argument.

    $ ./create-bucket.sh us-east-1

  5. Deploy the solution to the account by running the deploy.sh script by specifying the Region as argument.

    $ ./deploy.sh us-east-1

  6. After the stack is created, go to the Resources tab and take note of the name of the IAM role that was created.
  7. Update the principal element of the IAM role trust policy that you previously created in the Organizations management account in the Create the IAM role procedure, replacing it with the role name you noted down.

    "AWS": "arn:aws:iam::<BU SH Delegated Account ID>: role/<Role Name>"

Note: The EventBridge rule to invoke the findings suppression Lambda function uses placeholders for the generator IDs and AWS account IDs. You need to update the EventBridge rule to meet your specific organizational requirements.

Further enhancements and conclusion

Beyond what is described in the decentralized and hybrid models, you can extend the solution to include the following aspects to meet your security operational needs:

  • In Considerations for security operations in the cloud, we spoke about the role of ChatOps. AWS Chatbot can enable OUs to set up rules to post notifications directly into chat rooms such as Amazon Chime or Slack. You can define rules to send only certain severity notifications or findings that are important to your OU to the chat room.
  • SCPs give organizations a level of control and governance. See this blog post for some best practices for deploying SCPs, as well as example policies that could be beneficial for your organization in any model you operate in.
  • We’ve performed testing of the decentralized and hybrid models in the reference architecture within one AWS Region. Although we don’t see any reason why this solution would not work in multiple Regions, if you do operate in multiple Regions you would need to deploy the CloudFormation template in each Region that you operate in. At this stage, you can keep findings within a Region or choose to centralize across multiple Regions by sending to the single central bus in Amazon EventBridge—the flexibility is yours.
  • The decentralized and hybrid models can also be extended if you operate in multiple organizations in AWS Organizations or have standalone accounts outside of your organization that you want to monitor. Interesting use cases could be in mergers and acquisitions scenarios, when newly acquired accounts need to be monitored to understand their posture before bringing them fully into the organization.

Throughout this two-part blog series, we’ve explored the role of the Security Operations Center (SOC) function, both traditionally in an on-premises environment and in the cloud. We’ve explored different operating models, from the traditional centralized deployment to the decentralized and hybrid models. We’ve also demonstrated, with reference architectures and deployable solutions, how you can achieve the different operating models in the AWS Cloud by using native AWS services. In the end, you should choose the model that works best for your environment and the security landscape you work in.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Stuart Gregg

Stuart Gregg

Stuart enjoys providing thought leadership and being a trusted advisor to customers. In his spare time Stuart can be seen either training for an Ironman or snacking.

Author

Siva Rajamani

Siva is a Boston-based Enterprise Solutions Architect. He enjoys working closely with customers and supporting their digital transformation and AWS adoption journey. His core areas of focus are Serverless, Application Integration, and Security.

Simplify Online Analytical Processing (OLAP) queries in Amazon Redshift using new SQL constructs such as ROLLUP, CUBE, and GROUPING SETS

Post Syndicated from Satesh Sonti original https://aws.amazon.com/blogs/big-data/simplify-online-analytical-processing-olap-queries-in-amazon-redshift-using-new-sql-constructs-such-as-rollup-cube-and-grouping-sets/

Amazon Redshift is a fully managed, petabyte-scale, massively parallel data warehouse that makes it fast, simple, and cost-effective to analyze all your data using standard SQL and your existing business intelligence (BI) tools.

We are continuously investing to make analytics easy with Redshift by simplifying SQL constructs and adding new operators. Now we are adding ROLLUP, CUBE, and GROUPING SETS SQL aggregation extensions to perform multiple aggregate operations in single statement and easily include subtotals, totals, and collections of subtotals in a query.

In this post, we discuss how to use these extensions to simplify your queries in Amazon Redshift.

Solution overview

Online Analytical Processing (OLAP) is an effective tool for today’s data and business analysts. It helps you see your mission-critical metrics at different aggregation levels in a single pane of glass. An analyst can use OLAP aggregations to analyze buying patterns by grouping customers by demographic, geographic, and psychographic data, and then summarizing the data to look for trends. This could include analyzing the frequency of purchases, the time frames between purchases, and the types of items being purchased. Such analysis can provide insight into customer preferences and behavior, which can be used to inform marketing strategies and product development. For example, a data analyst can query the data to display a spreadsheet showing a company’s certain type of products sold in the US in the month of July, compare revenue figures with those for the same products in September, and then see a comparison of other product sales in the US at the same time period.

Traditionally, business analysts and data analysts use a set of SQL UNION queries to achieve the desired level of detail and rollups. However, it can be very time consuming and cumbersome to write and maintain. Furthermore, the level of detail and rollups that can be achieved with this approach is limited, because it requires the user to write multiple queries for each different level of detail and rollup.

Many customers are considering migrating to Amazon Redshift from other data warehouse systems that support OLAP GROUP BY clauses. To make this migration process as seamless as possible, Amazon Redshift now offers support for ROLLUP, CUBE, and GROUPING SETS. This will allow for a smoother migration of OLAP workloads, with minimal rewrites. Ultimately, this will result in a faster and streamlined transition to Amazon Redshift. Business and data analysts can now write a single SQL to do the job of multiple UNION queries.

In the next sections, we use sample supplier balances data from TPC-H dataset as a running example to demonstrate the use of ROLLUP, CUBE, and GROUPING SETS extensions. This dataset consists of supplier account balances across different regions and countries. We demonstrate how to find account balance subtotals and grand totals at each nation level, region level, and a combination of both. All these analytical questions can be answered by a business user by running simple single-line SQL statements. Along with aggregations, this post also demonstrates how the results can be traced back to attributes participated in generating subtotals.

Data preparation

To set up the use case, complete the following steps:

  1. On the Amazon Redshift console, in the navigation pane, choose Editor¸ then Query editor v2.

The query editor v2 opens in a new browser tab.

  1. Create a supplier sample table and insert sample data:
create table supp_sample (supp_id integer, region_nm char(25), nation_nm char(25), acct_balance numeric(12,2));

INSERT INTO public.supp_sample (supp_id,region_nm,nation_nm,acct_balance)
VALUES
(90470,'AFRICA                   ','KENYA                    ',1745.57),
(99910,'AFRICA                   ','ALGERIA                  ',3659.98),
(26398,'AMERICA                  ','UNITED STATES            ',2575.77),
(43908,'AMERICA                  ','CANADA                   ',1428.27),
(3882,'AMERICA                  ','UNITED STATES            ',7932.67),
(42168,'ASIA                     ','JAPAN                    ',343.34),
(68461,'ASIA                     ','CHINA                    ',2216.11),
(89676,'ASIA                     ','INDIA                    ',4160.75),
(52670,'EUROPE                   ','RUSSIA                   ',2469.40),
(32190,'EUROPE                   ','RUSSIA                   ',1119.55),
(19587,'EUROPE                   ','GERMANY                  ',9904.98),
(1134,'MIDDLE EAST              ','EGYPT                    ',7977.48),
(35213,'MIDDLE EAST              ','EGYPT                    ',737.28),
(36132,'MIDDLE EAST              ','JORDAN                   ',5052.87);

We took a sample from the result of the following query run on TPC-H dataset. You can use the following query and take sample records to try the SQL statement described in this post:

select s_suppkey supp_id, r.r_name region_nm,n.n_name nation_nm, s.s_acctbal acct_balance
from supplier s, nation n, region r
where
s.s_nationkey = n.n_nationkey
and n.n_regionkey = r.r_regionkey

Let’s review the sample data before running the SQLs using GROUPING SETS, ROLLUP, and CUBE extensions.

The supp_sample table consists of supplier account balances from various nations and regions across the world. The following are the attribute definitions:

  • supp_id – The unique identifier for each supplier
  • region_nm – The region in which the supplier operates
  • nation_nm – The nation in which the supplier operates
  • acct_balance – The supplier’s outstanding account balance

GROUPING SETS

GROUPING SETS is a SQL aggregation extension to group the query results by one or more columns in a single statement. You can use GROUPING SETS instead of performing multiple SELECT queries with different GROUP BY keys and merge (UNION) their results.

In this section, we show how to find the following:

  • Account balances aggregated for each region
  • Account balances aggregated for each nation
  • Merged results of both aggregations

Run the following SQL statement using GROUPING SETS:

SELECT region_nm, nation_nm, sum(acct_balance) as total_balance
FROM supp_sample
GROUP BY GROUPING SETS (region_nm, nation_nm);

As shown in the following screenshot, the result set includes aggregated account balances by region_nm, followed by nation_nm, and then both results combined in a single output.

ROLLUP

The ROLLUP function generates aggregated results at multiple levels of grouping, starting from the most detailed level and then aggregating up to the next level. It groups data by particular columns and extra rows that represent the subtotals, and assumes a hierarchy among the GROUP BY columns.

In this section, we show how to find the following:

  • Account balances for each combination of region_nm and nation_nm
  • Rolled-up account balances for each region_nm
  • Rolled-up account balances for all regions

Use the following SQL statement using ROLLUP:

SELECT region_nm, nation_nm, sum(acct_balance) as total_balance
FROM supp_sample
GROUP BY ROLLUP (region_nm, nation_nm)
ORDER BY region_nm,nation_nm;

The following result shows rolled-up values starting from each combination of region_nm and nation_nm and rolls up in the hierarchy from nation_nm to region_nm. The rows with a value for region_nm and NULL value for nation_nm represent the subtotals for the region (marked in green). The rows with NULL value for both region_nm and nation_nm has the grand total—the rolled-up account balances for all regions (marked in red).


ROLLUP is structurally equivalent to the following GROUPING SETS query:

SELECT region_nm, nation_nm, sum(acct_balance) as total_balance
FROM supp_sample
GROUP BY GROUPING SETS((region_nm, nation_nm), (region_nm), ())
ORDER BY region_nm,nation_nm;

You can rewrite the preceding ROLLUP query using GROUPING SETS. However, using ROLLUP is a much simpler and readable construct for this use case.

CUBE

CUBE groups data by the provided columns, returning extra subtotal rows representing the totals throughout all levels of grouping columns, in addition to the grouped rows. CUBE returns the same rows as ROLLUP, while adding additional subtotal rows for every combination of grouping column not covered by ROLLUP.

In this section, we show how to find the following:

  • Account balance subtotals for each nation_nm
  • Account balance subtotals for each region_nm
  • Account balance subtotals for each group of region_nm and nation_nm combination
  • Overall total account balance for all regions

Run the following SQL statement using CUBE:

SELECT region_nm, nation_nm, sum(acct_balance) as total_balance
FROM supp_sample
WHERE region_nm in ('AFRICA','AMERICA','ASIA') GROUP BY CUBE(region_nm, nation_nm)
ORDER BY region_nm, nation_nm;

In the preceding query, we added a filter to limit results for easy explanation. You can remove this filter in your test to view data for all regions.

In the following result sets, you can see the subtotals at region level (marked in green). These subtotal records are the same records generated by ROLLUP. Additionally, CUBE generated subtotals for each nation_nm (marked in yellow). Finally, you can also see the grand total for all three regions mentioned in the query (marked in red).

CUBE is structurally equivalent to the following GROUPING SETS query:

SELECT region_nm, nation_nm, sum(acct_balance) as total_balance
FROM supp_sample
WHERE region_nm in ('AFRICA','AMERICA','ASIA') -- added the filter to limit results.  You can remove this filter in your test to view data for all regions
GROUP BY GROUPING SETS((region_nm, nation_nm), (region_nm), (nation_nm), ())
ORDER BY region_nm;

You can rewrite the preceding CUBE query using GROUPING SETS. However, using CUBE is a much simpler and readable construct for this use.

NULL values

NULL is a valid value in a column that participates in GROUPING SETS, ROLLUP, and CUBE, and it’s not aggregated with the NULL values added explicitly to the result set to satisfy the schema of returning tuples.

Let’s create an example table orders containing details about items ordered, item descriptions, and quantity of the items:

-- Create example orders table and insert sample records
CREATE TABLE orders(item_no int,description varchar,quantity int);
INSERT INTO orders(item_no,description,quantity)
VALUES
(101,'apples',10),
(102,null,15),
(103,'banana',20);

--View the data
SELECT * FROM orders;

We use the following ROLLUP query to aggregate quantities by item_no and description:

SELECT item_no, description, sum(quantity)
FROM orders
GROUP BY ROLLUP(item_no, description)
ORDER BY 1,2;

In the following result, there are two output rows for item_no 102. The row marked in green is the actual data record in the input, and the row marked in red is the subtotal record added by the ROLLUP function.

This demonstrates that NULL values in input are separate from the NULL values added by SQL aggregate extensions.

Grouping and Grouping_ID functions

GROUPING indicates whether a column in the GROUP BY list is aggregated or not. GROUPING(expr) returns 0 if a tuple is grouped on expr; otherwise it returns 1. GROUPING_ID(expr1, expr2, …, exprN) returns an integer representation of the bitmap that consists of GROUPING(expr1), GROUPING(expr2), …, GROUPING(exprN).

This feature helps us clearly understand the aggregation grain, slice and dice data, and apply filters when business users are performing analysis. Also provides auditability for the generated aggregations.

For example, let’s use the preceding supp_sampe table. The following ROLLUP query utilizes GROUPING and GROUPING_ID functions:

SELECT region_nm,
nation_nm,
sum(acct_balance) as total_balance,
GROUPING(region_nm) as gr,
GROUPING(nation_nm) as gn,
GROUPING_ID(region_nm, nation_nm) as grn
FROM supp_sample
GROUP BY ROLLUP(region_nm, nation_nm)
ORDER BY region_nm;

In the following result set, the rows rolled up at nation_nm have 1 value for gn. This indicates that the total_balance is the aggregated value for all the nation_nm values in the region. The last row has gr value as 1. It indicates that total_balance is an aggregated value at region level including all the nations. The grn is an integer representation of bitmap (11 in binary translated to 3 in integer representation).

Performance assessment

Performance is often a key factor, and we wanted to make sure we’re offering most performant SQL features in Amazon Redshift. We performed benchmarking with the 3 TB TPC-H public dataset on an Amazon Redshift cluster with different sizes (5-node Ra3-4XL, 2-node Ra3-4XL, 2-node-Ra3-XLPLUS). Additionally, we disabled query caching so that query results aren’t cached. This allows us to measure the performance of the database as opposed to its ability to serve results from cache. The results were consistent across multiple runs.

We loaded the supplier, region, and nation files from the 3 TB public dataset and created a view on top of those three tables, as shown in the following code. This query joins the three tables to create a unified record. The joined dataset is used for performance assessment.

create view v_supplier_balances as
select r.r_name region_nm,n.n_name nation_nm, s.s_acctbal acct_balance
from supplier s, nation n, region r
where
s.s_nationkey = n.n_nationkey
and n.n_regionkey = r.r_regionkey;

We ran the following example SELECT queries using GROUPING SETS, CUBE, and ROLLUP, and captured performance metrics in the following tables.
ROLLUP:

SELECT region_nm, nation_nm, sum(acct_balance) as total_balance
FROM v_supplier_balances
GROUP BY ROLLUP (region_nm, nation_nm)
ORDER BY region_nm;
Cluster Run 1 in ms Run 2 in ms Run 3 in ms
5-node-Ra3-4XL 120 118 117
2-node-Ra3-4XL 405 389 391
2-node-Ra3-XLPLUS 490 460 461

CUBE:

SELECT region_nm, nation_nm, sum(acct_balance) as total_balance
FROM v_supplier_balances
GROUP BY CUBE(region_nm, nation_nm)
ORDER BY region_nm;
Cluster Run 1 in ms Run 2 in ms Run 3 in ms
5-node-Ra3-4XL 224 215 214
2-node-Ra3-4XL 412 392 392
2-node-Ra3-XLPLUS 872 798 793

GROUPING SETS:

SELECT region_nm, nation_nm, sum(acct_balance) as total_balance
FROM v_supplier_balances
GROUP BY GROUPING SETS(region_nm, nation_nm)
ORDER BY region_nm;
Cluster Run 1 in ms Run 2 in ms Run 3 in ms
5-node-Ra3-4XL 210 198 198
2-node-Ra3-4XL 345 328 328
2-node-Ra3-XLPLUS 675 674 674

When we ran the same set of queries for ROLLUP and CUBE and ran with UNION ALL, we saw better performance with ROLLUP and CUBE functionality.

Cluster CUBE (run in ms) ROLLUP (run in ms) UNION ALL (run in ms)
5-node-Ra3-4XL 214 117 321
2-node-Ra3-4XL 392 391 543
2-node-Ra3-XLPLUS 793 461 932

Clean up

To clean up your resources, drop the tables and views you created while following along with the example in this post.

Conclusion

In this post, we talked about the new aggregated extensions ROLLUP, CUBE, and GROUPING SETS added to Amazon Redshift. We also discussed general uses cases, implementation examples, and performance results. You can simplify your existing aggregation queries using these new SQL aggregation extensions and use them in future development for building more simplified, readable queries. If you have any feedback or questions, please leave them in the comments section.


About the Authors

Satesh Sonti is a Sr. Analytics Specialist Solutions Architect based out of Atlanta, specialized in building enterprise data platforms, data warehousing, and analytics solutions. He has over 16 years of experience in building data assets and leading complex data platform programs for banking and insurance clients across the globe.

Yanzhu Ji is a Product Manager on the Amazon Redshift team. She worked on the Amazon Redshift team as a Software Engineer before becoming a Product Manager. She has a rich experience of how the customer-facing Amazon Redshift features are built from planning to launching, and always treats customers’ requirements as first priority. In her personal life, Yanzhu likes painting, photography, and playing tennis.

Dinesh Kumar is a Database Engineer with more than a decade of experience working in the databases, data warehousing, and analytics space. Outside of work, he enjoys trying different cuisines and spending time with his family and friends.

Introducing Client-side Evaluation for Amazon CloudWatch Evidently

Post Syndicated from Cole Thienes original https://aws.amazon.com/blogs/architecture/introducing-client-side-evaluation-for-amazon-cloudwatch-evidently/

Amazon CloudWatch Evidently enables developers to test new features on a small percentage of traffic and gauge the outcome before rolling it out to the rest of their users. Evidently feature flags are defined ahead of your release and, at runtime, your application code queries a remote service to determine whether to show the new feature to a given user. The remote call to fetch feature flags for a user is susceptible to network latency, adding several hundred milliseconds of delay in bad cases. Any additional latency added to fetching feature flags can directly impact the speed of a web page, where milliseconds matter. Our solution: client-side evaluation for Amazon CloudWatch Evidently. With client-side evaluation, developers can significantly decrease latency by fetching feature flags locally and avoiding network overhead altogether.

The term “client-side” does not refer to the browser in this case, but the operation taking place on your container application rather than through a remote API call. This removes the need for a network call by fetching feature flags for a user from the AWS AppConfig agent—a sidecar container running alongside your container application backend. The agent enables container runtimes to leverage AWS AppConfig, a service allowing customers to change the way an application behaves while running without deploying new code. In this post, we’ll walk through the solution architecture and how to instrument client-side evaluation in an Amazon Elastic Container Service (Amazon ECS) application.

Overview of solution

Figure 1 illustrates how client-side evaluation works in an application running on Amazon ECS. The webpage makes a call to the webpage backend to determine which website features to show an end-user. Let’s explore how this works.

High-level architecture of Client-side Evaluation for Amazon CloudWatch Evidently

Figure 1. High-level architecture of client-side evaluation for Amazon CloudWatch Evidently

  1. Create an Evidently project, feature, and launch through the AWS Console Mobile Application, API, or AWS CloudFormation.
  2. Create an Amazon ECS task for the backend application container and attach an AWS AppConfig agent container to the task. At runtime, the application container will invoke the EvaluateFeature API to fetch feature flags. Without client-side evaluation, this API call would perform a remote call to the Evidently cloud service.
  3. With client-side evaluation, the API call is made from the application container to the AWS AppConfig agent container on localhost, short-circuiting the network overhead.
  4. Evidently maintains a synchronized copy of the Evidently features in an AWS AppConfig configuration within your AWS account. When subsequent changes are made to the features, the configuration is updated (usually within a minute).
  5. When the backend application initializes, the agent fetches the necessary configuration profile and caches it, polling periodically to refresh the cache. When the AWS AppConfig agent is invoked from the backend application, it evaluates the requested feature flag using the cached data.
  6. After each successful EvaluateFeature call, a transaction record is generated, called an evaluation event. This useful bookkeeping mechanism helps developers view data to tell which of their users saw what feature and when. As the evaluation events are generated, they are placed in a buffer within the agent. Once the buffer reaches a certain size or age, events in the buffer are uploaded to Evidently via the PutProjectEvents API.
  7. The evaluation events are then available for offline analysis in developer-configured storage, including CloudWatch Logs, Amazon Simple Storage Service (Amazon S3), and CloudWatch Metrics.

Walkthrough

Let’s take a practical example to demonstrate client-side evaluation. I have a simple webpage with a search bar on it. I’ve implemented a newer, fancier search bar, but I only want to show it to 10 percent of my visitors to make sure it doesn’t cause issues on my existing webpage before rolling it out to everyone, as in Figure 2.

Web page experience where 10% of users see the new search bar

Figure 2. A webpage experience where 10% of users see the new search bar

We could set up the necessary AWS resources by hand, but let’s use a pre-built AWS Cloud Development Kit (AWS CDK) example to save time. The sample code for this example is available on GitHub. Here are the high-level steps:

  1. Provision the infrastructure. The infrastructure will consist of:
    • An ECS service with a Virtual Private Cloud (Amazon VPC) to serve as our backend application and return the search bar variation
    • An Evidently launch to split the traffic between the two search bars
    • An AWS AppConfig environment, which the AWS AppConfig agent will fetch Evidently data from
  2. Test our webpage. Once our code is deployed, we will visit our webpage to fetch feature flags using client-side evaluation.
  3.  Clean up by removing our infrastructure.

Prerequisites

Steps

Clone the repository

First, clone the official AWS CDK example repository:

git clone https://github.com/aws-samples/aws-cdk-examples

The repository has many examples for setting up AWS infrastructure in CDK. Let’s go to a client-side evaluation example.

Code explanation

Let’s take a look at the code example. When we visit our web page, a request will be routed to our application deployed on AWS Fargate, which allows us to run containers directly using ECS without having to manage Elastic Compute Cloud (Amazon EC2) instances. The application code is written in Node.js with Typescript and leverages the Express framework:

<p><code>// local-image/app.ts</code></p><p><code>import * as express from 'express';</code><br /><code>import {Evidently} from '@aws-sdk/client-evidently';</code></p><p><code>const app = express();</code></p><p><code>const evidently = new Evidently({</code><br /><code>    endpoint: 'http://localhost:2772',</code><br /><code>    disableHostPrefix: true</code><br /><code>});</code></p><p><code>app.get("/", async (_, res) =&gt; {</code><br /><code>    try {</code><br /><code>        console.time('latency')</code><br /><code>        const evaluation = await evidently.evaluateFeature({</code><br /><code>            project: 'WebPage',</code><br /><code>            feature: 'SearchBar',</code><br /><code>            entityId: 'WebPageVisitor43'</code><br /><code>        })</code><br /><code>        console.timeEnd('latency')</code><br /><code>        res.send(evaluation.variation)</code><br /><code>    } catch (err: any) {</code><br /><code>        console.timeEnd('latency')</code><br /><code>        res.send(err.toString())</code><br /><code>    }</code><br /><code>});</code></p>

The container application will invoke the EvaluateFeature API using the AWS SDK for Javascript and return the search bar variation, either the old or new search bar. Here we also log the latency of the operation. The EvaluateFeature request is forwarded to the endpoint we configure for the Evidently client: http://localhost:2772. This is the local address where the AWS AppConfig agent can be reached. To make this possible, we add the AWS AppConfig agent as a container to the Amazon ECS task definition:

// index.ts

service.taskDefinition.addContainer('AppConfigAgent', {
    image: ecs.ContainerImage.fromRegistry('public.ecr.aws/aws-appconfig/aws-appconfig-agent:2.x'),
    portMappings: [{
        containerPort: 2772
    }]
})

We also need to set up an AppConfig environment for the Evidently project. This tells Evidently where to create the configuration to keep a synchronized copy of the features in the project:

// index.ts

const application = new appconfig.CfnApplication(this,'AppConfigApplication', {
    name: 'MyApplication'
});

const environment = new appconfig.CfnEnvironment(this, 'AppConfigEnvironment', {
    applicationId: application.ref,
    name: 'MyEnvironment'
});

const project = new evidently.CfnProject(this, 'EvidentlyProject', {
    name: 'WebPage',
    appConfigResource: {
        applicationId: application.ref,
        environmentId: environment.ref
    }
});

Finally, we set up an Evidently feature and launch that ensures only 10 percent of traffic receives the new search bar:

// index.ts

const launch = new evidently.CfnLaunch(this, 'EvidentlyLaunch', {
  project: project.name,
  name: 'MyLaunch',
  executionStatus: {
    status: 'START'
  },
  groups: [
    {
      feature: feature.name,
      variation: OLD_SEARCH_BAR,
      groupName: OLD_SEARCH_BAR
    },
    {
      feature: feature.name,
      variation: NEW_SEARCH_BAR,
      groupName: NEW_SEARCH_BAR
    }
  ],
  scheduledSplitsConfig: [{
    startTime: '2022-01-01T00:00:00Z',
    groupWeights: [
      {
        groupName: OLD_SEARCH_BAR,
        splitWeight: 90000
      },
      {
        groupName: NEW_SEARCH_BAR,
        splitWeight: 10000
      }
    ]
  }]
})

We start the launch immediately by setting executionStatus to START and startTime to a timestamp in the past. If you want to wait to show the new search bar, we can specify a future start time.

Install dependencies

Install the necessary Node modules:

npm install

Build and deploy

Build the AWS CDK template from the source code:

npm run build

Before deploying the app:

  1. Ensure that you set up AWS credentials in your environment.
  2. The AWS CDK Toolkit is bootstrapped in your AWS account.
  3. Confirm the max number of VPCs has not been reached in your AWS account; the Amazon ECS service will deploy an Amazon VPC.

After that, you can deploy the AWS CDK template to your AWS account:

cdk deploy

Test the webpage

After the previous step, you should see the following output in your console:

Console output showing a successful CDK deployment

Figure 3. Console output showing a successful AWS CDK deployment

In your browser, visit the URL specified by the FargateServiceServiceURL output above and you will see oldSearchBar. We can visit our CloudWatch Logs from the Amazon ECS task to see our application logs. Go to the AWS console and visit the CloudWatch log groups page and visit the log group with the prefix EvidentlyClientSideEvaluationEcs. There, we can see that fetching feature flags took under two milliseconds, as in Figure 4.

CloudWatch Logs for the Amazon ECS task showing an EvaluateFeature latency of 1.238 milliseconds

Figure 4. CloudWatch Logs for the Amazon ECS task showing an EvaluateFeature latency of 1.238 milliseconds

Additionally, we can see how visitors have seen each version of the search bar. On the AWS console, visit the CloudWatch metrics page and see the Evidently metrics under All > Evidently > Feature, Project, Variation, as in Figure 5:

CloudWatch metrics showing the VariationCount: the number of times each feature flag variation was fetched

Figure 5. CloudWatch metrics showing the VariationCount (the number of times each feature flag variation was fetched)

We can increase or decrease the percentage of visitors seeing the new search bar at any time. On the AWS console, go to the CloudWatch Evidently page and go to Projects > WebPage > Launches > MyLaunch > Modify launch traffic and adjust the Traffic percentage, as in Figure 6.

Adjusting the traffic percentage of an Evidently launch

Figure 6. Adjusting the traffic percentage of an Evidently launch

Cleaning up

To avoid incurring future charges, delete the resources. Let’s run:

cdk destroy

You can confirm the removal by going into CloudFormation and confirming the resources were deleted.

Conclusion

In this blog post, we learned how to set up a web page backend in Amazon ECS with client-side evaluation for Amazon CloudWatch Evidently. We easily deployed the example CloudFormation stack with AWS CDK Toolkit. Then we visited the example webpage and demonstrated the improved speed of fetching feature flags with client-side evaluation. If you’re interested in using client-side evaluation with AWS Lambda instead of Amazon ECS, check out this AWS CDK example.

Securely validate business application resilience with AWS FIS and IAM

Post Syndicated from Dr. Rudolf Potucek original https://aws.amazon.com/blogs/devops/securely-validate-business-application-resilience-with-aws-fis-and-iam/

To avoid high costs of downtime, mission critical applications in the cloud need to achieve resilience against degradation of cloud provider APIs and services.

In 2021, AWS launched AWS Fault Injection Simulator (FIS), a fully managed service to perform fault injection experiments on workloads in AWS to improve their reliability and resilience. At the time of writing, FIS allows to simulate degradation of Amazon Elastic Compute Cloud (EC2) APIs using API fault injection actions and thus explore the resilience of workflows where EC2 APIs act as a fault boundary. 

In this post we show you how to explore additional fault boundaries in your applications by selectively denying access to any AWS API. This technique is particularly useful for fully managed, “black box” services like Amazon Simple Storage Service (S3) or Amazon Simple Queue Service (SQS) where a failure of read or write operations is sufficient to simulate problems in the service. This technique is also useful for injecting failures in serverless applications without needing to modify code. While similar results could be achieved with network disruption or modifying code with feature flags, this approach provides a fine granular degradation of an AWS API without the need to re-deploy and re-validate code.

Overview

We will explore a common application pattern: user uploads a file, S3 triggers an AWS Lambda function, Lambda transforms the file to a new location and deletes the original:

S3 upload and transform logical workflow: User uploads file to S3, upload triggers AWS Lambda execution, Lambda writes transformed file to a new bucket and deletes original. Workflow can be disrupted at file deletion.

Figure 1. S3 upload and transform logical workflow: User uploads file to S3, upload triggers AWS Lambda execution, Lambda writes transformed file to a new bucket and deletes original. Workflow can be disrupted at file deletion.

We will simulate the user upload with an Amazon EventBridge rate expression triggering an AWS Lambda function which creates a file in S3:

S3 upload and transform implemented demo workflow: Amazon EventBridge triggers a creator Lambda function, Lambda function creates a file in S3, file creation triggers AWS Lambda execution on transformer function, Lambda writes transformed file to a new bucket and deletes original. Workflow can be disrupted at file deletion.

Figure 2. S3 upload and transform implemented demo workflow: Amazon EventBridge triggers a creator Lambda function, Lambda function creates a file in S3, file creation triggers AWS Lambda execution on transformer function, Lambda writes transformed file to a new bucket and deletes original. Workflow can be disrupted at file deletion.

Using this architecture we can explore the effect of S3 API degradation during file creation and deletion. As shown, the API call to delete a file from S3 is an application fault boundary. The failure could occur, with identical effect, because of S3 degradation or because the AWS IAM role of the Lambda function denies access to the API.

To inject failures we use AWS Systems Manager (AWS SSM) automation documents to attach and detach IAM policies at the API fault boundary and FIS to orchestrate the workflow.

Each Lambda function has an IAM execution role that allows S3 write and delete access, respectively. If the processor Lambda fails, the S3 file will remain in the bucket, indicating a failure. Similarly, if the IAM execution role for the processor function is denied the ability to delete a file after processing, that file will remain in the S3 bucket.

Prerequisites

Following this blog posts will incur some costs for AWS services. To explore this test application you will need an AWS account. We will also assume that you are using AWS CloudShell or have the AWS CLI installed and have configured a profile with administrator permissions. With that in place you can create the demo application in your AWS account by downloading this template and deploying an AWS CloudFormation stack:

git clone https://github.com/aws-samples/fis-api-failure-injection-using-iam.git
cd fis-api-failure-injection-using-iam
aws cloudformation deploy --stack-name test-fis-api-faults --template-file template.yaml --capabilities CAPABILITY_NAMED_IAM

Fault injection using IAM

Once the stack has been created, navigate to the Amazon CloudWatch Logs console and filter for /aws/lambda/test-fis-api-faults. Under the EventBridgeTimerHandler log group you should find log events once a minute writing a timestamped file to an S3 bucket named fis-api-failure-ACCOUNT_ID. Under the S3TriggerHandler log group you should find matching deletion events for those files.

Once you have confirmed object creation/deletion, let’s take away the permission of the S3 trigger handler lambda to delete files. To do this you will attach the FISAPI-DenyS3DeleteObject  policy that was created with the template:

ROLE_NAME=FISAPI-TARGET-S3TriggerHandlerRole
ROLE_ARN=$( aws iam list-roles --query "Roles[?RoleName=='${ROLE_NAME}'].Arn" --output text )
echo Target Role ARN: $ROLE_ARN

POLICY_NAME=FISAPI-DenyS3DeleteObject
POLICY_ARN=$( aws iam list-policies --query "Policies[?PolicyName=='${POLICY_NAME}'].Arn" --output text )
echo Impact Policy ARN: $POLICY_ARN

aws iam attach-role-policy \
  --role-name ${ROLE_NAME}\
  --policy-arn ${POLICY_ARN}

With the deny policy in place you should now see object deletion fail and objects should start showing up in the S3 bucket. Navigate to the S3 console and find the bucket starting with fis-api-failure. You should see a new object appearing in this bucket once a minute:

S3 bucket listing showing files not being deleted because IAM permissions DENY file deletion during FIS experiment.

Figure 3. S3 bucket listing showing files not being deleted because IAM permissions DENY file deletion during FIS experiment.

If you would like to graph the results you can navigate to AWS CloudWatch, select “Logs Insights“, select the log group starting with /aws/lambda/test-fis-api-faults-S3CountObjectsHandler, and run this query:

fields @timestamp, @message
| filter NumObjects >= 0
| sort @timestamp desc
| stats max(NumObjects) by bin(1m)
| limit 20

This will show the number of files in the S3 bucket over time:

AWS CloudWatch Logs Insights graph showing the increase in the number of retained files in S3 bucket over time, demonstrating the effect of the introduced failure.

Figure 4. AWS CloudWatch Logs Insights graph showing the increase in the number of retained files in S3 bucket over time, demonstrating the effect of the introduced failure.

You can now detach the policy:

ROLE_NAME=FISAPI-TARGET-S3TriggerHandlerRole
ROLE_ARN=$( aws iam list-roles --query "Roles[?RoleName=='${ROLE_NAME}'].Arn" --output text )
echo Target Role ARN: $ROLE_ARN

POLICY_NAME=FISAPI-DenyS3DeleteObject
POLICY_ARN=$( aws iam list-policies --query "Policies[?PolicyName=='${POLICY_NAME}'].Arn" --output text )
echo Impact Policy ARN: $POLICY_ARN

aws iam detach-role-policy \
  --role-name ${ROLE_NAME}\
  --policy-arn ${POLICY_ARN}

We see that newly written files will once again be deleted but the un-processed files will remain in the S3 bucket. From the fault injection we learned that our system does not tolerate request failures when deleting files from S3. To address this, we should add a dead letter queue or some other retry mechanism.

Note: if the Lambda function does not return a success state on invocation, EventBridge will retry. In our Lambda functions we are cost conscious and explicitly capture the failure states to avoid excessive retries.

Fault injection using SSM

To use this approach from FIS and to always remove the policy at the end of the experiment, we first create an SSM document to automate adding a policy to a role. To inspect this document, open the SSM console, navigate to the “Documents” section, find the FISAPI-IamAttachDetach document under “Owned by me”, and examine the “Content” tab (make sure to select the correct region). This document takes the name of the Role you want to impact and the Policy you want to attach as parameters. It also requires an IAM execution role that grants it the power to list, attach, and detach specific policies to specific roles.

Let’s run the SSM automation document from the console by selecting “Execute Automation”. Determine the ARN of the FISAPI-SSM-Automation-Role from CloudFormation or by running:

POLICY_NAME=FISAPI-DenyS3DeleteObject
POLICY_ARN=$( aws iam list-policies --query "Policies[?PolicyName=='${POLICY_NAME}'].Arn" --output text )
echo Impact Policy ARN: $POLICY_ARN

Use FISAPI-SSM-Automation-Role, a duration of 2 minutes expressed in ISO8601 format as PT2M, the ARN of the deny policy, and the name of the target role FISAPI-TARGET-S3TriggerHandlerRole:

Image of parameter input field reflecting the instructions in blog text.

Figure 5. Image of parameter input field reflecting the instructions in blog text.

Alternatively execute this from a shell:

ASSUME_ROLE_NAME=FISAPI-SSM-Automation-Role
ASSUME_ROLE_ARN=$( aws iam list-roles --query "Roles[?RoleName=='${ASSUME_ROLE_NAME}'].Arn" --output text )
echo Assume Role ARN: $ASSUME_ROLE_ARN

ROLE_NAME=FISAPI-TARGET-S3TriggerHandlerRole
ROLE_ARN=$( aws iam list-roles --query "Roles[?RoleName=='${ROLE_NAME}'].Arn" --output text )
echo Target Role ARN: $ROLE_ARN

POLICY_NAME=FISAPI-DenyS3DeleteObject
POLICY_ARN=$( aws iam list-policies --query "Policies[?PolicyName=='${POLICY_NAME}'].Arn" --output text )
echo Impact Policy ARN: $POLICY_ARN

aws ssm start-automation-execution \
  --document-name FISAPI-IamAttachDetach \
  --parameters "{
      \"AutomationAssumeRole\": [ \"${ASSUME_ROLE_ARN}\" ],
      \"Duration\": [ \"PT2M\" ],
      \"TargetResourceDenyPolicyArn\": [\"${POLICY_ARN}\" ],
      \"TargetApplicationRoleName\": [ \"${ROLE_NAME}\" ]
    }"

Wait two minutes and then examine the content of the S3 bucket starting with fis-api-failure again. You should now see two additional files in the bucket, showing that the policy was attached for 2 minutes during which files could not be deleted, and confirming that our application is not resilient to S3 API degradation.

Permissions for injecting failures with SSM

Fault injection with SSM is controlled by IAM, which is why you had to specify the FISAPI-SSM-Automation-Role:

Visual representation of IAM permission used for fault injections with SSM. It shows the SSM execution role permitting access to use SSM automation documents as well as modify IAM roles and policies via the SSM document. It also shows the SSM user needing to have a pass-role permission to grant the SSM execution role to the SSM service.

Figure 6. Visual representation of IAM permission used for fault injections with SSM.

This role needs to contain an assume role policy statement for SSM to allow assuming the role:

      AssumeRolePolicyDocument:
        Statement:
          - Action:
             - 'sts:AssumeRole'
            Effect: Allow
            Principal:
              Service:
                - "ssm.amazonaws.com"

The role also needs to contain permissions to describe roles and their attached policies with an optional constraint on which roles and policies are visible:

          - Sid: GetRoleAndPolicyDetails
            Effect: Allow
            Action:
              - 'iam:GetRole'
              - 'iam:GetPolicy'
              - 'iam:ListAttachedRolePolicies'
            Resource:
              # Roles
              - !GetAtt EventBridgeTimerHandlerRole.Arn
              - !GetAtt S3TriggerHandlerRole.Arn
              # Policies
              - !Ref AwsFisApiPolicyDenyS3DeleteObject

Finally the SSM role needs to allow attaching and detaching a policy document. This requires

  1. an ALLOW statement
  2. a constraint on the policies that can be attached
  3. a constraint on the roles that can be attached to

In the role we collapse the first two requirements into an ALLOW statement with a condition constraint for the Policy ARN. We then express the third requirement in a DENY statement that will limit the '*' resource to only the explicit role ARNs we want to modify:

          - Sid: AllowOnlyTargetResourcePolicies
            Effect: Allow
            Action:  
              - 'iam:DetachRolePolicy'
              - 'iam:AttachRolePolicy'
            Resource: '*'
            Condition:
              ArnEquals:
                'iam:PolicyARN':
                  # Policies that can be attached
                  - !Ref AwsFisApiPolicyDenyS3DeleteObject
          - Sid: DenyAttachDetachAllRolesExceptApplicationRole
            Effect: Deny
            Action: 
              - 'iam:DetachRolePolicy'
              - 'iam:AttachRolePolicy'
            NotResource: 
              # Roles that can be attached to
              - !GetAtt EventBridgeTimerHandlerRole.Arn
              - !GetAtt S3TriggerHandlerRole.Arn

We will discuss security considerations in more detail at the end of this post.

Fault injection using FIS

With the SSM document in place you can now create an FIS template that calls the SSM document. Navigate to the FIS console and filter for FISAPI-DENY-S3PutObject. You should see that the experiment template passes the same parameters that you previously used with SSM:

Image of FIS experiment template action summary. This shows the SSM document ARN to be used for fault injection and the JSON parameters passed to the SSM document specifying the IAM Role to modify and the IAM Policy to use.

Figure 7. Image of FIS experiment template action summary. This shows the SSM document ARN to be used for fault injection and the JSON parameters passed to the SSM document specifying the IAM Role to modify and the IAM Policy to use.

You can now run the FIS experiment and after a couple minutes once again see new files in the S3 bucket.

Permissions for injecting failures with FIS and SSM

Fault injection with FIS is controlled by IAM, which is why you had to specify the FISAPI-FIS-Injection-EperimentRole:

Visual representation of IAM permission used for fault injections with FIS and SSM. It shows the SSM execution role permitting access to use SSM automation documents as well as modify IAM roles and policies via the SSM document. It also shows the FIS execution role permitting access to use FIS templates, as well as the pass-role permission to grant the SSM execution role to the SSM service. Finally it shows the FIS user needing to have a pass-role permission to grant the FIS execution role to the FIS service.

Figure 8. Visual representation of IAM permission used for fault injections with FIS and SSM. It shows the SSM execution role permitting access to use SSM automation documents as well as modify IAM roles and policies via the SSM document. It also shows the FIS execution role permitting access to use FIS templates, as well as the pass-role permission to grant the SSM execution role to the SSM service. Finally it shows the FIS user needing to have a pass-role permission to grant the FIS execution role to the FIS service.

This role needs to contain an assume role policy statement for FIS to allow assuming the role:

      AssumeRolePolicyDocument:
        Statement:
          - Action:
              - 'sts:AssumeRole'
            Effect: Allow
            Principal:
              Service:
                - "fis.amazonaws.com"

The role also needs permissions to list and execute SSM documents:

            - Sid: RequiredReadActionsforAWSFIS
              Effect: Allow
              Action:
                - 'cloudwatch:DescribeAlarms'
                - 'ssm:GetAutomationExecution'
                - 'ssm:ListCommands'
                - 'iam:ListRoles'
              Resource: '*'
            - Sid: RequiredSSMStopActionforAWSFIS
              Effect: Allow
              Action:
                - 'ssm:CancelCommand'
              Resource: '*'
            - Sid: RequiredSSMWriteActionsforAWSFIS
              Effect: Allow
              Action:
                - 'ssm:StartAutomationExecution'
                - 'ssm:StopAutomationExecution'
              Resource: 
                - !Sub 'arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:automation-definition/${SsmAutomationIamAttachDetachDocument}:$DEFAULT'

Finally, remember that the SSM document needs to use a Role of its own to execute the fault injection actions. Because that Role is different from the Role under which we started the FIS experiment, we need to explicitly allow SSM to assume that role with a PassRole statement which will expand to FISAPI-SSM-Automation-Role:

            - Sid: RequiredIAMPassRoleforSSMADocuments
              Effect: Allow
              Action: 'iam:PassRole'
              Resource: !Sub 'arn:aws:iam::${AWS::AccountId}:role/${SsmAutomationRole}'

Secure and flexible permissions

So far, we have used explicit ARNs for our guardrails. To expand flexibility, we can use wildcards in our resource matching. For example, we might change the Policy matching from:

            Condition:
              ArnEquals:
                'iam:PolicyARN':
                  # Explicitly listed policies - secure but inflexible
                  - !Ref AwsFisApiPolicyDenyS3DeleteObject

or the equivalent:

            Condition:
              ArnEquals:
                'iam:PolicyARN':
                  # Explicitly listed policies - secure but inflexible
                  - !Sub 'arn:${AWS::Partition}:iam::${AWS::AccountId}:policy/${FullPolicyName}

to a wildcard notation like this:

            Condition:
              ArnEquals:
                'iam:PolicyARN':
                  # Wildcard policies - secure and flexible
                  - !Sub 'arn:${AWS::Partition}:iam::${AWS::AccountId}:policy/${PolicyNamePrefix}*'

If we set PolicyNamePrefix to FISAPI-DenyS3 this would now allow invoking FISAPI-DenyS3PutObject and FISAPI-DenyS3DeleteObject but would not allow using a policy named FISAPI-DenyEc2DescribeInstances.

Similarly, we could change the Resource matching from:

            NotResource: 
              # Explicitly listed roles - secure but inflexible
              - !GetAtt EventBridgeTimerHandlerRole.Arn
              - !GetAtt S3TriggerHandlerRole.Arn

to a wildcard equivalent like this:

            NotResource: 
              # Wildcard policies - secure and flexible
              - !Sub 'arn:${AWS::Partition}:iam::${AWS::AccountId}:role/${RoleNamePrefixEventBridge}*'
              - !Sub 'arn:${AWS::Partition}:iam::${AWS::AccountId}:role/${RoleNamePrefixS3}*'
and setting RoleNamePrefixEventBridge to FISAPI-TARGET-EventBridge and RoleNamePrefixS3 to FISAPI-TARGET-S3.

Finally, we would also change the FIS experiment role to allow SSM documents based on a name prefix by changing the constraint on automation execution from:

            - Sid: RequiredSSMWriteActionsforAWSFIS
              Effect: Allow
              Action:
                - 'ssm:StartAutomationExecution'
                - 'ssm:StopAutomationExecution'
              Resource: 
                # Explicitly listed resource - secure but inflexible
                # Note: the $DEFAULT at the end could also be an explicit version number
                # Note: the 'automation-definition' is automatically created from 'document' on invocation
                - !Sub 'arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:automation-definition/${SsmAutomationIamAttachDetachDocument}:$DEFAULT'

to

            - Sid: RequiredSSMWriteActionsforAWSFIS
              Effect: Allow
              Action:
                - 'ssm:StartAutomationExecution'
                - 'ssm:StopAutomationExecution'
              Resource: 
                # Wildcard resources - secure and flexible
                # 
                # Note: the 'automation-definition' is automatically created from 'document' on invocation
                - !Sub 'arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:automation-definition/${SsmAutomationDocumentPrefix}*'

and setting SsmAutomationDocumentPrefix to FISAPI-. Test this by updating the CloudFormation stack with a modified template:

aws cloudformation deploy --stack-name test-fis-api-faults --template-file template2.yaml --capabilities CAPABILITY_NAMED_IAM

Permissions governing users

In production you should not be using administrator access to use FIS. Instead we create two roles FISAPI-AssumableRoleWithCreation and FISAPI-AssumableRoleWithoutCreation for you (see this template). These roles require all FIS and SSM resources to have a Name tag that starts with FISAPI-. Try assuming the role without creation privileges and running an experiment. You will notice that you can only start an experiment if you add a Name tag, e.g. FISAPI-secure-1, and you will only be able to get details of experiments and templates that have proper Name tags.

If you are working with AWS Organizations, you can add further guard rails by defining SCPs that control the use of the FISAPI-* tags similar to this blog post.

Caveats

For this solution we are choosing to attach policies instead of permission boundaries. The benefit of this is that you can attach multiple independent policies and thus simulate multi-step service degradation. However, this means that it is possible to increase the permission level of a role. While there are situations where this might be of interest, e.g. to simulate security breaches, please implement a thorough security review of any fault injection IAM policies you create. Note that modifying IAM Roles may trigger events in your security monitoring tools.

The AttachRolePolicy and DetachRolePolicy calls from AWS IAM are eventually consistent, meaning that in some cases permission propagation when starting and stopping fault injection may take up to 5 minutes each.

Cleanup

To avoid additional cost, delete the content of the S3 bucket and delete the CloudFormation stack:

# Clean up policy attachments just in case
CLEANUP_ROLES=$(aws iam list-roles --query "Roles[?starts_with(RoleName,'FISAPI-')].RoleName" --output text)
for role in $CLEANUP_ROLES; do
  CLEANUP_POLICIES=$(aws iam list-attached-role-policies --role-name $role --query "AttachedPolicies[?starts_with(PolicyName,'FISAPI-')].PolicyName" --output text)
  for policy in $CLEANUP_POLICIES; do
    echo Detaching policy $policy from role $role
    aws iam detach-role-policy --role-name $role --policy-arn $policy
  done
done
# Delete S3 bucket content
ACCOUNT_ID=$( aws sts get-caller-identity --query Account --output text )
S3_BUCKET_NAME=fis-api-failure-${ACCOUNT_ID}
aws s3 rm --recursive s3://${S3_BUCKET_NAME}
aws s3 rb s3://${S3_BUCKET_NAME}
# Delete cloudformation stack
aws cloudformation delete-stack --stack-name test-fis-api-faults
aws cloudformation wait stack-delete-complete --stack-name test-fis-api-faults

Conclusion 

AWS Fault Injection Simulator provides the ability to simulate various external impacts to your application to validate and improve resilience. We’ve shown how combining FIS with IAM to selectively deny access to AWS APIs provides a generic path to explore fault boundaries across all AWS services. We’ve shown how this can be used to identify and improve a resilience problem in a common S3 upload workflow. To learn about more ways to use FIS, see this workshop.

About the authors:

Dr. Rudolf Potucek

Dr. Rudolf Potucek is Startup Solutions Architect at Amazon Web Services. Over the past 30 years he gained a PhD and worked in different roles including leading teams in academia and industry, as well as consulting. He brings experience from working with academia, startups, and large enterprises to his current role of guiding startup customers to succeed in the cloud.

Rudolph Wagner

Rudolph Wagner is a Premium Support Engineer at Amazon Web Services who holds the CISSP and OSCP security certifications, in addition to being a certified AWS Solutions Architect Professional. He assists internal and external Customers with multiple AWS services by using his diverse background in SAP, IT, and construction.

Configure ADFS Identity Federation with Amazon QuickSight

Post Syndicated from Adeleke Coker original https://aws.amazon.com/blogs/big-data/configure-adfs-identity-federation-with-amazon-quicksight/

Amazon QuickSight Enterprise edition can integrate with your existing Microsoft Active Directory (AD), providing federated access using Security Assertion Markup Language (SAML) to dashboards. Using existing identities from Active Directory eliminates the need to create and manage separate user identities in AWS Identity Access Management (IAM). Federated users assume an IAM role when access is requested through an identity provider (IdP) such as Active Directory Federation Service (AD FS) based on AD group membership. Although, you can connect AD to QuickSight using AWS Directory Service, this blog focuses on federated logon to QuickSight Dashboards.

With identity federation, your users get one-click access to Amazon QuickSight applications using their existing identity credentials. You also have the security benefit of identity authentication by your IdP. You can control which users have access to QuickSight using your existing IdP. Refer to Using identity federation and single sign-on (SSO) with Amazon QuickSight for more information.

In this post, we demonstrate how you can use a corporate email address as an authentication option for signing in to QuickSight. This post assumes you have an existing Microsoft Active Directory Federation Services (ADFS) configured in your environment.

Solution overview

While connecting to QuickSight from an IdP, your users initiate the sign-in process from the IdP portal. After the users are authenticated, they are automatically signed in to QuickSight. After QuickSight checks that they are authorized, your users can access QuickSight.

The following diagram shows an authentication flow between QuickSight and a third-party IdP. In this example, the administrator has set up a sign-in page to access QuickSight. When a user signs in, the sign-in page posts a request to a federation service that complies with SAML 2.0. The end-user initiates authentication from the sign-in page of the IdP. For more information about the authentication flow, see Initiating sign-on from the identity provider (IdP).

QuickSight IdP flow

The solution consists of the following high-level steps:

  1. Create an identity provider.
  2. Create IAM policies.
  3. Create IAM roles.
  4. Configure AD groups and users.
  5. Create a relying party trust.
  6. Configure claim rules.
  7. Configure QuickSight single sign-on (SSO).
  8. Configure the relay state URL for QuickStart.

Prerequisites

The following are the prerequisites to build the solution explained in this post:

  • An existing or newly deployed AD FS environment.
  • An AD user with permissions to manage AD FS and AD group membership.
  • An IAM user with permissions to create IAM policies and roles, and administer QuickSight.
  • The metadata document from your IdP. To download it, refer to Federation Metadata Explorer.

Create an identity provider

To add your IdP, complete the following steps:

  1. On the IAM console, choose Identity providers in the navigation pane.
  2. Choose Add provider.
  3. For Provider type¸ select SAML.
  4. For Provider name, enter a name (for example, QuickSight_Federation).
  5. For Metadata document, upload the metadata document you downloaded as a prerequisite.
  6. Choose Add provider.
  7. Copy the ARN of this provider to use in a later step.

Add IdP in IAM

Create IAM policies

In this step, you create IAM policies that allow users to access QuickSight only after federating their identities. To provide access to QuickSight and also the ability to create QuickSight admins, authors (standard users), and readers, use the following policy examples.

The following code is the author policy:

{
    "Statement": [
        {
            "Action": [
                "quicksight:CreateUser"
            ],
            "Effect": "Allow",
            "Resource": [
                "*"
            ]
        }
    ],
    "Version": "2012-10-17"
}

The following code is the reader policy:

{ 
"Version": "2012-10-17", 
"Statement": [ 
{ 
"Effect": "Allow",
"Action": "quicksight:CreateReader", 
"Resource": "*" 
} 
] 
}

The following code is the admin policy:

{
    "Statement": [
        {
            "Action": [
                "quicksight:CreateAdmin"
            ],
            "Effect": "Allow",
            "Resource": [
                "*"
            ]
        }
    ],
    "Version": "2012-10-17"
}

Create IAM roles

You can configure email addresses for your users to use when provisioning through your IdP to QuickSight. To do this, add the sts:TagSession action to the trust relationship for the IAM role that you use with AssumeRoleWithSAML. Make sure the IAM role names start with ADFS-.

  1. On the IAM console, choose Roles in the navigation pane.
  2. Choose Create new role.
  3. For Trusted entity type, select SAML 2.0 federation.
  4. Choose the SAML IdP you created earlier.
  5. Select Allow programmatic and AWS Management Console access.
  6. Choose Next.
    Create IAM Roles
  7. Choose the admin policy you created, then choose Next.
  8. For Name, enter ADFS-ACCOUNTID-QSAdmin.
  9. Choose Create.
  10. On the Trust relationships tab, edit the trust relationships as follows so you can pass principal tags when users assume the role (provide your account ID and IdP):
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": "arn:aws:iam::ACCOUNTID:saml-provider/Identity_Provider"
            },
            "Action":[ "sts:AssumeRoleWithSAML",
	 "sts:TagSession"],
            "Condition": {
                "StringEquals": {
                    "SAML:aud": "https://signin.aws.amazon.com/saml"
                },
                "StringLike": {
                    "aws:RequestTag/Email": "*"
                }
            }
            }
    ]
}
  1. Repeat this process for the role ADFS-ACCOUNTID-QSAuthor and attach the author IAM policy.
  2. Repeat this process for the role ADFS-ACCOUNTID-QSReader and attach the reader IAM policy.

Configure AD groups and users

Now you need to create AD groups that determine the permissions to sign in to AWS. Create an AD security group for each of the three roles you created earlier. Note that the group name should follow same format as your IAM role names.

One approach for creating the AD groups that uniquely identify the IAM role mapping is by selecting a common group naming convention. For example, your AD groups would start with an identifier, for example AWS-, which will distinguish your AWS groups from others within the organization. Next, include the 12-digit AWS account number. Finally, add the matching role name within the AWS account. You should do this for each role and corresponding AWS account you wish to support with federated access. The following screenshot shows an example of the naming convention we use in this post.

AD Groups

Later in this post, we create a rule to pick up AD groups starting with AWS-, the rule will remove AWS-ACCOUNTID- from AD groups name to match the respective IAM role, which is why we use this naming convention here.

Users in Active Directory can subsequently be added to the groups, providing the ability to assume access to the corresponding roles in AWS. You can add AD users to the respective groups based on your business permissions model. Note that each user must have an email address configured in Active Directory.

Create a relying party trust

To add a relying party trust, complete the following steps:

  1. Open the AD FS Management Console.
  2. Choose (right-click) Relying Party Trusts, then choose Add Relying Party Trust.
    Add Relying Party Trust
  3. Choose Claims aware, then choose Start.
  4. Select Import data about the relying party published online or on a local network.
  5. For Federation metadata address, enter https://signin.aws.amazon.com/static/saml-metadata.xml.
  6. Choose Next.
    ADFS Wizard Data Source
  7. Enter a descriptive display name, for example Amazon QuickSight Federation, then choose Next.
  8. Choose your access control policy (for this post, Permit everyone), then choose Next.
    ADFS Access Control
  9. In the Ready to Add Trust section, choose Next.
    ADFS Ready to Add
  10. Leave the defaults, then choose Close.

Configure claim rules

In this section, you create claim rules that identify accounts, set LDAP attributes, get the AD groups, and match them to the roles created earlier. Complete the following steps to create the claim rules for NameId, RoleSessionName, Get AD Groups, Roles, and (optionally) Session Duration:

  1. Select the relying party trust you just created, then choose Edit Claim Issuance Policy.
  2. Add a rule called NameId with the following parameters:
    1. For Claim rule template, choose Transform an Incoming Claim.
    2. For Claim rule name, enter NameId
    3. For Incoming claim type, choose Windows account name.
    4. For Outgoing claim type, choose Name ID.
    5. For Outgoing name ID format, choose Persistent Identifier.
    6. Select Pass through all claim values.
    7. Choose Finish.
      NameId
  3. Add a rule called RoleSessionName with the following parameters:
    1. For Claim rule template, choose Send LDAP Attributes as Claims.
    2. For Claim rule name, enter RoleSessionName.
    3. For Attribute store, choose Active Directory.
    4. For LDAP Attribute, choose E-Mail-Addresses.
    5. For Outgoing claim type, enter https://aws.amazon.com/SAML/Attributes/RoleSessionName.
    6. Add another E-Mail-Addresses LDAP attribute and for Outgoing claim type, enter https://aws.amazon.com/SAML/Attributes/PrincipalTag:Email.
    7. Choose OK.
      RoleSessionName
  4. Add a rule called Get AD Groups with the following parameters:
    1. For Claim rule template, choose Send Claims Using a Custom Rule.
    2. For Claim rule name, enter Get AD Groups
    3. For Custom Rule, enter the following code:
      c:[Type == "http://schemas.microsoft.com/ws/2008/06/identity/claims/windowsaccountname", Issuer == "AD AUTHORITY"] => add(store = "Active Directory", types = ("http://temp/variable"), query = ";tokenGroups;{0}", param = c.Value);

    4. Choose OK.
      Get AD Groups
  1. Add a rule called Roles with the following parameters:
    1. For Claim rule template, choose Send Claims Using a Custom Rule.
    2. For Claim rule name, enter Roles
    3. For Custom Rule, enter the following code (provide your account ID and IdP):
      c:[Type == "http://temp/variable", Value =~ "(?i)^AWS-ACCOUNTID"]=&gt; issue(Type = "https://aws.amazon.com/SAML/Attributes/Role", Value = RegExReplace(c.Value, "AWS-ACCOUNTID-", "arn:aws:iam:: ACCOUNTID:saml-provider/your-identity-provider-name,arn:aws:iam:: ACCOUNTID:role/ADFS-ACCOUNTID-"));

    4. Choose Finish.Roles

Optionally, you can create a rule called Session Duration. This configuration determines how long a session is open and active before users are required to reauthenticate. The value is in seconds. For this post, we configure the rule for 8 hours.

  1. Add a rule called Session Duration with the following parameters:
    1. For Claim rule template, choose Send Claims Using a Custom Rule.
    2. For Claim rule name, enter Session Duration.
    3. For Custom Rule, enter the following code:
      => issue(Type = "https://aws.amazon.com/SAML/Attributes/SessionDuration", Value = "28800");

    4. Choose Finish.Session Duration

You should be able to see these five claim rules, as shown in the following screenshot.
All Claims Rules

  1. Choose OK.
  2. Run the following commands in PowerShell on your AD FS server:
Set-AdfsProperties -EnableIdPInitiatedSignonPage $true

Set-AdfsProperties -EnableRelayStateForIdpInitiatedSignOn $true
  1. Stop and start the AD FS service from PowerShell:
net stop adfssrv

net start adfssrv

Configure E-mail Syncing

With QuickSight Enterprise edition integrated with an IdP, you can restrict new users from using personal email addresses. This means users can only log in to QuickSight with their on-premises configured email addresses. This approach allows users to bypass manually entering an email address. It also ensures that users can’t use an email address that might differ from the email address configured in Active Directory.

QuickSight uses the preconfigured email addresses passed through the IdP when provisioning new users to your account. For example, you can make it so that only corporate-assigned email addresses are used when users are provisioned to your QuickSight account through your IdP. When you configure email syncing for federated users in QuickSight, users who log in to your QuickSight account for the first time have preassigned email addresses. These are used to register their accounts.

To configure E-mail syncing for federated users in QuickSight, complete the following steps:

  1. Log in to your QuickSight dashboard with a QuickSight administrator account.
  2. Choose the profile icon.
    QuickSight SSO
  3. On the drop-down menu, choose on Manage QuickSight.
  4. In the navigation pane, choose Single sign-on (SSO).
  5. For Email Syncing for Federated Users, select ON, then choose Enable in the pop-up window.
  6. Choose Save.SSO Configuration

Configure the relay state URL for QuickStart

To configure the relay state URL, complete the following steps (revise the input information as needed to match your environment’s configuration):

  1. Use the ADFS RelayState Generator to generate your URL.
  2. For IDP URL String, enter https://ADFSServerEndpoint/adfs/ls/idpinitiatedsignon.aspx.
  3. For Relying Party Identifier, enter urn:amazon:webservices or https://signin.aws.amazon.com/saml.
  4. For Relay State/Target App, enter your authenticated users to access. In this case, it’s https://quicksight.aws.amazon.com.
  5. Choose Generate URL.RelayState Generator
  6. Copy the URL and load it in your browser.

You should be presented with a login to your IdP landing page.

ADFS Logon Page

Make sure the user logging in has an email address attribute configured in Active Directory. A successful login should redirect you to the QuickSight dashboard after authentication. If you’re not redirected to the QuickSight dashboard page, make sure you ran the commands listed earlier after you configured your claim rules.

Summary

In this post, we demonstrated how to configure federated identities to a QuickSight dashboard and ensure that users can only sign in with preconfigured email address in your existing Active Directory.

We’d love to hear from you. Let us know what you think in the comments section.


About the Author

Adeleke Coker is a Global Solutions Architect with AWS. He helps customers globally accelerate workload deployments and migrations at scale to AWS. In his spare time, he enjoys learning, reading, gaming and watching sport events.

How Vanguard made their technology platform resilient and efficient by building cross-Region replication for Amazon Kinesis Data Streams

Post Syndicated from Raghu Boppanna original https://aws.amazon.com/blogs/big-data/how-vanguard-made-their-technology-platform-resilient-and-efficient-by-building-cross-region-replication-for-amazon-kinesis-data-streams/

This is a guest post co-written with Raghu Boppanna from Vanguard. 

At Vanguard, the Enterprise Advice line of business improves investor outcomes through digital access to superior, personalized, and affordable financial advice. They made it possible, in part, by driving economies of scale across the globe for investors with a highly resilient and efficient technical platform. Vanguard opted for a multi-Region architecture for this workload to help protect against impairments of Regional services. For high availability purposes, there is a need to make the data used by the workload available not just in the primary Region, but also in the secondary Region with minimal replication lag. In the event of a service impairment in the primary Region, the solution should be able to fail over to the secondary Region with as little data loss as possible and the ability to resume data ingestion.

Vanguard Cloud Technology Office and AWS partnered to build an infrastructure solution on AWS that met their resilience requirements. The multi-Region solution enables a robust fail-over mechanism, with built-in observability and recovery. The solution also supports streaming data from multiple sources to different Kinesis data streams. The solution is currently being rolled out to the different lines of business teams to improve the resilience posture of their workloads.

The use case discussed here requires Change Data Capture (CDC) to stream data from a remote data source (mainframe DB2) to Amazon Kinesis Data Streams, because the business capability depends on this data. Kinesis Data Streams is a fully managed, massively scalable, durable, and low-cost streaming service that can continuously capture and stream large amounts of data from multiple sources, and makes the data available for consumption within milliseconds. The service is built to be highly resilient and uses multiple Availability Zones to process and store data.

The solution discussed in this post explains how AWS and Vanguard innovated to build a resilient architecture to meet their high availability goals.

Solution overview

The solution uses AWS Lambda to replicate data from Kinesis data streams in the primary Region to a secondary Region. In the event of any service impairment impacting the CDC pipeline, the failover process promotes the secondary Region to primary for the producers and consumers. We use Amazon DynamoDB global tables for replication checkpoints that allows to resume data streaming from the checkpoint and also maintains a primary Region configuration flag that prevents an infinite replication loop of the same data back and forth.

The solution also provides the flexibility for Kinesis Data Streams consumers to use the primary or any secondary Region within the same AWS account.

The following diagram illustrates the reference architecture.

Let’s look at each component in detail:

  1. CDC processor (producer) – In this reference architecture, the producer is deployed on Amazon Elastic Compute Cloud (Amazon EC2) in both the primary and secondary Regions, and is active in the primary Region and on standby mode in the secondary Region. It captures CDC data from the external data source (like a DB2 database as shown in the architecture above), and streams to Kinesis Data Streams in the primary Region. Vanguard uses a 3rd party tool Qlik Replicate as their CDC Processor. It produces a well-formed payload including the DB2 commit timestamp to the Kinesis data stream, in addition to the actual row data from the remote data source. (example-stream-1 in this example). The following code is a sample payload containing only the primary key of the record that changed and the commit timestamp (for simplicity, the rest of the table row data is not shown below):
    {
        "eventSource": "aws:kinesis",
        "kinesis": 
        {
             "ApproximateArrivalTimestamp": "Mon July 18 20:00:00 UTC 2022",
             "SequenceNumber": "49544985256907370027570885864065577703022652638596431874",
             "PartitionKey": "12349999",
             "KinesisSchemaVersion": "1.0",
             "Data": "eyJLZXkiOiAxMjM0OTk5OSwiQ29tbWl0VGltZXN0YW1wIjogIjIwMjItMDctMThUMjA6MDA6MDAifQ=="
        },
        "eventId": "shardId-000000000000:49629136582982516722891309362785181370337771525377097730",
        "invokeIdentityArn": "arn:aws:iam::6243876582:role/kds-crr-LambdaRole-1GZWP67437SD",
        "eventName": "aws:kinesis:record",
        "eventVersion": "1.0",
        "eventSourceARN": "arn:aws:kinesis:us-east-1:6243876582:stream/kds-stream-1/consumer/kds-crr:6243876582",
        "awsRegion": "us-east-1"
    }

    The Base64 decoded value of Data is as follows. The actual Kinesis record would contain the entire row data of the table row that changed, in addition to the primary key and the commit timestamp.

    {"Key": 12349999,"CommitTimestamp": "2022-07-18T20:00:00"}

    The CommitTimestamp in the Data field is used in the replication checkpoint and is critical to accurately track how much of the stream data has been replicated to the secondary Region. The checkpoint can then be used to facilitate a CDC processor (producer) failover and accurately resume producing data from the replication checkpoint timestamp onwards.

    The alternative to using a remote data source CommitTimestamp (if unavailable) is to use the ApproximateArrivalTimestamp (which is the timestamp when the record is actually written to the data stream).

  2. Cross-Region replication Lambda function – The function is deployed to both primary and secondary Regions. It’s set up with an event source mapping to the data stream containing CDC data. The same function can be used to replicate data of multiple streams. It’s invoked with a batch of records from Kinesis Data Streams and replicates the batch to a target replication Region (which is provided via the Lambda configuration environment). For cost considerations, if the CDC data is actively produced into the primary Region only, the reserved concurrency of the function in the secondary Region can be set to zero, and modified during regional failover. The function has AWS Identity and Access Management (IAM) role permissions to do the following:
    • Read and write to the DynamoDB global tables used in this solution, within the same account.
    • Read and write to Kinesis Data Streams in both Regions within the same account.
    • Publish custom metrics to Amazon CloudWatch in both Regions within the same account.
  3. Replication checkpoint – The replication checkpoint uses the DynamoDB global table in both the primary and secondary Regions. It’s used by the cross-Region replication Lambda function to persist the commit timestamp of the last replication record as the replication checkpoint for every stream that is configured for replication. For this post, we create and use a global table called kdsReplicationCheckpoint.
  4. Active Region config – The active Region uses the DynamoDB global table in both primary and secondary Regions. It uses the native cross-Region replication capability of the global table to replicate the configuration. It’s pre-populated with data about which is the primary Region for a stream, to prevent replication back to the primary Region by the Lambda function in the standby Region. This configuration may not be required if the Lambda function in the standby Region has a reserved concurrency set to zero, but can serve as a safety check to avoid infinite replication loop of the data. For this post, we create a global table called kdsActiveRegionConfig and put an item with the following data:
    {
     "stream-name": "example-stream-1",
     "active-region" : "us-east-1"
    }
    
  5. Kinesis Data Streams – The stream to which the CDC processor produces the data. For this post, we use a stream called example-stream-1 in both the Regions, with the same shard configuration and access policies.

Sequence of steps in cross-Region replication

Let’s briefly look at how the architecture is exercised using the following sequence diagram.

The sequence consists of the following steps:

  1. The CDC processor (in us-east-1) reads the CDC data from the remote data source.
  2. The CDC processor (in us-east-1) streams the CDC data to Kinesis Data Streams (in us-east-1).
  3. The cross-Region replication Lambda function (in us-east-1) consumes the data from the data stream (in us-east-1). The enhanced fan-out pattern is recommended for dedicated and increased throughput for cross-Region replication.
  4. The replicator Lambda function (in us-east-1) validates its current Region with the active Region configuration for the stream being consumed, with the help of the kdsActiveRegionConfig DynamoDB global tableThe following sample code (in Java) can help illustrate the condition being evaluated:
    // Fetch the current AWS Region from the Lambda function’s environment
    String currentAWSRegion = System.getenv(“AWS_REGION”);
    // Read the stream name from the first Kinesis Record once for the entire batch being processed. This is done because we are reusing the same Lambda function for replicating multiple streams.
    String currentStreamNameConsumed = kinesisRecord.getEventSourceARN().split(“:”)[5].split(“/”)[1];
    // Build the DynamoDB query condition using the stream name
    Map<String, Condition> keyConditions = singletonMap(“streamName”, Condition.builder().comparisonOperator(EQ).attributeValueList(AttributeValue.builder().s(currentStreamNameConsumed).build()).build());
    // Query the DynamoDB Global Table
    QueryResponse queryResponse = ddbClient.query(QueryRequest.builder().tableName("kdsActiveRegionConfig").keyConditions(keyConditions).attributesToGet(“ActiveRegion”).build());
  5. The function evaluates the response from DynamoDB with the following code:
    // Evaluate the response
    if (queryResponse.hasItems()) {
           AttributeValue activeRegionForStream = queryResponse.items().get(0).get(“ActiveRegion”);
           return currentAWSRegion.equalsIgnoreCase(activeRegionForStream.s());
    }
  6. Depending on the response, the function takes the following actions:
    1. If the response is true, the replicator function produces the records to Kinesis Data Streams in us-east-2 in a sequential manner.
      • If there is a failure, the sequence number of the record is tracked and the iteration is broken. The function returns the list of failed sequence numbers. By returning the failed sequence number, the solution uses the feature of Lambda checkpointing to be able to resume processing of a batch of records with partial failures. This is useful when handling any service impairments, where the function tries to replicate the data across Regions to ensure stream parity and no data loss.
      • If there are no failures, an empty list is returned, which indicates the batch was successful.
    2. If the response is false, the replicator function returns without performing any replication. To reduce the cost of the Lambda invocations, you can set the reserved concurrency of the function in the DR Region (us-east-2) to zero. This will prevent the function from being invoked. When you failover, you can update this value to an appropriate number based on the CDC throughput and set the reserved concurrency of the function in us-east-1 to zero to prevent it from executing unnecessarily.
  7. After all the records are produced to Kinesis Data Streams in us-east-2, the replicator function checkpoints to the kdsReplicationCheckpoint DynamoDB global table (in us-east-1) with the following data:
    { "streamName": "example-stream-1", "lastReplicatedTimestamp": "2022-07-18T20:00:00" }
    
  8. The function returns after successfully processing the batch of records.

Performance considerations

The performance expectations of the solution should be understood with respect to the following factors:

  • Region selection – The replication latency is directly proportional to the distance being traveled by the data, so understand your Region selection
  • Velocity – The incoming velocity of the data or the volume of data being replicated
  • Payload size – The size of the payload being replicated

Monitor the Cross-Region replication

It’s recommended to track and observe the replication as it happens. You can tailor the Lambda function to publish custom metrics to CloudWatch with the following metrics at the end of every invocation. Publishing these metrics to both the primary and secondary Regions helps protect yourself from impairments affecting observability in the primary Region.

  • Throughput – The current Lambda invocation batch size
  • ReplicationLagSeconds – The difference between the current timestamp (after processing all the records) and the ApproximateArrivalTimestamp of the last record that was replicated

The following example CloudWatch metric graph shows the average replication lag was 2 seconds with a throughput of 100 records replicated from us-east-1 to us-east-2.

Common failover strategy

During any impairments impacting the CDC pipeline in the primary Region, business continuity or disaster recovery needs may dictate a pipeline failover to the secondary (standby) Region. This means a couple of things need to be done as part of this failover process:

  • If possible, stop all the CDC tasks in the CDC processor tool in us-east-1.
  • The CDC processor must be failed over to the secondary Region, so that it can read the CDC data from the remote data source while operating out of the standby Region.
  • The kdsActiveRegionConfig DynamoDB global table needs to be updated. For instance, for the stream example-stream-1 used in our example, the active Region is changed to us-east-2:
{
"stream-name": "example-stream-1",
"active-Region" : "us-east-2"
}
  • All the stream checkpoints need to be read from the kdsReplicationCheckpoint DynamoDB global table (in us-east-2), and the timestamps from each of the checkpoints are used to start the CDC tasks in the producer tool in us-east-2 Region. This minimizes the chances of data loss and accurately resumes streaming the CDC data from the remote data source from the checkpoint timestamp onwards.
  • If using reserved concurrency to control Lambda invocations, set the value to zero in the primary Region(us-east-1) and to a suitable non-zero value in the secondary Region(us-east-2).

Vanguard’s multi-step failover strategy

Some of the third-party tools that Vanguard uses have a two-step CDC process of streaming data from a remote data source to a destination. Vanguard’s tool of choice for their CDC processor follows this two-step approach:

  1. The first step involves setting up a log stream task that reads the data from the remote data source and persists in a staging location.
  2. The second step involves setting up individual consumer tasks that read data from the staging location—which could be on Amazon Elastic File System (Amazon EFS) or Amazon FSx, for example—and stream it to the destination. The flexibility here is that each of these consumer tasks can be triggered to stream from different commit timestamps. The log stream task usually starts reading data from the minimum of all the commit timestamps used by the consumer tasks.

Let’s look at an example to explain the scenario:

  • Consumer task A is streaming data from a commit timestamp 2022-07-19T20:00:00 onwards to example-stream-1.
  • Consumer task B is streaming data from a commit timestamp 2022-07-19T21:00:00 onwards to example-stream-2.
  • In this situation, the log stream should read data from the remote data source from the minimum of the timestamps used by the consumer tasks, which is 2022-07-19T20:00:00.

The following sequence diagram demonstrates the exact steps to run during a failover to us-east-2 (the standby Region).

The steps are as follows:

  1. The failover process is triggered in the standby Region (us-east-2 in this example) when required. Note that the trigger can be automated using comprehensive health checks of the pipeline in the primary Region.
  2. The failover process updates the kdsActiveRegionConfig DynamoDB global table with the new value for the Region as us-east-2 for all the stream names.
  3. The next step is to fetch all the stream checkpoints from the kdsReplicationCheckpoint DynamoDB global table (in us-east-2).
  4. After the checkpoint information is read, the failover process finds the minimum of all the lastReplicatedTimestamp.
  5. The log stream task in the CDC processor tool is started in us-east-2 with the timestamp found in Step 4. It begins reading CDC data from the remote data source from this timestamp onwards and persists them in the staging location on AWS.
  6. The next step is to start all the consumer tasks to read data from the staging location and stream to the destination data stream. This is where each consumer task is supplied with the appropriate timestamp from the kdsReplicationCheckpoint table according to the streamName to which the task streams the data.

After all the consumer tasks are started, data is produced to the Kinesis data streams in us-east-2. From there on, the process of cross-Region replication is the same as described earlier – the replication Lambda function in us-east-2 starts replicating data to the data stream in us-east-1.

The consumer applications reading data from the streams are expected to be idempotent to be able to handle duplicates. Duplicates can be introduced in the stream due to many reasons, some of which are called out below.

  • The Producer or the CDC Processor introduces duplicates into the stream while replaying the CDC data during a failover
  • DynamoDB Global Table uses asynchronous replication of data across Regions and if the kdsReplicationCheckpoint table data has a replication lag, the failover process may potentially use an older checkpoint timestamp to replay the CDC data.

Also, consumer applications should checkpoint the CommitTimestamp of the last record that was consumed. This is to facilitate better monitoring and recovery.

Path to maturity: Automated recovery

The ideal state is to fully automate the failover process, reducing time to recover and meeting the resilience Service Level Objective (SLO). However, in most organizations, the decision to fail over, fail back, and trigger the failover requires manual intervention in assessing the situation and deciding the outcome. Creating scripted automation to perform the failover that can be run by a human is a good place to start.

Vanguard has automated all of the steps of failover, but still have humans make the decision on when to invoke it. You can customize the solution to meet your needs and depending on the CDC processor tool you use in your environment.

Conclusion

In this post, we described how Vanguard innovated and built a solution for replicating data across Regions in Kinesis Data Streams to make the data highly available. We also demonstrated a robust checkpoint strategy to facilitate a Regional failover of the replication process when needed. The solution also illustrated how to use DynamoDB global tables for tracking the replication checkpoints and configuration. With this architecture, Vanguard was able to deploy workloads depending on the CDC data to multiple Regions to meet business needs of high availability in the face of service impairments impacting CDC pipelines in the primary Region.

If you have any feedback please leave a comment in the Comments section below.


About the authors

Raghu Boppanna works as an Enterprise Architect at Vanguard’s Chief Technology Office. Raghu specializes in Data Analytics, Data Migration/Replication including CDC Pipelines, Disaster Recovery and Databases. He has earned several AWS Certifications including AWS Certified Security – Specialty & AWS Certified Data Analytics – Specialty.

Parameswaran V Vaidyanathan is a Senior Cloud Resilience Architect with Amazon Web Services. He helps large enterprises achieve the business goals by architecting and building scalable and resilient solutions on the AWS Cloud.

Richa Kaul is a Senior Leader in Customer Solutions serving Financial Services customers. She is based out of New York. She has extensive experience in large scale cloud transformation, employee excellence, and next generation digital solutions. She and her team focus on optimizing value of cloud by building performant, resilient and agile solutions. Richa enjoys multi sports like triathlons, music, and learning about new technologies.

Mithil Prasad is a Principal Customer Solutions Manager with Amazon Web Services. In his role, Mithil works with Customers to drive cloud value realization, provide thought leadership to help businesses achieve speed, agility, and innovation.

Control access to Amazon OpenSearch Service Dashboards with attribute-based role mappings

Post Syndicated from Stefan Appel original https://aws.amazon.com/blogs/big-data/control-access-to-amazon-opensearch-service-dashboards-with-attribute-based-role-mappings/

Federated users of Amazon OpenSearch Service often need access to OpenSearch Dashboards with roles based on their user profiles. OpenSearch Service fine-grained access control maps authenticated users to OpenSearch Search roles and then evaluates permissions to determine how to handle the user’s actions. However, when an enterprise-wide identity provider (IdP) manages the users, the mapping of users to OpenSearch Service roles often needs to happen dynamically based on IdP user attributes. One option to map users is to use OpenSearch Service SAML integration and pass user group information to OpenSearch Service. Another option is Amazon Cognito role-based access control, which supports rule-based or token-based mappings. But neither approach supports arbitrary role mapping logic. For example, when you need to interpret multivalued user attributes to identify a target role.

This post shows how you can implement custom role mappings with an Amazon Cognito pre-token generation AWS Lambda trigger. For our example, we use a multivalued attribute provided over OpenID Connect (OIDC) to Amazon Cognito. We show how you are in full control of the mapping logic and process of such a multivalued attribute for AWS Identity and Access Management (IAM) role lookups. Our approach is generic for OIDC-compatible IdPs. To make this post self-contained, we use the Okta IdP as an example to walk through the setup.

Overview of solution

The provided solution intercepts the OICD-based login process to OpenSearch Dashboards with a pre-token generation Lambda function. The login to OpenSearch Dashboards with a third-party IdP and Amazon Cognito as an intermediary consists of several steps:

  1. First, the initial user request to OpenSearch Dashboard is redirected to Amazon Cognito.
  2. Amazon Cognito redirects the request to the IdP for authentication.
  3. After the user authenticates, the IdP sends the identity token (ID token) back to Amazon Cognito.
  4. Amazon Cognito invokes a Lambda function that modifies the obtained token. We use an Amazon DynamoDB table to perform role mapping lookups. The modified token now contains the IAM role mapping information.
  5. Amazon Cognito uses this role mapping information to map the user to the specified IAM role and provides the role credentials.
  6. OpenSearch Service maps the IAM role credentials to OpenSearch roles and applies fine-grained permission checks.

The following architecture outlines the login flow from a user’s perspective.

Scope of solution

On the backend, OpenSearch Dashboards integrates with an Amazon Cognito user pool and an Amazon Cognito identity pool during the authentication flow. The steps are as follows:

  1. Authenticate and get tokens.
  2. Look up the token attribute and IAM role mapping and overwrite the Amazon Cognito attribute.
  3. Exchange tokens for AWS credentials used by OpenSearch dashboards.

The following architecture shows this backend perspective to the authentication process.

Backend authentication flow

In the remainder of this post, we walk through the configurations necessary for an authentication flow in which a Lambda function implements custom role mapping logic. We provide sample Lambda code for the mapping of multivalued OIDC attributes to IAM roles based on a DynamoDB lookup table with the following structure.

OIDC Attribute Value IAM Role
["attribute_a","attribute_b"] arn:aws:iam::<aws-account-id>:role/<role-name-01>
["attribute_a","attribute_x"] arn:aws:iam::<aws-account-id>:role/<role-name-02>

The high-level steps of the solution presented in this post are as follows:

  1. Configure Amazon Cognito authentication for OpenSearch Dashboards.
  2. Add IAM roles for mappings to OpenSearch Service roles.
  3. Configure the Okta IdP.
  4. Add a third-party OIDC IdP to the Amazon Cognito user pool.
  5. Map IAM roles to OpenSearch Service roles.
  6. Create the DynamoDB attribute-role mapping table.
  7. Deploy and configure the pre-token generation Lambda function.
  8. Configure the pre-token generation Lambda trigger.
  9. Test the login to OpenSearch Dashboards.

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account with an OpenSearch Service domain.
  • A third-party IdP that supports OpenID Connect and adds a multivalued attribute in the authorization token. For this post, we use attributes_array as this attribute’s name and Okta as an IdP provider. You can create an Okta Developer Edition free account to test the setup.

Configure Amazon Cognito authentication for OpenSearch Dashboards

The modification of authentication tokens requires you to configure the OpenSearch Service domain to use Amazon Cognito for authentication. For instructions, refer to Configuring Amazon Cognito authentication for OpenSearch Dashboards.

The Lambda function implements custom role mappings by setting the cognito:preferred_role claim (for more information, refer to Role-based access control). For the correct interpretation of this claim, set the Amazon Cognito identity pool to Choose role from token. The Amazon Cognito identity pool then uses the value of the cognito:preferred_role claim to select the correct IAM role. The following screenshot shows the required settings in the Amazon Cognito identity pool that is created during the configuration of Amazon Cognito authentication for OpenSearch Service.

Cognito role mapping configuration

Add IAM roles for mappings to OpenSearch roles

IAM roles used for mappings to OpenSearch roles require a trust policy so that authenticated users can assume them. The trust policy needs to reference the Amazon Cognito identity pool created during the configuration of Amazon Cognito authentication for OpenSearch Service. Create at least one IAM role with a custom trust policy. For instructions, refer to Creating a role using custom trust policies. The IAM role doesn’t require the attachment of a permission policy. For a sample trust policy, refer to Role-based access control.

Configure the Okta IdP

In this section, we describe the configuration steps to include a multivalued attribute_array attribute in the token provided by Okta. For more information, refer to Customize tokens returned from Okta with custom claims. We use the Okta UI to perform the configurations. Okta also provides an API that you can use to script and automate the setup.

The first step is adding the attributes_array attribute to the Okta user profile.

  1. Use Okta’s Profile Editor under Directory, Profile Editor.
  2. Select User (default) and then choose Add Attribute.
  3. Add an attribute with a display name and variable name attributes_array of type string array.

The following screenshot shows the Okta default user profile after the custom attribute has been added.

Okta user profile editor

  1. Next, add attributes_array attribute values to users using Okta’s user management interface under Directory, People.
  2. Select a user and choose Profile.
  3. Choose Edit and enter attribute values.

The following screenshot shows an example of attributes_array attribute values within a user profile.

Okta user attributes array

The next step is adding the attributes_array attribute to the ID token that is generated during the authentication process.

  1. On the Okta console, choose Security, API and select the default authorization server.
  2. Choose Claims and choose Add Claim to add the attributes_array attribute as part of the ID token.
  3. As the scope, enter openid and as the attribute value, enter user.attributes_array.

This references the previously created attribute in a user’s profile.

Add claim to ID token

  1. Next, create an application for the federation with Amazon Cognito. For instructions, refer to How do I set up Okta as an OpenID Connect identity provider in an Amazon Cognito user pool.

The last step assigns the Okta application to Okta users.

  1. Navigate to Directory, People, select a user, and choose Assign Applications.
  2. Select the application you created in the previous step.

Add a third-party OIDC IdP to the Amazon Cognito user pool

We are implementing the role mapping based on the information provided in a multivalued OIDC attribute. The authentication token needs to include this attribute. If you followed the previously described Okta configuration, the attribute is automatically added to the ID token of a user. If you used another IdP, you might have to request the attribute explicitly. For this, add the attribute name to the Authorized scopes list of the IdP in Amazon Cognito.

For instructions on how to set up the federation between a third-party IdP and an Amazon Cognito user pool and how to request additional attributes, refer to Adding OIDC identity providers to a user pool. For a detailed walkthrough for Okta, refer to How do I set up Okta as an OpenID Connect identity provider in an Amazon Cognito user pool.

After requesting the token via OIDC, you need to map the attribute to an Amazon Cognito user pool attribute. For instructions, refer to Specifying identity provider attribute mappings for your user pool. The following screenshot shows the resulting configuration on the Amazon Cognito console.

Amazon Cognito user pool attribute mapping

Map IAM roles to OpenSearch Service roles

Upon login, OpenSearch Service maps users to an OpenSearch Service role based on the IAM role ARN set in the cognito:preferred_role claim by the pre-token generation Lambda trigger. This requires a role mapping in OpenSearch Service. To add such role mappings to IAM backend roles, refer to Mapping roles to users. The following screenshot shows a role mapping on the OpenSearch Dashboards console.

Amazon OpenSearch Service role mappings

Create the attribute-role mapping table

For this solution, we use DynamoDB to store mappings of users to IAM roles. For instructions, refer to Create a table and define a partition key named Key of type String. You need the table name in the subsequent step to configure the Lambda function.

The next step is writing the mapping information into the table. A mapping entry consists of the following attributes:

  • Key – A string that contains attribute values in comma-separated alphabetical order
  • RoleArn – A string with the IAM role ARN to which the attribute value combination should be mapped

For details on how to add data to a DynamoDB table, refer to Write data to a table using the console or AWS CLI.

For example, if the previously configured OIDC attribute attributes_array contains three values, attribute_a, attribute_b, and attribute_c, the entry in the mapping table looks like table line 1 in the following screenshot.

Amazon DynamoDB table with attribute-role mappings

Deploy and configure the pre-token generation Lambda function

A Lambda function implements the custom role mapping logic. The Lambda function receives an Amazon Cognito event as input and extracts attribute information out of it. It uses the attribute information for a lookup in a DynamoDB table and retrieves the value for cognito:preferred_role. Follow the steps in Getting started with Lambda to create a Node.js Lambda function and insert the following source code:

const AWS = require("aws-sdk");
const tableName = process.env.TABLE_NAME;
const unauthorizedRoleArn = process.env.UNAUTHORIZED_ROLE;
const userAttributeArrayName = process.env.USER_POOL_ATTRIBUTE;
const dynamodbClient = new AWS.DynamoDB({apiVersion: "2012-08-10"});
exports.lambdaHandler = handlePreTokenGenerationEvent

async function handlePreTokenGenerationEvent (event, context) {
    var sortedAttributeList = getSortedAttributeList(event);
    var lookupKey = sortedAttributeList.join(',');
    var roleArn = await lookupIAMRoleArn(lookupKey);
    appendResponseWithPreferredRole(event, roleArn);
    return event;
}

function getSortedAttributeList(event) {
    return JSON.parse(event['request']['userAttributes'][userAttributeArrayName]).sort();
}

async function lookupIAMRoleArn(key) {
    var params = {
        TableName: tableName,
        Key: {
          'Key': {S: key}
        },
        ProjectionExpression: 'RoleArn'
      };
    try {
        let item = await dynamodbClient.getItem(params).promise();
        return item['Item']['RoleArn']['S'];
    } catch (e){
        console.log(e);
        return unauthorizedRoleArn; 
    }
}

function appendResponseWithPreferredRole(event, roleArn){
    event.response = {
        'claimsOverrideDetails': {
            'groupOverrideDetails': {
                'preferredRole': roleArn
            }
        }
    };
}

The Lambda function expects three environment variables. Refer to Using AWS Lambda environment variables for instructions to add the following entries:

  • TABLE_NAME – The name of the previously created DynamoDB table. This table is used for the lookups.
  • UNAUTHORIZED_ROLE – The ARN of the IAM role that is used when no mapping is found in the lookup table.
  • USER_POOL_ATTRIBUTE – The Amazon Cognito user pool attribute used for the IAM role lookup. In our example, this attribute is named custom:attributes_array.

The following screenshot shows the final configuration.

AWS Lamba function configuration

The Lambda function needs permissions to access the DynamoDB lookup table. Set permissions as follows: attach the following policy to the Lambda execution role (for instructions, refer to Lambda execution role) and provide the Region, AWS account number, and DynamoDB table name:

{
    "Statement": [
        {
            "Action": [
                "dynamodb:GetItem",
                "dynamodb:Scan",
                "dynamodb:Query",
                "dynamodb:BatchGetItem",
                "dynamodb:DescribeTable"
            ],
            "Resource": [
                "arn:aws:dynamodb:<region>:<accountid>:table/<table>",
                "arn:aws:dynamodb:<region>:<accountid>:table/<table>/index/*"
            ],
            "Effect": "Allow"
        }
    ]
}

The configuration of the Lambda function is now complete.

Configure the pre-token generation Lambda trigger

As final step, add a pre-token generation trigger to the Amazon Cognito user pool and reference the newly created Lambda function. For details, refer to Customizing user pool workflows with Lambda triggers. The following screenshot shows the configuration.

Amazon Cognito pre-token generation trigger configuration

This step completes the setup; Amazon Cognito now maps users to OpenSearch Service roles based on the values provided in an OIDC attribute.

Test the login to OpenSearch Dashboards

The following diagram shows an exemplary login flow and the corresponding screenshots for an Okta user user1 with a user profile attribute attribute_array and value: ["attribute_a", "attribute_b", "attribute_c"].

Testing of solution

Clean up

To avoid incurring future charges, delete the OpenSearch Service domain, Amazon Cognito user pool and identity pool, Lambda function, and DynamoDB table created as part of this post.

Conclusion

In this post, we demonstrated how to set up a custom mapping to OpenSearch Service roles using values provided via an OIDC attribute. We dynamically set the cognito:preferred_role claim using an Amazon Cognito pre-token generation Lambda trigger and a DynamoDB table for lookup. The solution is capable of handling dynamic multivalued user attributes, but you can extend it with further application logic that goes beyond a simple lookup. The steps in this post are a proof of concept. If you plan to develop this into a productive solution, we recommend implementing Okta and AWS security best practices.

The post highlights just one use case of how you can use Amazon Cognito support for Lambda triggers to implement custom authentication needs. If you’re interested in further details, refer to How to Use Cognito Pre-Token Generation trigger to Customize Claims In ID Tokens.


About the Authors

Portrait StefanStefan Appel is a Senior Solutions Architect at AWS. For 10+ years, he supports enterprise customers adopt cloud technologies. Before joining AWS, Stefan held positions in software architecture, product management, and IT operations departments. He began his career in research on event-based systems. In his spare time, he enjoys hiking and has walked the length of New Zealand following Te Araroa.

Portrait ModoodModood Alvi is Senior Solutions Architect at Amazon Web Services (AWS). Modood is passionate about digital transformation and is committed helping large enterprise customers across the globe accelerate their adoption of and migration to the cloud. Modood brings more than a decade of experience in software development, having held various technical roles within companies like SAP and Porsche Digital. Modood earned his Diploma in Computer Science from the University of Stuttgart.

Improve collaboration between teams by using AWS CDK constructs

Post Syndicated from Joerg Woehrle original https://aws.amazon.com/blogs/devops/improve-collaboration-between-teams-by-using-aws-cdk-constructs/

There are different ways to organize teams to deliver great software products. There are companies that give the end-to-end responsibility for a product to a single team, like Amazon’s Two-Pizza teams, and there are companies where multiple teams split the responsibility between infrastructure (or platform) teams and application development teams. This post provides guidance on how collaboration efficiency can be improved in the case of a split-team approach with the help of the AWS Cloud Development Kit (CDK).

The AWS CDK is an open-source software development framework to define your cloud application resources. You do this by using familiar programming languages like TypeScript, Python, Java, C# or Go. It allows you to mix code to define your application’s infrastructure, traditionally expressed through infrastructure as code tools like AWS CloudFormation or HashiCorp Terraform, with code to bundle, compile, and package your application.

This is great for autonomous teams with end-to-end responsibility, as it helps them to keep all code related to that product in a single place and single programming language. There is no need to separate application code into a different repository than infrastructure code with a single team, but what about the split-team model?

Larger enterprises commonly split the responsibility between infrastructure (or platform) teams and application development teams. We’ll see how to use the AWS CDK to ensure team independence and agility even with multiple teams involved. We’ll have a look at the different responsibilities of the participating teams and their produced artifacts, and we’ll also discuss how to make the teams work together in a frictionless way.

This blog post assumes a basic level of knowledge on the AWS CDK and its concepts. Additionally, a very high level understanding of event driven architectures is required.

Team Topologies

Let’s first have a quick look at the different team topologies and each team’s responsibilities.

One-Team Approach

In this blog post we will focus on the split-team approach described below. However, it’s still helpful to understand what we mean by “One-Team” Approach: A single team owns an application from end-to-end. This cross-functional team decides on its own on the features to implement next, which technologies to use and how to build and deploy the resulting infrastructure and application code. The team’s responsibility is infrastructure, application code, its deployment and operations of the developed service.

If you’re interested in how to structure your AWS CDK application in a such an environment have a look at our colleague Alex Pulver’s blog post Recommended AWS CDK project structure for Python applications.

Split-Team Approach

In reality we see many customers who have separate teams for application development and infrastructure development and deployment.

Infrastructure Team

What I call the infrastructure team is also known as the platform or operations team. It configures, deploys, and operates the shared infrastructure which other teams consume to run their applications on. This can be things like an Amazon SQS queue, an Amazon Elastic Container Service (Amazon ECS) cluster as well as the CI/CD pipelines used to bring new versions of the applications into production.
It is the infrastructure team’s responsibility to get the application package developed by the Application Team deployed and running on AWS, as well as provide operational support for the application.

Application Team

Traditionally the application team just provides the application’s package (for example, a JAR file or an npm package) and it’s the infrastructure team’s responsibility to figure out how to deploy, configure, and run it on AWS. However, this traditional setup often leads to bottlenecks, as the infrastructure team will have to support many different applications developed by multiple teams. Additionally, the infrastructure team often has little knowledge of the internals of those applications. This often leads to solutions which are not optimized for the problem at hand: If the infrastructure team only offers a handful of options to run services on, the application team can’t use options optimized for their workload.

This is why we extend the traditional responsibilities of the application team in this blog post. The team provides the application and additionally the description of the infrastructure required to run the application. With “infrastructure required” we mean the AWS services used to run the application. This infrastructure description needs to be written in a format which can be consumed by the infrastructure team.

While we understand that this shift of responsibility adds additional tasks to the application team, we think that in the long term it is worth the effort. This can be the starting point to introduce DevOps concepts into the organization. However, the concepts described in this blog post are still valid even if you decide that you don’t want to add this responsibility to your application teams. The boundary of who is delivering what would then just move more into the direction of the infrastructure team.

To be successful with the given approach, the two teams need to agree on a common format on how to hand over the application, its infrastructure definition, and how to bring it to production. The AWS CDK with its concept of Constructs provides a perfect means for that.

Primer: AWS CDK Constructs

In this section we take a look at the concepts the AWS CDK provides for structuring our code base and how these concepts can be used to fit a CDK project into your team topology.

Constructs

Constructs are the basic building block of an AWS CDK application. An AWS CDK application is composed of multiple constructs which in the end define how and what is deployed by AWS CloudFormation.

The AWS CDK ships with constructs created to deploy AWS services. However, it is important to understand that you are not limited to the out-of-the-box constructs provided by the AWS CDK. The true power of AWS CDK is the possibility to create your own abstractions on top of the default constructs to create solutions for your specific requirement. To achieve this you write, publish, and consume your own, custom constructs. They codify your specific requirements, create an additional level of abstraction and allow other teams to consume and use your construct.

We will use a custom construct to separate the responsibilities between the the application and the infrastructure team. The application team will release a construct which describes the infrastructure along with its configuration required to run the application code. The infrastructure team will consume this construct to deploy and operate the workload on AWS.

How to use the AWS CDK in a Split-Team Setup

Let’s now have a look at how we can use the AWS CDK to split the responsibilities between the application and infrastructure team. I’ll introduce a sample scenario and then illustrate what each team’s responsibility is within this scenario.

Scenario

Our fictitious application development team writes an AWS Lambda function which gets deployed to AWS. Messages in an Amazon SQS queue will invoke the function. Let’s say the function will process orders (whatever this means in detail is irrelevant for the example) and each order is represented by a message in the queue.

The application development team has full flexibility when it comes to creating the AWS Lambda function. They can decide which runtime to use or how much memory to configure. The SQS queue which the function will act upon is created by the infrastructure team. The application team does not have to know how the messages end up in the queue.

With that we can have a look at a sample implementation split between the teams.

Application Team

The application team is responsible for two distinct artifacts: the application code (for example, a Java jar file or an npm module) and the AWS CDK construct used to deploy the required infrastructure on AWS to run the application (an AWS Lambda Function along with its configuration).

The lifecycles of these artifacts differ: the application code changes more frequently than the infrastructure it runs in. That’s why we want to keep the artifacts separate. With that each of the artifacts can be released at its own pace and only if it was changed.

In order to achieve these separate lifecycles, it is important to notice that a release of the application artifact needs to be completely independent from the release of the CDK construct. This fits our approach of separate teams compared to the standard CDK way of building and packaging application code within the CDK construct.

But how will this be done in our example solution? The team will build and publish an application artifact which does not contain anything related to CDK.
When a CDK Stack with this construct is synthesized it will download the pre-built artifact with a given version number from AWS CodeArtifact and use it to create the input zip file for a Lambda function. There is no build of the application package happening during the CDK synth.

With the separation of construct and application code, we need to find a way to tell the CDK construct which specific version of the application code it should fetch from CodeArtifact. We will pass this information to the construct via a property of its constructor.

For dependencies on infrastructure outside of the responsibility of the application team, I follow the pattern of dependency injection. Those dependencies, for example a shared VPC or an Amazon SQS queue, are passed into the construct from the infrastructure team.

Let’s have a look at an example. We pass in the external dependency on an SQS Queue, along with details on the desired appPackageVersion and its CodeArtifact details:

export interface OrderProcessingAppConstructProps {
    queue: aws_sqs.Queue,
    appPackageVersion: string,
    codeArtifactDetails: {
        account: string,
        repository: string,
        domain: string
    }
}

export class OrderProcessingAppConstruct extends Construct {

    constructor(scope: Construct, id: string, props: OrderProcessingAppConstructProps) {
        super(scope, id);

        const lambdaFunction = new lambda.Function(this, 'OrderProcessingLambda', {
            code: lambda.Code.fromDockerBuild(path.join(__dirname, '..', 'bundling'), {
                buildArgs: {
                    'PACKAGE_VERSION' : props.appPackageVersion,
                    'CODE_ARTIFACT_ACCOUNT' : props.codeArtifactDetails.account,
                    'CODE_ARTIFACT_REPOSITORY' : props.codeArtifactDetails.repository,
                    'CODE_ARTIFACT_DOMAIN' : props.codeArtifactDetails.domain
                }
            }),
            runtime: lambda.Runtime.NODEJS_16_X,
            handler: 'node_modules/order-processing-app/dist/index.lambdaHandler'
        });
        const eventSource = new SqsEventSource(props.queue);
        lambdaFunction.addEventSource(eventSource);
    }
}

Note the code lambda.Code.fromDockerBuild(...): We use AWS CDK’s functionality to bundle the code of our Lambda function via a Docker build. The only things which happen inside of the provided Dockerfile are:

  • the login into the AWS CodeArtifact repository which holds the pre-built application code’s package
  • the download and installation of the application code’s artifact from AWS CodeArtifact (in this case via npm)

If you are interested in more details on how you can build, bundle and deploy your AWS CDK assets I highly recommend a blog post by my colleague Cory Hall: Building, bundling, and deploying applications with the AWS CDK. It goes into much more detail than what we are covering here.

Looking at the example Dockerfile we can see the two steps described above:

FROM public.ecr.aws/sam/build-nodejs16.x:latest

ARG PACKAGE_VERSION
ARG CODE_ARTIFACT_AWS_REGION
ARG CODE_ARTIFACT_ACCOUNT
ARG CODE_ARTIFACT_REPOSITORY

RUN aws codeartifact login --tool npm --repository $CODE_ARTIFACT_REPOSITORY --domain $CODE_ARTIFACT_DOMAIN --domain-owner $CODE_ARTIFACT_ACCOUNT --region $CODE_ARTIFACT_AWS_REGION
RUN npm install order-processing-app@$PACKAGE_VERSION --prefix /asset

Please note the following:

  • we use --prefix /asset with our npm install command. This tells npm to install the dependencies into the folder which CDK will mount into the container. All files which should go into the output of the docker build need to be placed here.
  • the aws codeartifact login command requires credentials with the appropriate permissions to proceed. In case you run this on for example AWS CodeBuild or inside of a CDK Pipeline you need to make sure that the used role has the appropriate policies attached.

Infrastructure Team

The infrastructure team consumes the AWS CDK construct published by the application team. They own the AWS CDK Stack which composes the whole application. Possibly this will only be one of several Stacks owned by the Infrastructure team. Other Stacks might create shared infrastructure (like VPCs, networking) and other applications.

Within the stack for our application the infrastructure team consumes and instantiates the application team’s construct, passes any dependencies into it and then deploys the stack by whatever means they see fit (e.g. through AWS CodePipeline, GitHub Actions or any other form of continuous delivery/deployment).

The dependency on the application team’s construct is manifested in the package.json of the infrastructure team’s CDK app:

{
  "name": "order-processing-infra-app",
  ...
  "dependencies": {
    ...
    "order-app-construct" : "1.1.0",
    ...
  }
  ...
}

Within the created CDK Stack we see the dependency version for the application package as well as how the infrastructure team passes in additional information (like e.g. the queue to use):

export class OrderProcessingInfraStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);   

    const orderProcessingQueue = new Queue(this, 'order-processing-queue');

    new OrderProcessingAppConstruct(this, 'order-processing-app', {
       appPackageVersion: "2.0.36",
       queue: orderProcessingQueue,
       codeArtifactDetails: { ... }
     });
  }
}

Propagating New Releases

We now have the responsibilities of each team sorted out along with the artifacts owned by each team. But how do we propagate a change done by the application team all the way to production? Or asked differently: how can we invoke the infrastructure team’s CI/CD pipeline with the updated artifact versions of the application team?

We will need to update the infrastructure team’s dependencies on the application teams artifacts whenever a new version of either the application package or the AWS CDK construct is published. With the dependencies updated we can then start the release pipeline.

One approach is to listen and react to events published by AWS CodeArtifact via Amazon EventBridge. On each release AWS CodeArtifact will publish an event to Amazon EventBridge. We can listen to that event, extract the version number of the new release from its payload and start a workflow to update either our dependency on the CDK construct (e.g. in the package.json of our CDK application) or a update the appPackageVersion which the infrastructure team passes into the consumed construct.

Here’s how a release of a new app version flows through the system:

A release of the application package triggers a change and deployment of the infrastructure team's CDK Stack

Figure 1 – A release of the application package triggers a change and deployment of the infrastructure team’s CDK Stack

  1. The application team publishes a new app version into AWS CodeArtifact
  2. CodeArtifact triggers an event on Amazon EventBridge
  3. The infrastructure team listens to this event
  4. The infrastructure team updates its CDK stack to include the latest appPackageVersion
  5. The infrastructure team’s CDK Stack gets deployed

And very similar the release of a new version of the CDK Construct:

A release of the application team's CDK construct triggers a change and deployment of the infrastructure team's CDK Stack

Figure 2 – A release of the application team’s CDK construct triggers a change and deployment of the infrastructure team’s CDK Stack

  1. The application team publishes a new CDK construct version into AWS CodeArtifact
  2. CodeArtifact triggers an event on Amazon EventBridge
  3. The infrastructure team listens to this event
  4. The infrastructure team updates its dependency to the latest CDK construct
  5. The infrastructure team’s CDK Stack gets deployed

We will not go into the details on how such a workflow could look like, because it’s most likely highly custom for each team (think of different tools used for code repositories, CI/CD). However, here are some ideas on how it can be accomplished:

Updating the CDK Construct dependency

To update the dependency version of the CDK construct the infrastructure team’s package.json (or other files used for dependency tracking like pom.xml) needs to be updated. You can build automation to checkout the source code and issue a command like npm install sample-app-construct@NEW_VERSION (where NEW_VERSION is the value read from the EventBridge event payload). You then automatically create a pull request to incorporate this change into your main branch. For a sample on what this looks like see the blog post Keeping up with your dependencies: building a feedback loop for shared librares.

Updating the appPackageVersion

To update the appPackageVersion used inside of the infrastructure team’s CDK Stack you can either follow the same approach outlined above, or you can use CDK’s capability to read from an AWS Systems Manager (SSM) Parameter Store parameter. With that you wouldn’t put the value for appPackageVersion into source control, but rather read it from SSM Parameter Store. There is a how-to for this in the AWS CDK documentation: Get a value from the Systems Manager Parameter Store. You then start the infrastructure team’s pipeline based on the event of a change in the parameter.

To have a clear understanding of what is deployed at any given time and in order to see the used parameter value in CloudFormation I’d recommend using the option described at Reading Systems Manager values at synthesis time.

Conclusion

You’ve seen how the AWS Cloud Development Kit and its Construct concept can help to ensure team independence and agility even though multiple teams (in our case an application development team and an infrastructure team) work together to bring a new version of an application into production. To do so you have put the application team in charge of not only their application code, but also of the parts of the infrastructure they use to run their application on. This is still in line with the discussed split-team approach as all shared infrastructure as well as the final deployment is in control of the infrastructure team and is only consumed by the application team’s construct.

About the Authors

Picture of the author Joerg Woehrle As a Solutions Architect Jörg works with manufacturing customers in Germany. Before he joined AWS in 2019 he held various roles like Developer, DevOps Engineer and SRE. With that Jörg enjoys building and automating things and fell in love with the AWS Cloud Development Kit.
Picture of the author Mohamed Othman Mo joined AWS in 2020 as a Technical Account Manager, bringing with him 7 years of hands-on AWS DevOps experience and 6 year as System operation admin. He is a member of two Technical Field Communities in AWS (Cloud Operation and Builder Experience), focusing on supporting customers with CI/CD pipelines and AI for DevOps to ensure they have the right solutions that fit their business needs.

How to use granular geographic match rules with AWS WAF

Post Syndicated from Mohit Mysore original https://aws.amazon.com/blogs/security/how-to-use-granular-geographic-match-rules-with-aws-waf/

In November 2022, AWS introduced support for granular geographic (geo) match conditions in AWS WAF. This blog post demonstrates how you can use this new feature to customize your AWS WAF implementation and improve the security posture of your protected application.

AWS WAF provides inline inspection of inbound traffic at the application layer. You can use AWS WAF to detect and filter common web exploits and bots that could affect application availability or security, or consume excessive resources. Inbound traffic is inspected against web access control list (web ACL) rules. A web ACL rule consists of rule statements that instruct AWS WAF on how to inspect a web request.

The AWS WAF geographic match rule statement functionality allows you to restrict application access based on the location of your viewers. This feature is crucial for use cases like licensing and legal regulations that limit the delivery of your applications outside of specific geographic areas.

AWS recently released a new feature that you can use to build precise geographic rules based on International Organization for Standardization (ISO) 3166 country and area codes. With this release, you can now manage access at the ISO 3166 region level. This capability is available across AWS Regions where AWS WAF is offered and for all AWS WAF supported services. In this post, you will learn how to use this new feature with Amazon CloudFront and Elastic Load Balancing (ELB) origin types.

Summary of concepts

Before we discuss use cases and setup instructions, make sure that you are familiar with the following AWS services and concepts:

  • Amazon CloudFront: CloudFront is a web service that gives businesses and web application developers a cost-effective way to distribute content with low latency and high data transfer speeds.
  • Amazon Simple Storage Service (Amazon S3): Amazon S3 is an object storage service built to store and retrieve large amounts of data from anywhere.
  • Application Load Balancer: Application Load Balancer operates at the request level (layer 7), routing traffic to targets—Amazon Elastic Compute Cloud (Amazon EC2) instances, IP addresses, and Lambda functions—based on the content of the request.
  • AWS WAF labels: Labels contain metadata that can be added to web requests when a rule is matched. Labels can alter the behavior or default action of managed rules.
  • ISO (International Organization for Standardization) 3166 codes: ISO codes are internationally recognized codes that designate for every country and most of the dependent areas a two- or three-letter combination. Each code consists of two parts, separated by a hyphen. For example, in the code AU-QLD, AU is the ISO 3166 alpha-2 code for Australia, and QLD is the subdivision code of the state or territory—in this case, Queensland.

How granular geo labels work

Previously, geo match statements in AWS WAF were used to allow or block access to applications based on country of origin of web requests. With updated geographic match rule statements, you can control access at the region level.

In a web ACL rule with a geo match statement, AWS WAF determines the country and region of a request based on its IP address. After inspection, AWS WAF adds labels to each request to indicate the ISO 3166 country and region codes. You can use labels generated in the geo match statement to create a label match rule statement to control access.

AWS WAF generates two types of labels based on origin IP or a forwarded IP configuration that is defined in the AWS WAF geo match rule. These labels are the country and region labels.

By default, AWS WAF uses the IP address of the web request’s origin. You can instruct AWS WAF to use an IP address from an alternate request header, like X-Forwarded-For, by enabling forwarded IP configuration in the rule statement settings. For example, the country label for the United States with origin IP and forwarded IP configuration are awswaf:clientip:geo:country:US and awswaf:forwardedip:geo:country:US, respectively. Similarly, the region labels for a request originating in Oregon (US) with origin and forwarded IP configuration are awswaf:clientip:geo:region:US-OR and awswaf:forwardedip:geo:region:US-OR, respectively.

To demonstrate this AWS WAF feature, we will outline two distinct use cases.

Use case 1: Restrict content for copyright compliance using AWS WAF and CloudFront

Licensing agreements might prevent you from distributing content in some geographical locations, regions, states, or entire countries. You can deploy the following setup to geo-block content in specific regions to help meet these requirements.

In this example, we will use an AWS WAF web ACL that is applied to a CloudFront distribution with an S3 bucket origin. The web ACL contains a geo match rule to tag requests from Australia with labels, followed by a label match rule to block requests from the Queensland region. All other requests with source IP originating from Australia are allowed.

To configure the AWS WAF web ACL rule for granular geo restriction

  1. Follow the steps to create an Amazon S3 bucket and CloudFront distribution with the S3 bucket as origin.
  2. After the CloudFront distribution is created, open the AWS WAF console.
  3. In the navigation pane, choose Web ACLs, select Global (CloudFront) from the dropdown list, and then choose Create web ACL.
  4. For Name, enter a name to identify this web ACL.
  5. For Resource type, choose the CloudFront distribution that you created in step 1, and then choose Add.
  6. Choose Next.
  7. Choose Add rules, and then choose Add my own rules and rule groups.
  8. For Name, enter a name to identify this rule.
  9. For Rule type, choose Regular rule.
  10. Configure a rule statement for a request that matches the statement Originates from a Country and select the Australia (AU) country code from the dropdown list.
  11. Set the IP inspection configuration parameter to Source IP address.
  12. Under Action, choose Count, and then choose Add Rule.
  13. Create a new rule by following the same actions as in step 7 and enter a name to identify the rule.
  14. For Rule type, choose Regular rule.
  15. Configure a rule statement for a request that matches the statement Has a Label and enter awswaf:clientip:geo:region:AU-QLD for the match key.
  16. Set the action to Block and choose Add rule.
  17. For Actions, keep the default action of Allow.
  18. For Amazon CloudWatch metrics, select the AWS WAF rules that you created in steps 8 and 14.
  19. For Request sampling options, choose Enable sampled requests, and then choose Next.
  20. Review and create the web ACL rule.

After the web ACL is created, you should see the web ACL configuration, as shown in the following figures. Figure 1 shows the geo match rule configuration.

Figure 1: Web ACL rule configuration

Figure 1: Web ACL rule configuration

Figure 2 shows the Queensland regional geo restriction.

Figure 2: Queensland regional geo restriction - web ACL configuration<

Figure 2: Queensland regional geo restriction – web ACL configuration<

The setup is now complete—you have a web ACL with two regular rules. The first rule matches requests that originate from Australia and adds geographic labels automatically. The label match rule statement inspects requests with Queensland granular geo labels and blocks them. To understand where requests are originating from, you can configure logging on the AWS WAF web ACL.

You can test this setup by making requests from Queensland, Australia, to the DNS name of the CloudFront distribution to invoke a block. CloudFront will return a 403 error, similar to the following example.

$ curl -IL https://abcdd123456789.cloudfront.net
HTTP/2 403 
server: CloudFront
date: Tue, 21 Feb 2023 22:06:25 GMT
content-type: text/html
content-length: 919
x-cache: Error from cloudfront
via: 1.1 abcdd123456789.cloudfront.net (CloudFront)
x-amz-cf-pop: SYD1-C1

As shown in these test results, requests originating from Queensland, Australia, are blocked.

Use case 2: Allow incoming traffic from specific regions with AWS WAF and Application Load Balancer

We recently had a customer ask us how to allow traffic from only one region, and deny the traffic from other regions within a country. You might have similar requirements, and the following section will explain how to achieve that. In the example, we will show you how to allow only visitors from Washington state, while disabling traffic from the rest of the US.

This example uses an AWS WAF web ACL applied to an application load balancer in the US East (N. Virginia) Region with an Amazon EC2 instance as the target. The web ACL contains a geo match rule to tag requests from the US with labels. After we enable forwarded IP configuration, we will inspect the X-Forwarded-For header to determine the origin IP of web requests. Next, we will add a label match rule to allow requests from the Washington region. All other requests from the United States are blocked.

To configure the AWS WAF web ACL rule for granular geo restriction

  1. Follow the steps to create an internet-facing application load balancer in the US East (N. Virginia) Region.
  2. After the application load balancer is created, open the AWS WAF console.
  3. In the navigation pane, choose Web ACLs, and then choose Create web ACL in the US east (N. Virginia) Region.
  4. For Name, enter a name to identify this web ACL.
  5. For Resource type, choose the application load balancer that you created in step 1 of this section, and then choose Add.
  6. Choose Next.
  7. Choose Add rules, and then choose Add my own rules and rule groups.
  8. For Name, enter a name to identify this rule.
  9. For Rule type, choose Regular rule.
  10. Configure a rule statement for a request that matches the statement Originates from a Country in, and then select the United States (US) country code from the dropdown list.
  11. Set the IP inspection configuration parameter to IP address in Header.
  12. Enter the Header field name as X-Forwarded-For.
  13. For Match, choose Fallback for missing IP address. Web requests without a valid IP address in the header will be treated as a match and will be allowed.
  14. Under Action, choose Count, and then choose Add Rule.
  15. Create a new rule by following the same actions as in step 7 of this section, and enter a name to identify the rule.
  16. For Rule type, choose Regular rule.
  17. Configure a rule statement for a request that matches the statement Has a Label, and for the match key, enter awswaf:forwardedip:geo:region:US-WA.
  18. Set the action to Allow and add choose Add Rule.
  19. For Default web ACL action for requests that don’t match any rules, set the Action to Block.
  20. For Amazon CloudWatch metrics, select the AWS WAF rules that you created in steps 8 and 14 of this section.
  21. For Request sampling options, choose Enable sampled requests, and then choose Next.
  22. Review and create the web ACL rule.

After the web ACL is created, you should see the web ACL configuration, as shown in the following figures. Figure 3 shows the geo match rule

Figure 3: Geo match rule

Figure 3: Geo match rule

Figure 4 shows the Washington regional geo restriction.

Figure 4: Washington regional geo restriction - web ACL configuration

Figure 4: Washington regional geo restriction – web ACL configuration

The following is a JSON representation of the rule:

{
  "Name": "WashingtonRegionAllow",
  "Priority": 1,
  "Statement": {
    "LabelMatchStatement": {
      "Scope": "LABEL",
      "Key": "awswaf:forwardedip:geo:region:US-WA"
    }
  },
  "Action": {
    "Allow": {}
  },
  "VisibilityConfig": {
    "SampledRequestsEnabled": true,
    "CloudWatchMetricsEnabled": true,
    "MetricName": "USRegionalRestriction"
  }
}

The setup is now complete—you have a web ACL with two regular rules. The first rule matches requests that originate from the US after inspecting the origin IP in the X-Forwarded-For header, and adds geographic labels. The label match rule statement inspects requests with the Washington region granular geo labels and allows these requests.

If a user makes a web request from outside of the Washington region, the request will be blocked and a HTTP 403 error response will be returned, similar to the following.

curl -IL https://GeoBlock-1234567890.us-east-1.elb.amazonaws.com
HTTP/1.1 403 Forbidden
Server: awselb/2.0
Date: Tue, 21 Feb 2023 22:07:54 GMT
Content-Type: text/html
Content-Length: 118
Connection: keep-alive

Conclusion

AWS WAF now supports the ability to restrict traffic based on granular geographic labels. This gives you further control based on geographic location within a country.

In this post, we demonstrated two different use cases that show how this feature can be applied with CloudFront distributions and application load balancers. Note that, apart from CloudFront and application load balancers, this feature is supported by other origin types that are supported by AWS WAF, such as Amazon API Gateway and Amazon Cognito.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS WAF re:Post or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Mohit Mysore

Mohit Mysore

Mohit is a Technical Account Manager with over 5 years of experience working with AWS Customers. He is passionate about network and system administration. Outside work, He likes to travel, watch soccer and F1 and spend time with his family.

Maintaining Code Quality with Amazon CodeCatalyst Reports

Post Syndicated from Imtranur Rahman original https://aws.amazon.com/blogs/devops/maintaining-code-quality-with-amazon-codecatalyst-reports/

Amazon CodeCatalyst reports contain details about tests that occur during a workflow run. You can create tests such as unit tests, integration tests, configuration tests, and functional tests. You can use a test report to help troubleshoot a problem during a workflow.

Introduction

In prior posts in this series, I discussed reading The Unicorn Project, by Gene Kim, and how the main character, Maxine, struggles with a complicated Software Development Lifecycle (SDLC) after joining a new team. One of the challenges she encounters is the difficulties in shipping secure, functioning code without an automated testing mechanism. To quote Gene Kim, “Without automated testing, the more code we write, the more money it takes for us to test.”

Software Developers know that shipping vulnerable or non-functioning code to a production environment is to be avoided at all costs; the monetary impact is high and the toll it takes on team morale can be even greater. During the SDLC, developers need a way to easily identify and troubleshoot errors in their code.

In this post, I will focus on how developers can seamlessly run tests as a part of workflow actions as well as configure unit test and code coverage reports with Amazon CodeCatalyst. I will also outline how developers can access these reports to gain insights into their code quality.

Prerequisites

If you would like to follow along with this walkthrough, you will need to:

Walkthrough

As with the previous posts in the CodeCatalyst series, I am going to use the Modern Three-tier Web Application blueprint. Blueprints provide sample code and CI/CD workflows to help you get started easily across different combinations of programming languages and architectures. To follow along, you can re-use a project you created previously, or you can refer to a previous post that walks through creating a project using the Three-tier blueprint.

Once the project is deployed, CodeCatalyst opens the project overview. This view shows the content of the README file from the project’s source repository, workflow runs, pull requests, etc. The source repository and workflow are created for me by the project blueprint. To view the source code, I select Code → Source Repositories from the left-hand navigation bar. Then, I select the repository name link from the list of source repositories.

Figure 1. List of source repositories including Mythical Mysfits source code.

Figure 1. List of source repositories including Mythical Mysfits source code.

From here I can view details such as the number of branches, workflows, commits, pull requests and source code of this repo. In this walkthrough, I’m focused on the testing capabilities of CodeCatalyst. The project already includes unit tests that were created by the blueprint so I will start there.

From the Files list, navigate to web → src → components→ __tests__ → TheGrid.spec.js. This file contains the front-end unit tests which simply check if the strings “Good”, “Neutral”, “Evil” and “Lawful”, “Neutral”, “Chaotic” have rendered on the web page. Take a moment to examine the code. I will use these tests throughout the walkthrough.

Figure 2. Unit test for the front-end that test strings have been rendered properly.

Figure 2. Unit test for the front-end that test strings have been rendered properly. 

Next, I navigate to the  workflow that executes the unit tests. From the left-hand navigation bar, select CI/CD → Workflows. Then, find ApplicationDeploymentPipeline, expand Recent runs and select  Run-xxxxx . The Visual tab shows a graphical representation of the underlying YAML file that makes up this workflow. It also provides details on what started the workflow run, when it started,  how long it took to complete, the source repository and whether it succeeded.

Figure 3. The Deployment workflow open in the visual designer.

Figure 3. The Deployment workflow open in the visual designer.

Workflows are comprised of a source and one or more actions. I examined test reports for the back-end in a prior post. Therefore, I will focus on the front-end tests here. Select the build_and_test_frontend action to view logs on what the action ran, its configuration details, and the reports it generated. I’m specifically interested in the Unit Test and Code Coverage reports under the Reports tab:

Figure 4. Reports tab showing line and branch coverage.

Figure 4. Reports tab showing line and branch coverage.

Select the report unitTests.xml (you may need to scroll). Here, you can see an overview of this specific report with metrics like pass rate, duration, test suites, and the test cases for those suites:

Figure 5. Detailed report for the front-end tests

Figure 5. Detailed report for the front-end tests.

This report has passed all checks.  To make this report more interesting, I’ll intentionally edit the unit test to make it fail. First, navigate back to the source repository and open web → src → components→ __tests__→TheGrid.spec.js. This test case is looking for the string “Good” so change it to say “Best” instead and commit the changes.

Figure 6. Front-End Unit Test Code Change.

Figure 6. Front-End Unit Test Code Change.

This will automatically start a new workflow run. Navigating back to CI/CD →  Workflows, you can see a new workflow run is in progress (takes ~7 minutes to complete).

Once complete, you can see that the build_and_test_frontend action failed. Opening the unitTests.xml report again, you can see that the report status is in a Failed state. Notice that the minimum pass rate for this test is 100%, meaning that if any test case in this unit test ever fails, the build fails completely.

There are ways to configure these minimums which will be explored when looking at Code Coverage reports. To see more details on the error message in this report, select the failed test case.

Figure 7. Failed Test Case Error Message.

Figure 7. Failed Test Case Error Message.

As expected, this indicates that the test was looking for the string “Good” but instead, it found the string “Best”. Before continuing, I return to the TheGrid.spec.js file and change the string back to “Good”.

CodeCatalyst also allows me to specify code and branch coverage criteria. Coverage is a metric that can help you understand how much of your source was tested. This ensures source code is properly tested before shipping to a production environment. Coverage is not configured for the front-end, so I will examine the coverage of the back-end.

I select Reports on the left-hand navigation bar, and open the report called backend-coverage.xml. You can see details such as line coverage, number of lines covered, specific files that were scanned, etc.

Figure 8. Code Coverage Report Succeeded.

Figure 8. Code Coverage Report Succeeded.

The Line coverage minimum is set to 70% but the current coverage is 80%, so it succeeds. I want to push the team to continue improving, so I will edit the workflow to raise the minimum threshold to 90%. Navigating back to CI/CD → Workflows → ApplicationDeploymentPipeline, select the Edit button. On the Visual tab, select build_backend. On the Outputs tab, scroll down to Success Criteria and change Line Coverage to 90%.

Figure 9. Configuring Code Coverage Success Criteria.

Figure 9. Configuring Code Coverage Success Criteria.

On the top-right, select Commit. This will push the changes to the repository and start a new workflow run. Once the run has finished, navigate back to the Code Coverage report. This time, you can see it reporting a failure to meet the minimum threshold for Line coverage.

Figure 10. Code Coverage Report Failed.

There are other success criteria options available to experiment with. To learn more about success criteria, see Configuring success criteria for tests.

Cleanup

If you have been following along with this workflow, you should delete the resources you deployed so you do not continue to incur charges. First, delete the two stacks that CDK deployed using the AWS CloudFormation console in the AWS account you associated when you launched the blueprint. These stacks will have names like mysfitsXXXXXWebStack and mysfitsXXXXXAppStack. Second, delete the project from CodeCatalyst by navigating to Project settings and choosing Delete project.

Summary

In this post, I demonstrated how Amazon CodeCatalyst can help developers quickly configure test cases, run unit/code coverage tests, and generate reports using CodeCatalyst’s workflow actions. You can use these reports to adhere to your code testing strategy as a software development team. I also outlined how you can use success criteria to influence the outcome of a build in your workflow.  In the next post, I will demonstrate how to configure CodeCatalyst workflows and integrate Software Composition Analysis (SCA) reports. Stay tuned!

About the authors:

Imtranur Rahman

Imtranur Rahman is an experienced Sr. Solutions Architect in WWPS team with 14+ years of experience. Imtranur works with large AWS Global SI partners and helps them build their cloud strategy and broad adoption of Amazon’s cloud computing platform.Imtranur specializes in Containers, Dev/SecOps, GitOps, microservices based applications, hybrid application solutions, application modernization and loves innovating on behalf of his customers. He is highly customer obsessed and takes pride in providing the best solutions through his extensive expertise.

Wasay Mabood

Wasay is a Partner Solutions Architect based out of New York. He works primarily with AWS Partners on migration, training, and compliance efforts but also dabbles in web development. When he’s not working with customers, he enjoys window-shopping, lounging around at home, and experimenting with new ideas.

How to monitor and query IAM resources at scale – Part 2

Post Syndicated from Michael Chan original https://aws.amazon.com/blogs/security/how-to-monitor-and-query-iam-resources-at-scale-part-2/

In this post, we continue with our recommendations for using AWS Identity and Access Management (IAM) APIs. In part 1 of this two-part series, we described how you could create IAM resources and use them soon after for authorization decisions. We also described options for monitoring and responding to IAM resource changes for entire accounts. Now, in part 2 of this post, we’ll cover the API throttling behavior of IAM and AWS Security Token Service (AWS STS) and how you can effectively plan your usage of these APIs. We’ll also cover the features of IAM that enable you to right-size the permissions granted to principals in your organization and assess external access to your resources.

Increase your usage of IAM APIs

If you’re a developer or a security engineer, you might find yourself writing more and more automation that interacts with IAM APIs. Other engineers, teams, or applications might also call the IAM APIs within the same account or cross-account. Over time, anyone calling the APIs in your account incrementally increases the number of requests per second. If so, IAM might send a “Rate exceeded” error that indicates you have exceeded a certain threshold of API calls per second. This is called API throttling.

Understand IAM API throttling

API throttling occurs when you exceed the call rate limits for an API. AWS uses API throttling to limit requests to a service. Like many AWS services, IAM limits API requests to maintain the performance of the service, and to ensure fair usage across customers. IAM and AWS STS independently implement a token bucket algorithm for throttling, in which a bucket of virtual tokens is refilled every second. Each token represents a non-throttled API call that you can make. The number of tokens that a bucket holds and the refill rate depends on the API. For each IAM API, a number of token buckets might apply.

We refer to this simply as rate-limiting criteria. Essentially, there are several rate-limiting criteria that are considered when evaluating whether a customer is generating more traffic than the service allows. The following are some examples of these criteria:

  • The account where the API is called
  • The account for read or write APIs (depending on whether the API is a read or write operation)
  • The account from which AssumeRole was called prior to the API call (for example, third-party cross-account calls)
  • The account from which AssumeRole was called prior to the API call for read APIs
  • The API and organization where the API is called

Understand STS API throttling

Although IAM has criteria pertaining to the account from which AssumeRole was called, IAM has its own API rate limits that are distinct from AWS STS. Therefore, the preceding criteria are IAM-specific and are separate from the throttling that can occur if you call STS APIs. IAM is also a global service, and the limits are not Region-aware. In contrast, while STS has a single global endpoint, every Region has its own STS endpoint with its own limits.

The STS rate-limiting criteria pertain to each account and endpoint for API calls. For example, if you call the AssumeRole API against the sts.ap-northeast-1.amazonaws.com endpoint, STS will evaluate the rate-limiting criteria associated with that account and the ap-northeast-1 endpoint. Other STS API requests that you perform under the same account and endpoint will also count towards these criteria. However, if you make a request from the same account to a different regional endpoint or the global endpoint, that request will count against different criteria.

Note: AWS recommends that you use the STS regional endpoints instead of the STS global endpoint. Regional endpoints have several benefits, including redundancy and reduced latency. To learn more about other benefits, see Managing AWS STS in an AWS Region.

How multiple criteria affect throttling

The preceding examples show the different ways that IAM and STS can independently limit requests. Limits might be applied at the account level and across read or write APIs. More than one rate-limiting criterion is typically associated with an API call, with each request counted against each rate-limiting criterion independently. This means that if the requests-per-second exceeds the applicable criteria, then API throttling occurs and returns a rate-limiting error.

How to address IAM and STS API throttling

In this section, we’ll walk you through some strategies to reduce IAM and STS API throttling.

Query for top callers

With AWS CloudTrail Lake, your organization can aggregate, store, and query events recorded by CloudTrail for auditing, security investigation, and operational troubleshooting. To monitor API throttling, you can run a simple query that identifies the top callers of IAM and STS.

For example, you can make a SQL-based query in the CloudTrail console to identify the top callers of IAM, as shown in Figure 1. This query includes the API count, API event name, and more that are made to IAM (shown under eventSource). In this example, the top result is a call to GetServiceLastAccessedDetails, which occurred 163 times. The result includes the account ID and principal ID that made those requests.

Figure 1: Example AWS CloudTrail Lake query

Figure 1: Example AWS CloudTrail Lake query

You can enable AWS CloudTrail Lake for all accounts in your organization. For more information, see Announcing AWS CloudTrail Lake – a managed audit and security Lake. For sample queries, including top IAM actions by principal, see cloud-trail-lake-query-samples in our aws-samples GitHub repository.

Know when you exceed API call rate limits

To reduce call throttling, you need to know when you exceed a rate limit. You can identify when you are being throttled by catching the RateLimitExceeded exception in your API calls. Or, you can send your application logs to Amazon CloudWatch Logs and then configure a metric filter to record each time that throttling occurs, for later analysis or notification. Ideally, you should do this across your applications, and log this information centrally so that you can investigate whether calls from a specific account (such as your central monitoring account) are affecting API availability across your other accounts by exceeding a rate-limiting criterion in those accounts.

Call your APIs with a less aggressive retry strategy

In the AWS SDKs, you can use the existing retry library and provide a custom base for the initial sleep done between API calls. For example, you can set a custom configuration for the backoff or edit the defaults directly. The default SDK_DEFAULT_THROTTLED_BASE_DELAY is 500 milliseconds (ms) in the relevant Java SDK file, but if you’re experiencing throttling consistently, we recommend a minimum 1000 ms for the throttled base delay. You can change this value or implement a custom configuration through the PredefinedBackoffStrategies.SDKDefaultBackoffStrategy() class, which is referenced in the same file. As another example, in the Javascript SDK, you can edit the base retry of the retryDelayOptions configuration in the AWS.Config class, as described in the documentation.

The difference between making these changes and using the SDK defaults is that the custom base provides a less aggressive retry. You shouldn’t retry multiple requests that are throttled during the same one-second window. If the API has other applicable rate-limiting criteria, you can potentially exceed those limits as well, preventing other calls in your account from performing requests. Lastly, be careful that you don’t implement your own retry or backoff logic on top of the SDK retry or backoff logic because this could make throttling worse — instead, you should override the SDK defaults.

Reduce the number of requests by using max items

For some APIs, you can increase the number of items returned by a single API call. Consider the example of the GetServiceLastAccessedDetails API. This API returns a lot of data, but the results are truncated by default to 100 items, ordered alphabetically by the service namespace. If the number of items returned is greater than 100, then the results are paginated, and you need to make multiple requests to retrieve the paginated results individually. But if you increase the value of the MaxItems parameter, you can decrease the number of requests that you need to make to obtain paginated results.

AWS has hundreds of services, so you should set the value of the MaxItems parameter no higher than your application can handle (the response size could be large). At the time of our testing, the results were no longer truncated when this value was 300. For this particular API, IAM might return fewer results, even when more results are available. This means that your code still needs to check whether the results are paginated and make an additional request if paginated results are available.

Consistent use of the MaxItems parameter across AWS APIs can help reduce your total number of API requests. The MaxItems parameter is also available through the GetOrganizationsAccessReport operation, which defaults to 100 items but offers a maximum of 1000 items, with the same caveat that fewer results might be returned, so check for paginated results.

Smooth your high burst traffic

In the table from part 1 of this post, we stated that you should evaluate IAM resources every 24 hours. However, if you use a simple script to perform this check, you could initiate a throttling event. Consider the following fictional example:

As a member of ExampleCorp’s Security team, you are working on a task to evaluate the company’s IAM resources through some custom evaluation scripts. The scripts run in a central security account. ExampleCorp has 1000 accounts. You write automation that assumes a role in every account to run the GetAccountAuthorizationDetails API call. Everything works fine during development on a few accounts, but you later build a dashboard to graph the data. To get the results faster for the dashboard, you update your code to run concurrently every hour. But after this change, you notice that many requests result in the throttling error “Rate exceeded.” Other security teams see “Rate exceeded” errors in their application logs, too.

Can you guess what happened? When you tried to make the requests faster, you used concurrency to make the requests run in parallel. By initiating this large number of requests simultaneously, you exceeded the rate-limiting criteria for the security account from which the sts:AssumeRole action was called prior to the GetAccountAuthorizationDetails API call.

To address scenarios like this, we recommend that you set your own client-side limitations when you need to make a large number of API requests. You can spread these calls out so that they happen sequentially and avoid large spikes. For example, if you run checks every 24 hours, make sure that the calls don’t happen at exactly midnight. Figure 2 shows two different ways to distribute API volume over time:

Figure 2: Call volume that periodically spikes compared to evenly-distributed call volume

Figure 2: Call volume that periodically spikes compared to evenly-distributed call volume

The graph on the left represents a large, recurring API call volume, with calls occurring at roughly the same time each day—such as 1000 requests at midnight every 24 hours. However, if you intend to make these 1000 requests consistently every 24 hours, you can spread them out over the 24-hour period. This means that you could make about 41 requests every hour, so that 41 accounts are queried at 00:00 UTC, another 41 the next hour, and so on. By using this strategy, you can make these requests blend into your other traffic, as shown in the graph on the right, rather than create large spikes. In summary, your automation scheduler should avoid large spikes and distribute API requests evenly over the 24-hour period. Using queues such as those provided by Amazon Simple Queue Service (Amazon SQS) can also help, and when errors are identified, you can put them in a dead letter queue to try again later.

Some IAM APIs have rate-limiting criteria for API requests made from the account from which the AssumeRole was called prior to the call. We recommend that you serially iterate over the accounts in your organization to avoid throttling. To continue with our example, you should iterate the 41 accounts one-by-one each hour, rather than running 41 calls at once, to reduce spikes in your request rates.

Recommendations specific to STS

You can adjust how you use AWS STS to reduce your number of API calls. When you write code that calls the AssumeRole API, you can reuse the returned credentials for future requests because the credentials might still be valid. Imagine that you have an event-driven application running in a central account that assumes a role in a target account and does an API call for each event that occurs in that account. You should consider reusing the credentials returned by the AssumeRole call for each subsequent call in the target account, especially if calls in the central accounts are being throttled. You can do this for AssumeRole calls because there is no service-side limit to the number of credentials that you can create and use. Whether it’s one credential or many, you need to use and store these carefully. You can also adjust the role session duration, which determines how long the role’s credentials are valid. This value can be up to 12 hours, depending on the maximum session duration configured on the role. If you reuse short-term credentials or adjust the session duration, make sure that you evaluate these changes from a security perspective as you optimize your use of STS to reduce API call volume.

Use case #3: Pare down permissions for least-privileged access

Let’s assume that you want to evaluate your organization’s IAM resources with some custom evaluation scripts. AWS has native functionality that can reduce your need for a custom solution. Let’s take a look at some of these that can help you accomplish these goals.

Identify unintended external sharing

To identify whether resources in your accounts, such as IAM roles and S3 buckets, have been shared with external entities, you can use IAM Access Analyzer instead of writing your own checks. With IAM Access Analyzer, you can identify whether resources are accessible outside your account or even your entire organization. Not only can you identify these resources on-demand, but IAM Access Analyzer proactively re-analyzes resources when their policies change, and reports new findings. This can help you feel confident that you will be notified of new external sharing of supported resources, so that you can act quickly to investigate. For more details, see the IAM Access Analyzer user guide.

Right-size permissions

You can also use IAM Access Analyzer to help right-size the permissions policies for key roles in your accounts. IAM Access Analyzer has a policy generation feature that allows you to generate a policy by analyzing your CloudTrail logs to identify actions used from over 140 services. You can compare this generated policy with the existing policy to see if permissions are unused, and if so, remove them.

You can perform policy generation through the API or the IAM console. For example, you can use the console to navigate to the role that you want to analyze, and then choose Generate policy to start analyzing the actions used over a specified period. Actions that are missing from the generated policy are permissions that can be potentially removed from the existing policy, after you confirm your changes with those who administer the IAM role. To learn more about generating policies based on CloudTrail activity, see IAM Access Analyzer makes it easier to implement least privilege permissions by generating IAM policies based on access activity.

Conclusion

In this two-part series, you learned more about how to use IAM so that you can test and query IAM more efficiently. In this post, you learned about the rate-limiting criteria for IAM and STS, to help you address API throttling when increasing your usage of these services. You also learned how IAM Access Analyzer helps you identify unintended resource sharing while also generating policies that serve as a baseline for principals in your account. In part 1, you learned how to quickly create IAM resources and use them when refining permissions. You also learned how to get information about IAM resources and respond to IAM changes through the various services integrated with IAM. Lastly, when calling IAM directly, you learned about bulk APIs, which help you efficiently retrieve the state of your principals and policies. We hope these posts give you valuable insights about IAM to help you better monitor, review, and secure access to your AWS cloud environment!

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the Security, Identity, & Compliance re:Post or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Michael Chan

Michael Chan

Michael is a Senior Solutions Architect for AWS Identity who has advised financial services and global customers of AWS. He enjoys understanding customer problems with identity and access management and helping them solve their security issues at scale.

Author photo

Joshua Du Lac

Josh is a Senior Manager of Security Solutions Architects at AWS. Based out of Texas, he has advised dozens of enterprise, global, and financial services customers to accelerate their journey to the cloud while improving their security along the way. Outside of work, Josh enjoys searching for the best tacos in Texas and practicing his handstands.

How to monitor and query IAM resources at scale – Part 1

Post Syndicated from Michael Chan original https://aws.amazon.com/blogs/security/how-to-monitor-and-query-iam-resources-at-scale-part-1/

In this two-part blog post, we’ll provide recommendations for using AWS Identity and Access Management (IAM) APIs, and we’ll share useful details on how IAM works so that you can use it more effectively. For example, you might be creating new IAM resources such as roles and policies through automation and notice a delay for resource propagations. Or you might be building a custom cloud security monitoring solution that uses IAM APIs to evaluate the security and compliance of your AWS accounts, and you want to know how to do that without exceeding limits. Although these are just a few example use cases, the insights described in this post are intended to help you avoid anti-patterns when building scalable cloud services that use IAM APIs.

In this post, we describe how to create IAM resources and use them soon after for authorization decisions. We also describe options for monitoring and responding to IAM resource changes for entire accounts. In part 2, we’ll cover the API throttling behavior of IAM and AWS Security Token Service (AWS STS) and how you can effectively plan your usage of these APIs. Let’s dive in!

Use case 1: Create IAM resources and attempt to use them immediately

If you’re a cloud developer, you create and use IAM resources when you develop applications on AWS. For your application to interact with AWS services, you need to grant IAM permissions to your application. Your application—whether it runs on AWS Lambda, Amazon Elastic Compute Cloud (Amazon EC2), or another service—will need an associated IAM role and policy that provide the necessary permissions.

Imagine that you want to create least privilege policies for your application. You begin by deploying new or updated IAM resources, such as roles and policies, along with your application updates, and you automate this process to speed up testing and development.

During development, you begin removing unnecessary policy permissions, with your automation testing the updated permissions. However, you notice that some of your updates do not immediately take effect. The following sections address why this occurs and provide insights to help you architect for other scenarios.

Understand the IAM control plane and data plane

Let’s first learn more about the control plane and data plane in IAM. The control plane involves operations to create, read, update, and delete IAM resources, and it’s how you get the current state of IAM. When you invoke IAM APIs, you interact with the control plane. This includes any API that falls under the iam:* namespace. The data plane, in contrast, consists of the authorization system that is used at scale to grant access to the broader set of AWS services and resources. This includes the AWS STS APIs, which have their own sts:* namespace.

When you call the IAM control plane APIs to create, update, or delete resources, you can expect a read-after-write consistent response. This means that you can retrieve (read) the resource and its latest updates immediately after it’s written. In contrast, the IAM data plane, where authorizations occur, is eventually consistent. This means that there will be a delay for IAM resource changes, such as updates to roles and policies, to propagate and reflect in the authorizations that follow. The delay can be several seconds or longer. Because of this, you need to allow for propagation time when you test changes to IAM resources. To learn more about the control plane and data plane of IAM, see Resilience in AWS Identity and Access Management.

Note: Because calls to AWS APIs rely on IAM to check permissions, the availability and scalability of the data plane are paramount. In 2011, the “can the caller do this?” function handled a couple of thousand requests per second. Today, as new services continue to launch and the number of AWS customers increases, AWS Identity handles over half a billion API calls per second worldwide, and the number is growing. Eventually consistent design enables the IAM data plane to maintain the high availability and low latency needed to evaluate permissions on AWS.

This is why when architecting your application, we recommend that you don’t depend on control plane actions such as resource updates for critical parts of your application’s workflow. Instead, you should architect to take advantage of the data plane, which includes STS and the authorization system of IAM. In the next section, we describe how you can do this.

Test permissions with STS scope-down policies

IAM role sessions have a feature called a session policy, which takes effect immediately when a role is assumed. This is an optional policy that you can provide to scope down the role’s existing identity policies, with the permissions being the intersection of the role’s identity-based policies and the session policy. By using session policies, you get specific, scoped-down credentials from a single pre-existing role without having to create new roles or identity policies for each particular session’s use case. You can use session policies for your application or when you test which least privilege policies are best for your application.

Let’s walk through an example of when to use session policies for permissions testing. Imagine that you need permissions that require very specific, fine-grained conditions to attain your ideal least privilege policy. You might iterate on the policy several times, making updates and testing the changes over and over again. If you update a policy attached to a role, you need to wait for these changes to propagate to the IAM data plane. But if you instead specify a scope-down policy when assuming the pre-existing role prior to testing, you can immediately test and observe the effects of your permissions changes. Immediate testing is possible because your role and its original policy have already propagated to the data plane, enabling you to iterate over various scoped-down session policies that operate against the IAM data plane.

Use STS session policies to assume a role with the AWS CLI

There are two ways to provide a session policy during the AssumeRole process: you can provide an inline policy document or the Amazon Resource Names (ARNs) of managed session policies. The following example shows how to do this through the AWS Command Line Interface (AWS CLI), by passing in a policy document along with the AssumeRole call. If you use this example policy, make sure to replace <123456789012> and <DOC-EXAMPLE-BUCKET> with your own information.

$ aws sts assume-role \
 --role-arn arn:aws:iam::<123456789012>:role/s3-full-access
 --role-session-name getobject-only-exco
 --policy '{ "Version": "2012-10-17", "Statement": [ { "Action": [ "s3:GetObject" ], "Effect": "Allow", "Resource": "arn:aws:s3::: <DOC-EXAMPLE-BUCKET>/*" } ] }'

In this example, we provide a previously created role ARN named s3-full-access, which provides full access to Amazon Simple Storage Service (Amazon S3). We can further restrict the role’s permissions by supplying a policy with the optional --policy option. The inline policy document only allows the GetObject request against the S3 bucket named <DOC-EXAMPLE-BUCKET>. The effective permissions for the returned session are the intersection of the role’s identity-based policies and our provided session policy. Therefore, the role session’s permissions are limited to only performing the GetObject request against the <DOC-EXAMPLE-BUCKET>.

Note: The combined size of the passed inline policy document and all passed managed policy ARN characters cannot exceed 2,048 characters. You can reduce the size of the JSON policy document by removing unnecessary whitespace and shortening or removing tags associated with your session.

To learn more about session permissions, see Create fine-grained session permissions using IAM managed policies. In part two of this post, we will describe how you can use role sessions when you need to provide credentials at a high rate.

Use case 2: Monitor and respond to IAM resources for entire accounts

You might need to periodically audit the state of your IAM resources, such as roles and policies, including whether these IAM resources have changed, in a single account or across your entire organization. For example, you might want to check whether roles have overly broad access to actions and resources. Or you might want to monitor IAM resource creation and updates to respond to security-relevant permission changes. In this section, you will learn how to choose the right tool for auditing and monitoring IAM resources across accounts. You will learn about the AWS services that support this use case, the benefits of polling compared to event-based architectures, and powerful APIs that aggregate common information.

Respond to configuration changes with an event-driven approach

Sometimes you might need to perform actions relatively quickly based on IAM changes. For example, you might need to check if a trust policy for a newly created or updated role allows cross-account access. In cases like this, you can use AWS Config rules, AWS CloudTrail, or Amazon EventBridge to detect state changes and perform actions based on these state changes. You can use AWS Config rules to evaluate whether a resource complies with the conditions that you specify. If it doesn’t comply, you can provide a workflow to remediate the non-compliance. With CloudTrail, you can monitor your account’s API calls, and log API calls for your accounts with AWS Organizations integration. EventBridge works closely with CloudTrail and helps you create rules that match incoming events and send them to targets, such as Lambda, where your code can perform analysis or automated remediation. You can even filter out events from your accounts and send them to a central account’s event bus for processing. For an example of how to use EventBridge with IAM Access Analyzer to remediate cross-account access in a role’s trust policy, see Automate resolution for IAM Access Analyzer cross-account access findings on IAM roles. Which feature you choose depends on whether you need to monitor one account or all accounts in your organization, as well as which solution you are more comfortable building with.

One caveat to an event-driven approach is that if many events occur over a short period and your application responds to each event with an IAM API call of its own, you could eventually be throttled by IAM. To address this, you can queue up your responding API calls, distribute them over a longer period, or aggregate them to reduce API call volume. For example, if some of your calls are write APIs (such as UpdateAssumeRolePolicy or CreatePolicyVersion) or read APIs (such as GetRole or GetRolePolicy), you can call them serially with a delay between calls. If you need the latest status on a large number of principals and policies, you can call IAM bulk APIs such as GetAccountAuthorizationDetails, which will return data to you for principals and policies and their relationships in your organization. This approach helps you avoid throttling and querying the IAM control plane with unnecessary and redundant API calls. You will learn more about throttling and how to address it in part two of this post.

Retrieve point-in-time resource information with AWS Config

AWS Config helps you assess, audit, and evaluate the configuration of your AWS resources. It also offers multi-account, multi-Region data aggregation and is integrated with AWS Organizations. With AWS Config, you can create rules that detect and respond to changes. AWS Config also keeps an inventory of AWS resource configurations that you can query through its API, so that you don’t need to make direct API calls to each resource’s service. AWS Config also offers the ability to return the status of resources from multiple accounts and AWS Regions. As shown in Figure 1, you can use the AWS Config console to run a simple SQL-like statement for details on the IAM roles in your entire organization.

Figure 1: Run a query on IAM roles in AWS Config

Figure 1: Run a query on IAM roles in AWS Config

The preceding results also show associated resources, such as the inline and attached policies for the IAM roles. Alternatively, you can obtain these results from the SDK or CLI. The following query that uses the CLI is equivalent to the preceding query that uses the console. If you use this query, make sure to replace DOC-EXAMPLE-CONFIG-AGGREGATOR> with your AWS Config aggregator.

aws configservice select-aggregate-resource-config
--configuration-aggregator-name <DOC-EXAMPLE-CONFIG-AGGREGATOR>
--expression "SELECT accountId, resourceId, resourceName, resourceType, tags, configuration.attachedManagedPolicies, configuration.rolePolicyList WHERE resourceType = 'AWS::IAM::Role'"

Here is the response (note that we’ve adjusted the formatting to make it more readable):

{
  "accountId": "123456789012",
  "resourceId": "AROAI3X5HCEQIIEXAMPLE",
  "configuration": { 
    "attachedManagedPolicies": [
      {     
        "policyArn": "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole",
        "policyName": "AWSLambdaBasicExecutionRole"
      },    
      {     
        "policyArn": "arn:aws:iam::123456789012:policy/mchan-test-cloudtrail-post-to-SNS",
        "policyName": "mchan-test-cloudtrail-post-to-SNS"
      }     
      ],    
    "rolePolicyList": []
  },
  "resourceName": "lambda-cloudtrail-notifications",
  "tags": [],
  "resourceType": "AWS::IAM::Role"
}

The preceding command returns the details of roles in your organization’s accounts, including the full policy document for the associated inline policy. It also returns the customer-managed policy names and their ARNs, for which you can view the policy documents and versions by using the BatchGetResourceConfig API. Note that AWS Config doesn’t provide the AWS-managed policy documents. However, these are common across accounts, and we will show you how to query that data later in this section.

To query the status of roles in your organization, you need to have AWS Config enabled in each account. You also need an aggregator to monitor your accounts with your organization’s management account or a delegated administrator account. For more details on how to set up AWS Config, see the AWS Config developer guide. After you set up AWS Config, you can periodically call the AWS Config APIs to get a snapshot of the current or prior state of your resources. Furthermore, you can periodically pull the snapshot records and evaluate this information in other tools outside of AWS Config. So before you directly use the IAM APIs to get IAM information, consider using AWS Config—this is what it’s for!

Retrieve IAM resource information directly from IAM

As previously noted, AWS Config can give you a bulk view of your AWS and IAM resources. Additionally, CloudTrail and EventBridge can detect AWS and IAM resource changes and help you act on them. If you need data from IAM beyond what these services offer, you can query the IAM APIs directly to get the latest information on your resources.

A few key APIs can help you audit IAM resources more efficiently, especially in bulk. The first is GetAccountAuthorizationDetails, which enables you to retrieve the principals in your account, their associated inline policy documents (if any), attached managed policies, and their relationships to each other. This API reduces the need to individually call ListRolePolicies and ListAttachedRolePolicies for each role in an account. GetAccountAuthorizationDetails also returns the role trust policy document for roles in the results. Finally, GetAccountAuthorizationDetails allows you to filter the result set. For example, if you don’t need information relating to groups or AWS managed policies, you can exclude these from the API response. You can do this by using the filter parameter to only include the details that you need at the time.

Another useful API is GenerateServiceLastAccessedDetails. This API gives you details about when an IAM resource (user, group, role, or policy) was last used in an attempt to access AWS services. You can use this API to identify roles that are unused and remove them if you don’t need them. IAM Access Analyzer, which you will learn about later in this post, also uses the same information.

The following table summarizes the key APIs that you can use, rather than building your own code that loops for this information individually.

Type of information API How to use the API Frequency of use
User list and user detail GetAccountAuthorizationDetails Pass User to the filter parameter When needed, per account
User’s inline policy User’s inline policy GetAccountAuthorizationDetails Pass User to the filter parameter When needed, per account
User’s attached managed policies GetAccountAuthorizationDetails Pass User to the filter parameter When needed, per account
Role list and role detail GetAccountAuthorizationDetails Pass Role to the filter parameter When needed, per account
Role trust policy GetAccountAuthorizationDetails Pass Role to the filter parameter When needed, per account
Role’s inline policy GetAccountAuthorizationDetails Pass Role to the filter parameter When needed, per account
Role’s attached managed policies GetAccountAuthorizationDetails Pass Role to the filter parameter When needed, per account
Role last used GetAccountAuthorizationDetails Pass Role to the filter parameter When needed, per account
Group list and group detail GetAccountAuthorizationDetails Pass Group to the filter parameter When needed, per account
Group’s inline policy GetAccountAuthorizationDetails Pass Group to the filter parameter When needed, per account
Group’s attached managed policies GetAccountAuthorizationDetails Pass Group to the filter parameter When needed, per account
AWS customer managed policies GetAccountAuthorizationDetails Pass LocalManagedPolicy to the filter parameter When needed, per account
AWS managed policies GetAccountAuthorizationDetails Pass LocalManagedPolicy to the filter parameter 24 hours recommended, globally (once for all accounts within an AWS partition)
Policy versions GetAccountAuthorizationDetails Pass either LocalManagedPolicy or WSManagedPolicy to the filter parameter 24 hours recommended, per account
Services access attempts by an IAM resource GetServiceLastAccessedDetails Submit a job through the GenerateServiceLastAccessedDetails API, which returns a JobId; then retrieve the results after the job completes. Spread total number of requests evenly across 24 hours
Actions access attempts by an IAM resource GetServiceLastAccessedDetails Submit a job through the GenerateServiceLastAccessedDetails API which returns a JobId; then retrieve the results after the job completes. Pass ACTION_LEVEL as the required Granularity parameter. Spread total number of requests evenly across 24 hours

Note: In the table, we suggest that you perform some of these API requests once every 24 hours as a starting point. You might prefer to perform your own analysis at a longer time interval, such as every 48 hours, but we don’t recommend requesting it more often than every 24 hours because these resources (and therefore the details in the responses) don’t change often. These APIs are suitable for periodic, point-in-time collection of information. If you need faster detection of information from GetAccountAuthorizationDetails, consider whether AWS Config rules or EventBridge will fit your needs. For GetServiceLastAccessedDetails, recent activity usually appears within four hours, so more frequent requests are unlikely to provide much value.

Use of these APIs can help you avoid writing code that loops through results to make individual read API calls for each principal, policy, and policy version in an account, which could result in tens of thousands of API requests and call throttling. Instead of iterating over each resource, you should use solutions that return bulk data, such as GetAccountAuthorizationDetails, AWS Config, or an AWS Partner Network solution. However, if you’re experiencing throttling, you will learn some practical considerations on how to handle that later in this post.

Inspect IAM resources across multiple accounts and organizations

Your use case might require that you inspect IAM resources across multiple accounts in your organization. Or perhaps you are an independent software vendor and need to build a software-as-a-service tool to evaluate IAM resources across many organizations. The following considerations can help you address use cases like these.

AWS Organizations integration

Previously, you learned of the benefits of the “service last accessed data” that the GenerateServiceLastAccessedDetails and GetServiceLastAccessedDetails APIs provide. But what if you want to pull this data for multiple accounts in your organization? IAM has bulk APIs that support querying this data across your entire organization, so you don’t need to assume a role in each account to generate the request. To generate a report for entities (organization root, organizational unit, or account) or policies in your organization, use the GenerateOrganizationsAccessReport operation, which returns a JobId that is passed as a parameter to the GetOrganizationsAccessReport operation to check if the report has been generated. When the job status is marked complete, you can retrieve the report.

AWS managed policies

Many customers use AWS managed policies because they align to common job functions. AWS creates and administers these policies, which have their own ARNs, such as arn:aws:iam::aws:policy/AWSCodeCommitPowerUser. AWS managed policies are available for every account, and they are the same for every account. AWS updates them when new services and API operations are introduced. Updated policies are recorded and visible as a new version, so you only need to query for the current AWS managed policies once per evaluation cycle, rather than once per account. Therefore, if you’re evaluating hundreds or thousands of accounts, you shouldn’t include the AWS managed policies and their policy versions in your query. Doing so would result in thousands of redundant API requests and could cause throttling. Instead, you can query the AWS managed policies once and then reuse the results across your analysis and evaluation by caching the results for a period of time (for example, every 24 hours) in your application before requesting them again to check for updates. Because AWS managed policies are available through the GetAccountAuthorizationDetails API, you don’t need to query for the AWS managed policies or their versions as a separate action.

Multi-account limits

The preceding table lists the frequency of API requests as “per account” in many places. If you’re calling IAM APIs by assuming a role in other accounts from a central account, some IAM APIs have rate-limiting criteria that apply to API requests performed from the assuming account (the central account). To query data from multiple accounts, we recommend that you serially iterate over the accounts one-by-one to avoid throttling. You’ll learn more about this strategy, as well as throttling, in part 2 of this blog post.

Conclusion

In this post, you learned about different aspects of IAM and best practices to test and query IAM efficiently. With STS session policies, you can test different policies to help achieve least privilege access. With AWS Config, EventBridge, CloudTrail, and CloudTrail Lake, you can audit your IAM resources and respond to changes while reducing the number of IAM API calls that you make. If you need to call IAM directly, you can use IAM bulk APIs for more efficient retrieval of your resource state. You can learn more about IAM and best practices in part 2 of blog post: How to monitor and query IAM resources at scale – Part 2.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the Security, Identity, & Compliance re:Post or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Michael Chan

Michael Chan

Michael is a Senior Solutions Architect for AWS Identity who has advised financial services and global customers of AWS. He enjoys understanding customer problems with identity and access management and helping them solve their security issues at scale.

Author photo

Joshua Du Lac

Josh is a Senior Manager of Security Solutions Architects at AWS. Based out of Texas, he has advised dozens of enterprise, global, and financial services customers to accelerate their journey to the cloud while improving their security along the way. Outside of work, Josh enjoys searching for the best tacos in Texas and practicing his handstands.