Federate access to Amazon SageMaker Unified Studio with AWS IAM Identity Center and Ping Identity

Post Syndicated from Raghavarao Sodabathina original https://aws.amazon.com/blogs/big-data/federate-access-to-amazon-sagemaker-unified-studio-with-aws-iam-identity-center-and-ping-identity/

With an identity provider (IdP), you can manage your user identities outside of AWS and give these external user identities permissions to use AWS resources in your AWS accounts. External IdPs, such as Ping Identity, can integrate with AWS IAM Identity Center to be the source of truth for Amazon SageMaker Unified Studio. SageMaker Unified Studio also supports trusted identity propagation for SQL analytics, including Amazon Athena and Amazon Redshift.

SageMaker Unified Studio provides an integrated experience to use your data and tools for analytics and AI. You can use SageMaker Unified Studio to discover your data and put it to work using familiar AWS analytics and machine learning (ML) services for model development, generative AI, big data processing, and SQL analytics, assisted by Amazon Q Developer. By default, SageMaker domains support AWS Identity and Access Management (IAM) user credentials. You can also enable access to SageMaker domains in SageMaker Unified Studio for users with single sign-on (SSO) with IAM Identity Center and direct SAML integration with SageMaker Unified Studio.

Users can access SageMaker Unified Studio with their existing corporate credentials. With IAM Identity Center, administrators can connect their existing external IdPs and continue to manage users and groups in those existing identity systems, which can then be synchronized with IAM Identity Center using System for Cross-domain Identity Management (SCIM).

In this post, we show how to set up workforce access with SageMaker Unified Studio using Ping Identity as an external IdP with IAM Identity Center.

Solution overview

We walk through the following high-level steps to implement this solution:

  1. Enable IAM Identity Center.
  2. Create a SageMaker Unified Studio domain.
  3. Set up your IdP (for this example, Ping Identity).
  4. Connect Ping Identity and IAM Identity Center.
  5. Set up automatic provisioning of users and groups in IAM Identity Center.
  6. Configure SageMaker Unified Studio SSO user access.

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account with IAM Identity Center enabled. It is recommended to use an organization-level IAM Identity Center instance for best practices and centralized identity management across your AWS organization.
  • A Ping Identity account.
  • A browser with network connectivity to Ping Identity and SageMaker Unified Studio.

Enable IAM Identity Center

To enable IAM Identity Center, follow the instructions in Enable IAM Identity Center.

Create a SageMaker Unified Studio domain

To create a SageMaker Unified Studio domain, refer to the instructions in Create a Amazon SageMaker Unified Studio domain – manual setup.

On the SageMaker console, go to the domain details and copy the Amazon Resource Name (ARN) under Domain ARN. You will use this value when you add your trust policy and when you connect your IAM IdP to your Ping Identity instance.
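Because the Domain ARN is reused in later steps, it can help to sanity-check the value you copied. The following is a minimal Python sketch, assuming the general ARN layout arn:partition:service:region:account-id:resource. The sample ARN (including its service name, account ID, and domain ID) is a hypothetical placeholder, not a value from this walkthrough.

```python
from typing import NamedTuple


class Arn(NamedTuple):
    partition: str
    service: str
    region: str
    account_id: str
    resource: str


def parse_arn(arn: str) -> Arn:
    """Split an ARN into its colon-separated parts.

    The resource part may itself contain colons or slashes, so split at
    most five times and keep the remainder intact.
    """
    parts = arn.strip().split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not a valid ARN: {arn!r}")
    _, partition, service, region, account_id, resource = parts
    return Arn(partition, service, region, account_id, resource)


# Hypothetical example value; replace it with the Domain ARN you copied.
sample = "arn:aws:datazone:us-east-1:111122223333:domain/dzd_example123"
domain = parse_arn(sample)
print(domain.region, domain.account_id, domain.resource)
```

A check like this catches truncated or partially copied values before you paste them into later configuration steps.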

Set up your IdP (Ping Identity)

In this section, we walk through the procedure to set up your IdP (for this example, Ping Identity).

Create an environment in Ping Identity

Complete the following steps to create an environment for Ping Identity:

  1. Log in to your Ping Identity account.
  2. Choose Create Environment.
  3. Choose Create a Customer Solution.
  4. In the Tailor your experiences pop-up, choose Skip.
    Create an environment in Ping Identity

Create a group in Ping Identity

Complete the following steps to create a group in Ping Identity:

  1. On the Environments page, choose Manage Environments.
  2. In the navigation pane, choose Directory, then choose Groups.
  3. Choose the plus sign to add a group.
  4. For Group Name, enter sagemaker
  5. For Description, enter an optional description (for example, Amazon SageMaker Unified Studio).
  6. For Population, choose Default.
  7. Choose Save.
    Create a group in Ping Identity
  8. On the Roles tab for the sagemaker group, assign the Environment Admin role to the group.
    Assigning roles for the sagemaker group

Create a user in Ping Identity

Complete the following steps to create a user:

  1. In the navigation pane, choose Directory, then choose Users.
  2. Choose the plus sign to create a user.
  3. Provide values for Given name, Family name, Username, and Email.
  4. For Password, choose First time password.
  5. Choose Save.

You can add more users as needed.

Assign group to user

Complete the following steps to assign your group to your user:

  1. In the navigation pane, choose Directory, then choose Groups.
  2. Choose the sagemaker group you created.
  3. On the Users tab, choose the plus sign to add a user.
  4. Add the user you created.

Connect Ping Identity and IAM Identity Center

To configure the integration between Ping Identity and IAM Identity Center, you need access to both management consoles. Although Ping Identity’s application catalog includes IAM Identity Center, we recommend configuring a standard SAML application for greater control over settings and attribute mappings.

Complete the following steps:

  1. Go to the Ping Identity environment you created and choose Applications in the navigation pane.
  2. Choose the plus sign to add an application:
    1. For Application name, enter a name (for this example, we use unifiedstudio).
    2. For Description, enter an optional description.
    3. For Application Type, choose SAML Application.
    4. Choose Configure.

    Creating a SAML app integration in Ping Identity

  3. Sign in to the IAM Identity Center console as a user with administrative privileges.
  4. In the navigation pane, choose Settings to update your settings:
    1. On the Identity source tab, choose Change identity source on the Actions dropdown menu.
      Selecting identity source in AWS IAM Identity Center
    2. For Choose identity source, select External identity provider, then choose Next.

      Choosing External Identity provider in AWS IAM Identity Center

    3. In the Service provider metadata section, choose Download metadata file to download the IAM Identity Center metadata file.

      You will use this service provider metadata file in the next step when you connect Ping Identity with IAM Identity Center.

    Downloading service provider metadata from AWS IAM Identity Center

  5. Return to the Ping Identity console and the SAML application page.
  6. In the SAML Configuration section, select Import Metadata, upload the metadata file you downloaded, then choose Save.

    Importing service provider metadata into Ping Identity

  7. On the Overview tab of the application page, choose Download Metadata under Connection details to download the Ping Identity IdP metadata.
    You will use this for the SAML configuration in IAM Identity Center to set up Ping Identity as an IdP in the next step.

    Downloading Identity provider metadata from Ping Identity

  8. Return to the IAM Identity Center console and continue configuring your identity source:
    1. In the Identity provider metadata section, choose Choose file under IdP SAML metadata, upload the metadata file you downloaded from Ping Identity, then choose Next.

      Configuring Ping Identity as Identity Provider in AWS IAM Identity Center

    2. Choose Accept to accept the disclaimer.
    3. Choose Change identity source.
  9. Return to the Ping Identity console to complete the SAML configuration.
  10. On the Configuration tab, choose the edit icon to update the configuration:
    1. For Sign, choose Sign Assertion & Response.
    2. For Subject Name ID, enter urn:oasis:names:tc:SAML:1.1:nameid-format:emailAddress.
    3. For Assertion Validity Duration, enter 300.
    4. Leave the remaining values as default.

    Ping Identity SAML Configurations

  11. On the Attributes tab, choose the edit icon.
  12. Choose +Add to add two attribute mappings:
    1. Map the attribute saml-subject to Username, and leave Name format as default.
    2. Map the attribute https://aws.amazon.com/SAML/Attributes/PrincipalTag:Email to Email Address, and set Name format to Unspecified.
    3. Choose Save.

    Ping Identity SAML attributes mapping

  13. On the PingOne Policies tab, select Single Factor, then choose Save.
    This post uses single-factor authentication for demonstration purposes only. In your environments, follow your organization’s security standards and governance framework.

    Ping Identity policy configuration

  14. On the Access tab, search for the sagemaker group under Group Membership Policy, and assign the unifiedstudio SAML application to the group.
  15. Enable the application.
    Enabling Ping Identity SAML application
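To make the Assertion Validity Duration setting more concrete, the following Python sketch models how a 300-second window bounds the time during which an assertion is accepted. It assumes the window starts at the moment the assertion is issued and ignores clock-skew tolerance, which real identity and service providers often add. The function names and the timestamp are hypothetical and are not part of Ping Identity or IAM Identity Center.

```python
from datetime import datetime, timedelta, timezone

# Value entered for Assertion Validity Duration in the preceding steps (seconds).
ASSERTION_VALIDITY_SECONDS = 300


def validity_window(issue_instant, validity_seconds=ASSERTION_VALIDITY_SECONDS):
    """Return (not_before, not_on_or_after) for an assertion issued at issue_instant."""
    return issue_instant, issue_instant + timedelta(seconds=validity_seconds)


def is_assertion_fresh(now, issue_instant, validity_seconds=ASSERTION_VALIDITY_SECONDS):
    """True if now falls inside the window: not_before <= now < not_on_or_after."""
    not_before, not_on_or_after = validity_window(issue_instant, validity_seconds)
    return not_before <= now < not_on_or_after


# Hypothetical issue time, used only for illustration.
issued = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(is_assertion_fresh(issued + timedelta(seconds=120), issued))  # True
print(is_assertion_fresh(issued + timedelta(seconds=300), issued))  # False
```

A short window limits how long a captured assertion could be replayed, at the cost of being less forgiving of clock drift between the two systems.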

Set up automatic provisioning of users and groups from Ping Identity into IAM Identity Center

To configure the automatic provisioning of users and groups between Ping Identity and IAM Identity Center through SCIM, you must have access to both management consoles. Complete the following steps:

  1. On the IAM Identity Center console, choose Settings in the navigation pane.
  2. In the Automatic provisioning section, choose Enable.
    Enabling automatic provisioning in AWS IAM Identity Center

    This enables automatic provisioning in IAM Identity Center and displays the necessary SCIM endpoint and access token information.

  3. In the Inbound automatic provisioning dialog box, copy the values for SCIM endpoint and Access token, then choose Close.
    You will use these values to configure provisioning in Ping Identity in the next step.

    Automatic provisioning configuration parameters in IAM Identity Center

    This completes the setup process in IAM Identity Center.

  4. Log in to the Ping Identity console.
  5. In the navigation pane, choose Integrations, then choose Provisioning.
  6. Choose the plus sign to add a new connection.
    Creating a new SCIM connection
  7. For Choose a connection type, choose Select next to Identity Store.
    Choosing connection type
  8. Provide a name (for this example, we use Identitycenter) and an optional description, then choose Next.
    Creating new connection
  9. Under Configuration Authentication, provide the following configuration:
    1. For SCIM BASE URL, enter the SCIM endpoint from IAM Identity Center.
    2. For Authentication Method, choose OAuth 2 Bearer Token.
    3. For Oauth Access Token, enter the access token from IAM Identity Center.
    4. For Auth Type Header, choose Bearer (default option).
    5. Choose Test Connection to validate the connection between Ping Identity and IAM Identity Center, then choose Next.

    Configuring authentication between Ping Identity and IAM Identity Center

  10. Under Configuration Preference, provide the following configuration:
    1. For User Filter Expression, enter userName Eq "%s".
    2. For Group Membership Handling, select Merge.
    3. Leave the remaining settings as default and choose Save.

    SCIM connection preferences

  11. On the Provisioning tab, choose the plus sign, then choose New Rule to create a rule for the SCIM connection.
    Creating a new SCIM rule
  12. Enter a name (for this example, unifiedstudio) and an optional description, then choose Create Rule.
  13. Under the newly created rule, choose the plus sign next to Available Connections to add the connection Identitycenter, then choose Save.
  14. Edit the user filter:
    1. For Attribute, choose Enabled.
    2. For Operator, choose Equals.
    3. For Value, choose true.
    4. Choose Save.

    User Filter attributes mapping

  15. Choose the edit icon next to Attribute Mapping and set the attribute mappings as shown in the following screenshot:
    1. Delete the Primary Phone attribute mapping because it’s optional in AWS. Leaving this field blank can cause Ping Identity’s SCIM connector to generate errors during user provisioning.
    2. Add a new attribute called Username under PingOne Directory and then map to displayName under Identitycenter.

    Attributes mapping between Ping Identity SCIM and AWS IAM Identity Center

  16. Under Group Provisioning, choose the sagemaker group if you want to sync all sagemaker group users with auto provisioning.
    1. In the pop-up, select I understand and want to continue, then choose Save.

    Assigning groups to SCIM rule

  17. On the Provisioning page, choose the Connections tab.
  18. Enable the SCIM connection Identitycenter and rule unifiedstudio.

    Enabling the SCIM connection

    Enabling the SCIM rule

This completes the SCIM setup process between Ping Identity and IAM Identity Center.
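To illustrate what the SCIM connection works with, the following Python sketch fills in the user filter expression and builds a user resource in the SCIM 2.0 core schema (RFC 7643). It is a simplified illustration under stated assumptions, not the request Ping Identity actually sends: the helper names and the sample user are hypothetical, the filter template is written with straight quotes, and the exact set of attributes that IAM Identity Center requires should be checked against its SCIM documentation.

```python
import json
from urllib.parse import quote

# Template entered for User Filter Expression; %s is replaced with the user name.
USER_FILTER_TEMPLATE = 'userName Eq "%s"'


def build_user_filter(user_name: str) -> str:
    """Fill the filter template, escaping backslashes and double quotes."""
    escaped = user_name.replace("\\", "\\\\").replace('"', '\\"')
    return USER_FILTER_TEMPLATE % escaped


def build_scim_user(user_name, given_name, family_name, email):
    """Build a SCIM 2.0 core User resource as a plain dict."""
    return {
        "schemas": ["urn:ietf:params:scim:schemas:core:2.0:User"],
        "userName": user_name,
        # Mirrors the Username to displayName attribute mapping in the rule.
        "displayName": user_name,
        "name": {"givenName": given_name, "familyName": family_name},
        "emails": [{"value": email, "type": "work", "primary": True}],
        # Only enabled users pass the rule's user filter (Enabled Equals true).
        "active": True,
    }


# Hypothetical user, for illustration only.
user_filter = build_user_filter("jdoe@example.com")
query = "Users?filter=" + quote(user_filter)
payload = build_scim_user("jdoe@example.com", "Jane", "Doe", "jdoe@example.com")
print(query)
print(json.dumps(payload, indent=2))
```

The filter is what lets a connector look up an existing user by userName before deciding whether to create or update it.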

Configure SageMaker Unified Studio SSO user access

Complete the following steps to configure SSO user access to SageMaker Unified Studio for your SageMaker domain:

  1. On the SageMaker console, choose Domains in the navigation pane.
  2. Choose the domain for which you want to configure SAML user access.
  3. On the domain details page, you can find the SSO configuration in two locations:
    1. From the main domain view, choose Configure next to Configure SSO user access.
    2. Alternatively, scroll down to the User management tab and choose Configure SSO user access.

    SageMaker Unified Studio SSO configuration

  4. On the Choose user authentication method page, select IAM Identity Center, then choose Next.
    Choosing authentication
  5. For Choose user and group assignment method, choose from the following options, then choose Next:
    1. Require assignments: Users and groups must be explicitly added to the domain to gain access. This provides more granular control over who can access the domain.
    2. Do not require assignments: All authorized Ping Identity users and groups can access this domain if they have been assigned to the SAML application in Ping Identity.

    For either option, users or groups must have access to the Ping Identity SAML application (unifiedstudio in this example) to authenticate successfully.

    SageMaker Unified Studio SAML configuration

  6. On the Review and save page, review your choices and choose Save. These settings can’t be changed after you save them.
    Review and confirm SAML configuration
  7. If you’ve chosen to require assignments, use the Add users and groups section to add SAML users and groups to your domain.
    Add users and groups to SageMaker Unified Studio domain

Now, users will be able to access SageMaker Unified Studio using the domain URL with their SSO credentials.
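The difference between the two assignment methods can be summarized as a small decision model. The following Python sketch is only a mental model of the behavior described in the preceding steps, not an AWS API or the service's actual implementation. All class, function, user, and group names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DomainAccessConfig:
    require_assignments: bool
    assigned_users: set = field(default_factory=set)
    assigned_groups: set = field(default_factory=set)


def can_access_domain(user, user_groups, app_users, app_groups, config):
    """Model of the two assignment methods; not an AWS API."""
    groups = set(user_groups)
    # Either method: the user must reach the SAML application, directly or through a group.
    in_app = user in app_users or bool(groups & set(app_groups))
    if not in_app:
        return False
    # Do not require assignments: access to the SAML application is enough.
    if not config.require_assignments:
        return True
    # Require assignments: the user or one of their groups must also be added to the domain.
    return user in config.assigned_users or bool(groups & config.assigned_groups)


app_groups = {"sagemaker"}
open_config = DomainAccessConfig(require_assignments=False)
strict_config = DomainAccessConfig(require_assignments=True, assigned_groups={"sagemaker"})

print(can_access_domain("alice", {"sagemaker"}, set(), app_groups, open_config))    # True
print(can_access_domain("bob", {"finance"}, set(), app_groups, open_config))        # False
print(can_access_domain("carol", {"finance"}, {"carol"}, app_groups, strict_config))  # False
```

In the last call, carol can authenticate through the SAML application but is not added to the domain, so the stricter method denies access.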

You can explore different projects for your users and assign those projects based on your IdP user groups for fine-grained access controls. For example, you can create different SAML user groups in Ping Identity based on job function, assign those groups to the unifiedstudio SAML application in Ping Identity, and then assign them to their respective project profiles in SageMaker Unified Studio. To assign project profiles to their respective groups, choose the Project profiles tab and choose your project profile. On the Authorized users and groups page, choose Add, then choose SSO groups. Choose the Add users and groups button to complete the project profile assignment.

Assigning a project profile to Ping Identity group

Validate access with Ping Identity users

Complete the following steps to validate access:

  1. On the SageMaker domain details page, choose the link for the SageMaker Unified Studio URL.
    Validating Ping Identity user access with Amazon SageMaker Unified Studio
  2. Log in with your user credentials.
    After successful login, you will be redirected to the SageMaker Unified Studio home page. Here, you can explore different projects for your users and assign those projects based on your SAML user groups for fine-grained access control.

    SAML authenticated Amazon SageMaker Unified Studio

  3. To assign an authorization policy, choose Govern and then Domain units.
  4. Choose your SageMaker domain, then choose a suitable authorization policy. For this example, we choose Project creation policy.
    Amazon SageMaker Unified Studio authorization policies
  5. Choose Add policy grant to assign user groups or users to their respective project profiles.
    Amazon SageMaker Unified Studio authorization policies assignment

You have successfully federated SageMaker Unified Studio with Ping Identity as an IdP with IAM Identity Center. You can connect to SageMaker Unified Studio by using your Ping Identity credentials.

Clean up

After you test out this solution, remember to delete the resources you created to avoid incurring future charges. For instructions to delete your SageMaker Unified Studio domain, refer to Delete domains. If you want to delete your Ping Identity account, reach out to Ping Identity for assistance.

Conclusion

In this post, we demonstrated how to set up Ping Identity as an IdP over SAML authentication for SageMaker Unified Studio access through IAM Identity Center federation. To learn more, refer to the Amazon SageMaker Unified Studio User Guide, which provides guidance on how to build data and AI applications using SageMaker.


About the authors

Raghavarao Sodabathina

Raghavarao is a Principal Solutions Architect at AWS, focusing on data analytics, AI/ML, and cloud security. He engages with customers to create innovative solutions that address customer business problems and accelerate the adoption of AWS services. In his spare time, Raghavarao enjoys spending time with his family, reading books, and watching movies.

Matt Nispel

Matt is an Enterprise Solutions Architect at AWS. He has more than 10 years of experience building cloud architectures for large enterprise companies. At AWS, Matt helps customers rearchitect their applications to take full advantage of the cloud. Matt lives in Minneapolis, Minnesota, and in his free time enjoys spending time with friends and family.

Himanshu Sarda

Himanshu is a Solutions Architect at AWS who specializes in generative AI and autonomous agent architectures, helping enterprise customers revolutionize their businesses through cutting-edge AI solutions. When not pioneering AI innovations, Himanshu recharges by exploring the outdoors and creating memories with family and friends.

Nicholaus Lawson

Nicholaus is a Solutions Architect at AWS and part of the AI/ML specialty group. He has a background in software engineering and AI research. Outside of work, Nicholaus is often coding, learning something new, or woodworking.

Krupanidhi Jay

Krupanidhi is a Boston-based Enterprise Solutions Architect at AWS. He is a seasoned architect with over 20 years of experience in helping customers with digital transformation and delivering seamless digital user experiences. He enjoys working with customers to help them build scalable, cost-effective solutions in AWS. Outside of work, Jay enjoys spending time with family and traveling.

Federate access to SageMaker Unified Studio with AWS IAM Identity Center and Okta

Post Syndicated from Raghavarao Sodabathina original https://aws.amazon.com/blogs/big-data/federate-access-to-sagemaker-unified-studio-with-aws-iam-identity-center-and-okta/

Many organizations use an external identity provider (IdP) to manage user identities. With an IdP, you can manage your user identities outside of AWS and give these external user identities permissions to use AWS resources in your AWS accounts. External IdPs, such as Okta Universal Directory, can integrate with AWS IAM Identity Center to be the source of truth for Amazon SageMaker Unified Studio.

Amazon SageMaker Unified Studio supports a single sign-on (SSO) experience with AWS IAM Identity Center authentication. Users can access Amazon SageMaker Unified Studio with their existing corporate credentials. With AWS IAM Identity Center, administrators can connect their existing external identity providers and continue to manage users and groups in existing identity systems such as Okta, which can then be synchronized with AWS IAM Identity Center using System for Cross-domain Identity Management (SCIM).

This post provides step-by-step guidance to set up workforce access to Amazon SageMaker Unified Studio using Okta as an external identity provider with AWS IAM Identity Center.

Prerequisites

Before you start, make sure you have:

  1. An AWS account with AWS IAM Identity Center enabled. It is recommended to use an organization-level AWS IAM Identity Center instance for best practices and centralized identity management across your AWS organization.
  2. An Okta account with users and a group
  3. A browser with network connectivity to Okta and Amazon SageMaker Unified Studio

Solution Overview

The steps in this post are structured into the following sections:

  1. Enable AWS IAM Identity Center
  2. Create an Amazon SageMaker domain
  3. Set up Okta users and groups
  4. Configure SAML in Okta for AWS IAM Identity Center
  5. Configure Okta as an identity provider in AWS IAM Identity Center
  6. Connect AWS IAM Identity Center to Okta
  7. Set up automatic provisioning of users and groups in AWS IAM Identity Center
  8. Complete Okta Configuration
  9. Configure Amazon SageMaker Unified Studio for SSO
  10. Test the setup
  11. Cleanup

Enable AWS IAM Identity Center

To enable AWS IAM Identity Center, follow the instructions in Enable IAM Identity Center in the AWS IAM Identity Center User Guide.

Create an Amazon SageMaker domain

  1. Sign in to the AWS Management Console and navigate to the Amazon SageMaker console. To create a new Amazon SageMaker Unified Studio domain, follow the instructions in Create a Amazon SageMaker Unified Studio domain – manual setup.
  2. On the Amazon SageMaker domain Summary page, copy the Domain ARN and save the value for later use, as shown in Figure 1.

Screenshot of Amazon SageMaker domain summary page showing Domain ARN field
Figure 1: Amazon SageMaker Domain

Set up Okta users and groups

Step 1: Sign up for an Okta account

  • Sign up for an Okta account, then choose the Sign up button to complete your account setup.
  • If you already have an account with Okta, log in to your Okta account.

Step 2: Create Groups in Okta

  • Choose Directory in the left menu, then choose Groups.
  • Choose Add Group and enter unifiedstudio as the name. Then choose Save.

Screenshot of Okta group creation interface with unifiedstudio group name entered
Figure 2. Creating a group in Okta

Step 3: Create users in Okta

  • Under Directory in the left menu, choose People, then choose +Add Person.
  • Provide the First name, Last name, Username (email ID), and Primary email. Then select I will set password and provide a first-time password. Choose Save to create your user.
  • Add more users as needed.

Step 4: Assign Groups to users

  • Choose Groups from the left menu, then choose the unifiedstudio group created in Step 2.
  • Use Assign People to add users to the unifiedstudio group. Choose + for each user you want to add.

Configure SAML in Okta

  1. Log in to your Okta domain and choose Applications from the left menu. Choose Applications, then choose Browse App Catalog.
  2. In the search box, enter AWS IAM Identity Center, choose the AWS IAM Identity Center app, and then choose + Add Integration.
    The following image shows the SAML app integration setup:
    Screenshot of Okta application catalog showing AWS IAM Identity Center app selection
    Figure 3. Creating a SAML app integration in Okta
  3. For this example, we create an application called “unifiedstudio”. Under General Settings: Required, enter the following:
    • Application label: Replace IAM Identity Center with unifiedstudio, then choose Save.
  4. Under the Sign On menu, copy the Metadata URL under the SAML 2.0 section, then open the Metadata URL in a new browser window to download the Okta identity provider metadata and save it as metadata.xml. You will use this for the SAML configuration in AWS IAM Identity Center to set up Okta as an identity provider. The following image shows where to find the metadata URL:

    Screenshot of Okta SAML settings showing metadata URL
    Figure 4: Downloading Okta identity provider metadata for SAML configuration

  5. Choose More details and copy the Sign on URL into a text file; you will use this for the SAML configuration in Amazon SageMaker Unified Studio.

You are now ready to move to the AWS IAM Identity Center console to create an identity provider integration for your Okta instance.

Configure Okta as an identity provider in AWS IAM Identity Center

  1. Sign in to the AWS IAM Identity Center console as a user with administrative privileges.
  2. In the left navigation menu, choose Settings, open the Identity source tab, and then choose Change identity source from the Actions dropdown, as shown in Figure 5.
    Screenshot of AWS IAM Identity Center settings page showing Change Identity source option
    Figure 5: Selecting identity source in AWS IAM Identity Center
  3. Under Identity source, choose External identity provider, as shown in Figure 6.
    Screenshot showing External Identity provider selection in AWS IAM Identity Center
    Figure 6: Choosing External Identity provider in AWS IAM Identity Center
  4. In the Configure external identity provider section, under Service provider metadata, do the following. You will need these configuration parameters in the next section.
    • Choose Download metadata file to download the AWS IAM Identity Center metadata file and save it on your system.
    • Copy the following service provider metadata values into a text file:
      1. IAM Identity Center Assertion Consumer Service (ACS) URL
      2. IAM Identity Center issuer URL
  5. In the Identity provider metadata section, under IdP SAML metadata, choose Choose file, upload the metadata.xml file that you downloaded from Okta in the previous section, and then choose Next, as shown in Figure 7.

    Screenshot of AWS IAM Identity Center external identity provider configuration showing metadata file upload

    Figure 7. Configuring Okta as Identity Provider in AWS IAM Identity Center

  6. After you read the disclaimer and are ready to proceed, enter ACCEPT, then choose Change identity source to complete the setup of Okta as an identity provider in IAM Identity Center.
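If you prefer to read the ACS URL and issuer URL from the downloaded service provider metadata file instead of copying them from the console, a short script can extract them. The following Python sketch assumes the file follows the standard SAML 2.0 metadata schema and that the issuer URL corresponds to the entityID attribute. The inline XML is a hypothetical, trimmed-down stand-in for the real file, and its URLs are placeholders.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the SAML 2.0 metadata specification.
MD_NS = {"md": "urn:oasis:names:tc:SAML:2.0:metadata"}

# Hypothetical stand-in for the downloaded metadata file.
SAMPLE_SP_METADATA = """
<md:EntityDescriptor xmlns:md="urn:oasis:names:tc:SAML:2.0:metadata"
                     entityID="https://example.com/saml/issuer/placeholder">
  <md:SPSSODescriptor protocolSupportEnumeration="urn:oasis:names:tc:SAML:2.0:protocol">
    <md:AssertionConsumerService index="0"
        Binding="urn:oasis:names:tc:SAML:2.0:bindings:HTTP-POST"
        Location="https://example.com/saml/acs/placeholder"/>
  </md:SPSSODescriptor>
</md:EntityDescriptor>
"""


def read_sp_metadata(xml_text):
    """Return (issuer, acs_urls) from SAML service provider metadata."""
    root = ET.fromstring(xml_text.strip())
    issuer = root.get("entityID")
    acs_urls = [
        element.get("Location")
        for element in root.findall(
            "md:SPSSODescriptor/md:AssertionConsumerService", MD_NS
        )
    ]
    return issuer, acs_urls


issuer, acs_urls = read_sp_metadata(SAMPLE_SP_METADATA)
print("Issuer:", issuer)
print("ACS URLs:", acs_urls)
```

To use it with the real file, read the file contents into a string and pass that string to read_sp_metadata.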

Connect AWS IAM Identity Center to Okta

  1. Sign in to Okta and go to the admin console.
  2. In the left navigation pane, choose Applications, then choose the Okta application called unifiedstudio that you created in the previous section.
  3. In Sign On, choose Edit to complete the SAML configuration. Under Advanced Sign-on Settings, enter the following, then choose Save to complete the configuration, as shown in Figure 8.
    1. For AWS SSO ACS URL, enter the IAM Identity Center Assertion Consumer Service (ACS) URL.
    2. For AWS SSO issuer URL, enter the IAM Identity Center issuer URL.
    3. For Application username format, choose Okta username from the dropdown.

Screenshot of Okta advanced sign-on settings showing AWS SSO configuration fields
Figure 8. Configuring Okta sign-on settings

Set up automatic provisioning of users and groups

In the AWS IAM Identity Center console, on the Settings page, locate the Automatic provisioning information box, then choose Enable, as shown in Figure 9. This enables automatic provisioning and displays the values you need to configure it.

Screenshot of AWS IAM Identity Center automatic provisioning enable option

Figure 9. Enabling automatic provisioning in AWS IAM Identity Center

In the Inbound automatic provisioning dialog box, copy each of the following values, as shown in Figure 10, then choose Close:

    • SCIM endpoint
    • Access token

You will use these values to configure provisioning in Okta in the next step.

Screenshot of AWS IAM Identity Center inbound automatic provisioning dialog showing SCIM endpoint and access token
Figure 10. Automatic provisioning configuration parameters in AWS IAM Identity Center
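To show how the two copied values fit together, the following Python sketch builds (but does not send) a SCIM request that passes the access token as a bearer token. It is a sketch under stated assumptions: the endpoint and token are hypothetical placeholders, the helper name is ours, and the /Users resource path comes from the SCIM 2.0 protocol (RFC 7644) rather than from this walkthrough.

```python
import urllib.request


def build_list_users_request(scim_endpoint: str, access_token: str) -> urllib.request.Request:
    """Build a GET request for the SCIM Users resource with bearer authentication."""
    base = scim_endpoint.rstrip("/")
    return urllib.request.Request(
        f"{base}/Users",
        method="GET",
        headers={"Authorization": f"Bearer {access_token}"},
    )


# Hypothetical placeholders; use the SCIM endpoint and access token you copied.
request = build_list_users_request(
    "https://scim.example.com/tenant-id/scim/v2/", "EXAMPLE_TOKEN"
)
print(request.get_method(), request.full_url)

# Sending the request requires a real endpoint and token:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Treat the access token like a password: keep it out of source control and logs, which is why the sketch prints only the method and URL.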

Complete the Okta integration

  1. Sign in to Okta and go to the admin console.
  2. In the left navigation pane, choose Applications, then choose the Okta application called unifiedstudio that you created earlier.
  3. On the Provisioning tab, choose Edit to complete automatic provisioning between Okta and AWS IAM Identity Center.
    • Under Settings, choose Integration, choose Configure API integration, and then select Enable API integration to enable provisioning. Enter the following using the SCIM provisioning values that you copied from AWS IAM Identity Center in the previous step, as shown in Figure 11:

      For Base URL, enter the SCIM endpoint from IAM Identity Center.
      For API Token, enter the access token from IAM Identity Center.
      For Import Groups, select the Import groups option.

    Then choose Test API Credentials to validate the SCIM provisioning, and choose Save.

    Screenshot of Okta provisioning settings showing API integration configuration with SCIM endpoint and token fields

    Figure 11: Automatic provisioning configuration in Okta

  4. On the Provisioning tab, under Settings in the navigation pane, choose To App. Choose Edit, enable all options (Create Users, Update User Attributes, and Deactivate Users) as shown in Figure 12, then choose Save.

    Screenshot of Okta provisioning To App settings showing user management options

    Figure 12: Enabling Automatic provisioning configuration in Okta

  5. In the Assignments tab, choose Assign, and then Assign to Groups.
    • Select the unifiedstudio group and choose Assign. Keep the defaults in the pop-up, then choose Done to complete the group assignment, as shown in Figure 13.

    Screenshot of Okta group assignment interface showing unifiedstudio group selection
    Figure 13: Assigning unifiedstudio group to SAML application called unifiedstudio

  6. On the Push Groups tab, in the Push Groups drop-down list, select Find groups by name, as shown in Figure 14.

    Screenshot of Okta Push Groups interface showing Find groups by name option

    Figure 14: Choosing Okta groups to push them to AWS IAM Identity Center

    • Select the unifiedstudio group, keep the default Push group memberships immediately option, then choose Save, as shown in Figure 15.

    Screenshot of Okta push groups settings showing unifiedstudio group configuration

    Figure 15: Pushing Okta groups to AWS IAM Identity Center

Return to AWS IAM Identity Center. You should see the Okta group and Okta users listed under groups and users in AWS IAM Identity Center, as shown in Figure 16.

Screenshot of AWS IAM Identity Center showing Okta users and groups synchronized from external identity provider

Figure 16: Okta user groups in AWS IAM Identity Center

Configure SageMaker Unified Studio for SSO

In this step, you will configure SSO user access to Amazon SageMaker Unified Studio for your Amazon SageMaker platform domain.

  1. Navigate to the Amazon SageMaker management console.
  2. In the left navigation menu, select Domains.
  3. Choose the Domain from the list for which you want to configure SAML user access.
  4. On the domain’s details page, choose Configure next to Configure SSO user access.
    Screenshot of Amazon SageMaker domain details page showing Configure SSO user access option
    Figure 17: Amazon SageMaker Unified Studio SSO configuration
  5. On the Choose user authentication method page, choose IAM Identity Center. With IAM Identity Center, users configured through external identity providers (IdPs) can access the domain’s Amazon SageMaker Unified Studio. Choose Next.
    Screenshot of SageMaker authentication method selection showing IAM Identity Center option
    Figure 18: Choosing authentication
  6. Choose either Require assignments, which means you explicitly select the users and groups that can access the domain, or Do not require assignments, which allows all authorized Okta users and groups to access the domain.
    1. The two options determine how your users access Amazon SageMaker Unified Studio through AWS IAM Identity Center federation with Okta:
      • Do not require assignments – Access to Amazon SageMaker Unified Studio is based on your Okta SAML application assignments, made through either group assignments or individual user assignments. In this example, if you choose Do not require assignments, all users in the unifiedstudio Okta group have access to Amazon SageMaker Unified Studio, because we assigned the unifiedstudio Okta group to the unifiedstudio SAML application in Okta.
      • Require assignments – You must add Okta users or Okta groups to the Amazon SageMaker domain, as shown in step 8. In step 8, you add the unifiedstudio Okta group to the Amazon SageMaker domain so that all users in that group get access to Amazon SageMaker Unified Studio. You can also grant access to individual Okta users by adding them as SSO users to the domain in the Amazon SageMaker console.
    2. For both options, an individual user or a group in Okta must be assigned to the AWS IAM Identity Center application from the Okta application catalog. (We renamed the application label to unifiedstudio for this example.)

    Screenshot of SageMaker Unified Studio SAML configuration showing assignment options

    Figure 19. Amazon SageMaker Unified Studio SAML configuration

  7. On the Review and save page, review your choices and then choose Save. Note that these settings are permanent once saved.

    Screenshot of SageMaker SAML configuration review and save page

    Figure 20. Review and confirm SAML configuration

  8. If you’ve chosen to require assignments, use the Add users and groups to add SAML users and groups to your domain.

    Screenshot of SageMaker domain showing Add users and groups interface for Okta group assignment

    Figure 21. Adding Okta group into Amazon SageMaker domain

  9. Users can now access Amazon SageMaker Unified Studio using the domain URL with their SSO credentials.
  10. You can create different projects for your users and assign them based on your SAML user groups for fine-grained access control. For example, you can create SAML user groups in Okta based on job function, assign those Okta groups to the AWS IAM Identity Center app in Okta, and then assign them to the respective project profiles in Amazon SageMaker Unified Studio. To assign a project profile to a group, choose the Project profiles tab, choose a project profile such as SQL analytics, choose the Authorized users and groups tab, choose Add, and select SSO groups from the drop-down list, as shown in Figure 22. Finally, choose Add users and groups to complete the project profile assignment.

    Screenshot of SageMaker Unified Studio project profile assignment interface showing SSO groups selection

    Figure 22. Assigning a project profile to Okta group
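
The interaction between the two assignment modes in step 6 can be summarized in a few lines of logic. The following is a minimal sketch that only restates the rules described above; it is not how SageMaker or IAM Identity Center implements them. The function name, parameters, and mode strings are hypothetical.

```python
from typing import Set


def has_domain_access(
    mode: str,
    user: str,
    user_groups: Set[str],
    okta_app_assignments: Set[str],
    domain_assignments: Set[str],
) -> bool:
    """Illustrative model of the assignment rules (all names are hypothetical)."""
    principals = {user} | user_groups

    # In both modes, the user or one of their groups must be assigned
    # to the AWS IAM Identity Center application in Okta.
    if not principals & okta_app_assignments:
        return False

    if mode == "DO_NOT_REQUIRE_ASSIGNMENTS":
        # The Okta application assignment alone is sufficient.
        return True
    if mode == "REQUIRE_ASSIGNMENTS":
        # The user or group must also be added to the SageMaker domain.
        return bool(principals & domain_assignments)
    raise ValueError(f"unknown mode: {mode}")
```

For example, a member of the unifiedstudio group gets access under Do not require assignments as soon as the group is assigned in Okta, but under Require assignments only after the group is also added to the domain.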

Test the setup

  1. The Amazon SageMaker Unified Studio URL can be found on the domain details page, as shown in Figure 23. The first time you open the URL, you are redirected to the Okta login screen.
    Screenshot of SageMaker domain details page showing the Unified Studio URL for user access

    Figure 23. Validating Okta user access with Amazon SageMaker Unified Studio

  2. Copy and paste the Amazon SageMaker Unified Studio URL into your browser and enter the user credentials.
  3. After successful login, you will be redirected to the Amazon SageMaker Unified Studio home page.

    Screenshot of Amazon SageMaker Unified Studio home page after successful SAML authentication

    Figure 24. SAML authenticated Amazon SageMaker Unified Studio

  4. After you log in to Amazon SageMaker Unified Studio, you can assign authorization policies based on your requirements. Choose Govern, choose Domain units, and then choose your SageMaker domain to select suitable authorization policies. For this example, we choose the project creation policy, as shown in Figure 25.

    Screenshot of SageMaker Unified Studio authorization policies interface showing project creation policy selection
    Figure 25. Amazon SageMaker Unified Studio authorization policies

  5. Choose Project membership policy, and then choose ADD POLICY GRANT to assign users or user groups to the respective project. For this example, we choose the project membership policy, as shown in Figure 26.

    Screenshot of SageMaker Unified Studio policy grant assignment interface for project membership

    Figure 26. Amazon SageMaker Unified Studio authorization policies assignment

You’ve now successfully configured single sign-on for Amazon SageMaker Unified Studio using Okta credentials through AWS IAM Identity Center.

Clean up

To avoid ongoing charges, delete the resources you created:

Conclusion

In this post, we showed you how to set up Okta as an identity provider using SAML authentication for Amazon SageMaker Unified Studio access through AWS IAM Identity Center federation. This setup allows your users to access SageMaker Unified Studio with their existing corporate credentials, eliminating the need for separate AWS accounts.

Get started by checking the Amazon SageMaker Unified Studio Developer Guide, which provides guidance on how to build data and AI applications using the Amazon SageMaker platform.


About the authors

Raghavarao Sodabathina

Raghavarao is a principal solutions architect at AWS, focusing on data analytics, AI/ML, and cloud security. He engages with customers to create innovative solutions that address customer business problems and accelerate the adoption of AWS services. In his spare time, Raghavarao enjoys spending time with his family, reading books, and watching movies.

Matt Nispel

Matt is an Enterprise Solutions Architect at AWS. He has more than 10 years of experience building cloud architectures for large enterprise companies. At AWS, Matt helps customers rearchitect their applications to take full advantage of the cloud. Matt lives in Minneapolis, Minnesota, and in his free time enjoys spending time with friends and family.

Nicholaus Lawson

Nicholaus is a Solution Architect at AWS and part of the AIML specialty group. He has a background in software engineering and AI research. Outside of work, Nicholaus is often coding, learning something new, or woodworking.

Jacob Grant

Jacob is a Solutions Architect at AWS, based in Atlanta, Georgia, with over four years of AWS experience. He is currently focused on helping HCLS customers build innovative solutions. Jacob has a passion for building solutions in the Machine Learning and Artificial Intelligence domain and has helped customers integrate agentic features into their workloads. Outside of work, Jacob enjoys spending time with his wife and their two young daughters, embracing family adventures whenever possible.

Architectural Patterns for real-time analytics using Amazon Kinesis Data Streams, Part 2: AI Applications

Post Syndicated from Raghavarao Sodabathina original https://aws.amazon.com/blogs/big-data/architectural-patterns-for-real-time-analytics-using-amazon-kinesis-data-streams-part-2-ai-applications/

Welcome back to our exciting exploration of architectural patterns for real-time analytics with Amazon Kinesis Data Streams! In this fast-paced world, Kinesis Data Streams stands out as a versatile and robust solution to tackle a wide range of use cases with real-time data, from dashboarding to powering artificial intelligence (AI) applications. In this series, we streamline the process of identifying and applying the most suitable architecture for your business requirements, and help kickstart your system development efficiently with examples.

Before we dive in, we recommend reviewing Architectural patterns for real-time analytics using Amazon Kinesis Data Streams, part 1 for the basic functionalities of Kinesis Data Streams. Part 1 also contains architectural examples for building real-time applications for time series data and event-sourcing microservices.

Now get ready as we embark on the second part of this series, where we focus on the AI applications with Kinesis Data Streams in three scenarios: real-time generative business intelligence (BI), real-time recommendation systems, and Internet of Things (IoT) data streaming and inferencing.

Real-time generative BI dashboards with Kinesis Data Streams, Amazon QuickSight, and Amazon Q

In today’s data-driven landscape, your organization likely possesses a vast amount of time-sensitive information that can be used to gain a competitive edge. The key to unlock the full potential of this real-time data lies in your ability to effectively make sense of it and transform it into actionable insights in real time. This is where real-time BI tools such as live dashboards come into play, assisting you with data aggregation, analysis, and visualization, therefore accelerating your decision-making process.

To help streamline this process and empower your team with real-time insights, Amazon has introduced Amazon Q in QuickSight. Amazon Q is a generative AI-powered assistant that you can configure to answer questions, provide summaries, generate content, and complete tasks based on your data. Amazon QuickSight is a fast, cloud-powered BI service that delivers insights.

With Amazon Q in QuickSight, you can use natural language prompts to build, discover, and share meaningful insights in seconds, creating context-aware data Q&A experiences and interactive data stories from the real-time data. For example, you can ask “Which products grew the most year-over-year?” and Amazon Q will automatically parse the questions to understand the intent, retrieve the corresponding data, and return the answer in the form of a number, chart, or table in QuickSight.

By using the architecture illustrated in the following figure, your organization can harness the power of streaming data and transform it into visually compelling and informative dashboards that provide real-time insights. With the power of natural language querying and automated insights at your fingertips, you’ll be well-equipped to make informed decisions and stay ahead in today’s competitive business landscape.

Build real-time generative business intelligence dashboards with Amazon Kinesis Data Streams, Amazon QuickSight, and Amazon Q

The steps in the workflow are as follows:

  1. We use Amazon DynamoDB here as an example for the primary data store. Kinesis Data Streams can ingest data in real time from data stores such as DynamoDB to capture item-level changes in your table.
  2. After capturing data to Kinesis Data Streams, you can ingest the data into analytic databases such as Amazon Redshift in near-real time. Amazon Redshift Streaming Ingestion simplifies data pipelines by letting you create materialized views directly on top of data streams. With this capability, you can use SQL (Structured Query Language) to connect to and directly ingest the data stream from Kinesis Data Streams to analyze and run complex analytical queries.
  3. After the data is in Amazon Redshift, you can create a business report using QuickSight. Connectivity between a QuickSight dashboard and Amazon Redshift enables you to deliver visualization and insights. With the power of Amazon Q in QuickSight, you can quickly build and refine the analytics and visuals with natural language inputs.
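
To make step 2 more concrete, the following sketch renders the two SQL statements that a Redshift streaming ingestion setup typically needs: an external schema that points at Kinesis, and an auto-refreshing materialized view over the stream. It follows the pattern in the Amazon Redshift documentation as we understand it, so verify the syntax against the current documentation before use. The helper name, schema name, view name, stream name, and IAM role ARN are all hypothetical.

```python
from typing import List


def streaming_ingestion_sql(
    schema: str, iam_role_arn: str, stream_name: str, view_name: str
) -> List[str]:
    """Render the SQL for Redshift streaming ingestion (illustrative helper)."""
    # External schema that maps Kinesis Data Streams into Redshift.
    create_schema = (
        f"CREATE EXTERNAL SCHEMA {schema}\n"
        f"FROM KINESIS\n"
        f"IAM_ROLE '{iam_role_arn}';"
    )
    # Materialized view defined directly on top of the stream.
    # JSON_PARSE assumes the stream payload is JSON.
    create_view = (
        f"CREATE MATERIALIZED VIEW {view_name} AUTO REFRESH YES AS\n"
        f"SELECT approximate_arrival_timestamp,\n"
        f"       partition_key,\n"
        f"       JSON_PARSE(kinesis_data) AS payload\n"
        f'FROM {schema}."{stream_name}";'
    )
    return [create_schema, create_view]


if __name__ == "__main__":
    statements = streaming_ingestion_sql(
        schema="kds",
        iam_role_arn="arn:aws:iam::123456789012:role/hypothetical-streaming-role",
        stream_name="orders-stream",
        view_name="orders_stream_mv",
    )
    print("\n\n".join(statements))
```

QuickSight can then query the materialized view like any other Redshift relation.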

For more details on how customers have built near real-time BI dashboards using Kinesis Data Streams, refer to the following:

Real-time recommendation systems with Kinesis Data Streams and Amazon Personalize

Imagine creating a user experience so personalized and engaging that your customers feel truly valued and appreciated. By using real-time data about user behavior, you can tailor each user’s experience to their unique preferences and needs, fostering a deep connection between your brand and your audience. You can achieve this by using Kinesis Data Streams and Amazon Personalize, a fully managed machine learning (ML) service that generates product and content recommendations for your users, instead of building your own recommendation engine from scratch.

With Kinesis Data Streams, your organization can effortlessly ingest user behavior data from millions of endpoints into a centralized data stream in real time. This allows recommendation engines such as Amazon Personalize to read from the centralized data stream and generate personalized recommendations for each user on the fly. Additionally, you could use enhanced fan-out to deliver dedicated throughput to your mission-critical consumers at even lower latency, further enhancing the responsiveness of your real-time recommendation system. The following figure illustrates a typical architecture for building real-time recommendations with Amazon Personalize.

Build real-time recommendation systems with Kinesis Data Streams and Amazon Personalize

The steps are as follows:

  1. Create a dataset group, schemas, and datasets that represent your items, interactions, and user data.
  2. After importing your datasets into a dataset group from Amazon Simple Storage Service (Amazon S3), select the recipe that best matches your use case, and then create a solution and train a model by creating a solution version. When your solution version is complete, you can create a campaign for it.
  3. After a campaign has been created, you can integrate calls to the campaign in your application. This is where calls to the GetRecommendations or GetPersonalizedRanking APIs are made to request near-real-time recommendations from Amazon Personalize. Your website or mobile application calls an AWS Lambda function over Amazon API Gateway to receive recommendations for your business apps.
  4. An event tracker provides an endpoint that allows you to stream interactions that occur in your application back to Amazon Personalize in near-real time. You do this by using the PutEvents API. You can build an event collection pipeline using API Gateway, Kinesis Data Streams, and Lambda to receive and forward interactions to Amazon Personalize. The event tracker performs two primary functions. First, it persists all streamed interactions so they will be incorporated into future retrainings of your model. Second, it is how Amazon Personalize cold starts new users. When a new user visits your site, Amazon Personalize will recommend popular items. After you stream in an event or two, Amazon Personalize immediately starts adjusting recommendations.
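
The event collection pipeline in step 4 can be sketched as a Lambda-style function that reads a Kinesis batch and forwards each interaction with PutEvents. This is a minimal sketch under stated assumptions: the interaction field names (`user_id`, `session_id`, `event_type`, `item_id`, `timestamp`) and the function names are hypothetical, and the client is passed in so the logic can be exercised without AWS access. In a real function you would pass `boto3.client("personalize-events")`.

```python
import base64
import json
from datetime import datetime, timezone
from typing import Any, Dict


def build_put_events_request(tracking_id: str, interaction: Dict[str, Any]) -> Dict[str, Any]:
    """Map one interaction (hypothetical JSON shape) to PutEvents parameters."""
    return {
        "trackingId": tracking_id,
        "userId": interaction["user_id"],
        "sessionId": interaction["session_id"],
        "eventList": [
            {
                "eventType": interaction["event_type"],
                "itemId": interaction["item_id"],
                # Epoch seconds converted to a timezone-aware datetime.
                "sentAt": datetime.fromtimestamp(interaction["timestamp"], tz=timezone.utc),
            }
        ],
    }


def handle_kinesis_batch(event: Dict[str, Any], personalize_events: Any, tracking_id: str) -> int:
    """Forward every record in a Kinesis-triggered Lambda event; return the count sent."""
    sent = 0
    for record in event["Records"]:
        # Lambda delivers Kinesis record payloads base64-encoded.
        payload = base64.b64decode(record["kinesis"]["data"])
        interaction = json.loads(payload)
        personalize_events.put_events(**build_put_events_request(tracking_id, interaction))
        sent += 1
    return sent
```

A production version would also need error handling and retries, which are omitted here for brevity.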

To learn how other customers have built personalized recommendations using Kinesis Data Streams, refer to the following:

Real-time IoT data streaming and inferencing with AWS IoT Core and Amazon SageMaker

From office lights that automatically turn on as you enter the room to medical devices that monitor a patient’s health in real time, a proliferation of smart devices is making the world more automated and connected. In technical terms, IoT is the network of devices that connect with the internet and can exchange data with other devices and software systems. Many organizations increasingly rely on the real-time data from IoT devices, such as temperature sensors and medical equipment, to drive automation, analytics, and AI systems. It’s important to choose a robust streaming solution that can achieve very low latency and handle high data throughput to power real-time AI inferencing.

With Kinesis Data Streams, IoT data across millions of devices can simultaneously write to a centralized data stream. Alternatively, you can use AWS IoT Core to securely connect and easily manage the fleet of IoT devices, collect the IoT data, and then ingest to Kinesis Data Streams for real-time transformation, analytics, and event-driven microservices. Then, you can use integrated services such as Amazon SageMaker for real-time inference. The following diagram depicts the high-level streaming architecture with IoT sensor data.

Build real-time IoT data streaming & inferencing pipeline with AWS IoT & Amazon SageMaker

The steps are as follows:

  1. Data originates in IoT devices such as medical devices, car sensors, and industrial IoT sensors. This telemetry data is collected using AWS IoT Greengrass, an open source IoT edge runtime and cloud service that helps your devices collect and analyze data closer to where the data is generated.
  2. Event data is ingested into the cloud using edge-to-cloud interface services such as AWS IoT Core, a managed cloud platform that connects, manages, and scales devices effortlessly and securely. You can also use AWS IoT SiteWise, a managed service that helps you collect, model, analyze, and visualize data from industrial equipment at scale. Alternatively, IoT devices could send data directly to Kinesis Data Streams.
  3. AWS IoT Core can stream ingested data into Kinesis Data Streams.
  4. The ingested data gets transformed and analyzed in near real time using Amazon Managed Service for Apache Flink. Stream data can further be enriched using lookup data hosted in a data warehouse such as Amazon Redshift. Managed Service for Apache Flink can persist streamed data into Amazon Redshift after the customer’s integration and stream aggregation (for example, 1 minute or 5 minutes). The results in Amazon Redshift can be used for further downstream BI reporting services, such as QuickSight. Managed Service for Apache Flink can also write to a Lambda function, which can invoke SageMaker models. After the ML model is trained and deployed in SageMaker, inferences are invoked in a microbatch using Lambda. Inferenced data is sent to Amazon OpenSearch Service to create personalized monitoring dashboards using OpenSearch Dashboards. The transformed IoT sensor data can be stored in DynamoDB. You can use AWS AppSync to provide near real-time data queries to API services for downstream applications. These enterprise applications can be mobile apps or business applications to track and monitor the IoT sensor data in near real time.
  5. The streamed IoT data can be written to an Amazon Data Firehose delivery stream, which microbatches data into Amazon S3 for future analytics.
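
The microbatch inference described in step 4 might look like the following sketch, in which a Lambda-style function splits sensor readings into batches and calls a SageMaker endpoint once per batch. The request and response payload formats depend entirely on the deployed model, so the `{"instances": [...]}` body and the JSON list response are assumptions, as are the function name and the batch size. The client is injected; in AWS you would pass `boto3.client("sagemaker-runtime")`.

```python
import json
from typing import Any, Dict, List


def invoke_microbatch(
    readings: List[Dict[str, Any]],
    sagemaker_runtime: Any,
    endpoint_name: str,
    batch_size: int = 100,
) -> List[Any]:
    """Send readings to a SageMaker endpoint in fixed-size batches."""
    predictions: List[Any] = []
    for start in range(0, len(readings), batch_size):
        batch = readings[start:start + batch_size]
        response = sagemaker_runtime.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="application/json",
            # Assumed request format; adapt to what your model container expects.
            Body=json.dumps({"instances": batch}),
        )
        # The response body is a stream; the JSON list format is an assumption.
        predictions.extend(json.loads(response["Body"].read()))
    return predictions
```

The returned predictions could then be indexed into OpenSearch Service or written to DynamoDB, as the workflow describes.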

To learn how other customers have built IoT device monitoring solutions using Kinesis Data Streams, refer to:

Conclusion

This post demonstrated additional architectural patterns for building low-latency AI applications with Kinesis Data Streams and its integrations with other AWS services. If you are looking to build generative BI, recommendation systems, or IoT data streaming and inferencing, you can use these patterns as a starting point for designing your cloud architecture. We will continue to add new architectural patterns in future posts of this series.

For detailed architectural patterns, refer to the following resources:

If you want to build a data vision and strategy, check out the AWS Data-Driven Everything (D2E) program.


About the Authors

Raghavarao Sodabathina is a Principal Solutions Architect at AWS, focusing on Data Analytics, AI/ML, and cloud security. He engages with customers to create innovative solutions that address customer business problems and to accelerate the adoption of AWS services. In his spare time, Raghavarao enjoys spending time with his family, reading books, and watching movies.

Hang Zuo is a Senior Product Manager on the Amazon Kinesis Data Streams team at Amazon Web Services. He is passionate about developing intuitive product experiences that solve complex customer problems and enable customers to achieve their business goals.

Shwetha Radhakrishnan is a Solutions Architect for AWS with a focus in Data Analytics. She has been building solutions that drive cloud adoption and help organizations make data-driven decisions within the public sector. Outside of work, she loves dancing, spending time with friends and family, and traveling.

Brittany Ly is a Solutions Architect at AWS. She is focused on helping enterprise customers with their cloud adoption and modernization journey and has an interest in the security and analytics field. Outside of work, she loves to spend time with her dog and play pickleball.

Architectural patterns for real-time analytics using Amazon Kinesis Data Streams, part 1

Post Syndicated from Raghavarao Sodabathina original https://aws.amazon.com/blogs/big-data/architectural-patterns-for-real-time-analytics-using-amazon-kinesis-data-streams-part-1/

We’re living in the age of real-time data and insights, driven by low-latency data streaming applications. Today, everyone expects a personalized experience in any application, and organizations are constantly innovating to increase their speed of business operation and decision making. The volume of time-sensitive data produced is increasing rapidly, with different formats of data being introduced across new businesses and customer use cases. Therefore, it is critical for organizations to embrace a low-latency, scalable, and reliable data streaming infrastructure to deliver real-time business applications and better customer experiences.

This is the first post to a blog series that offers common architectural patterns in building real-time data streaming infrastructures using Kinesis Data Streams for a wide range of use cases. It aims to provide a framework to create low-latency streaming applications on the AWS Cloud using Amazon Kinesis Data Streams and AWS purpose-built data analytics services.

In this post, we will review the common architectural patterns of two use cases: Time Series Data Analysis and Event Driven Microservices. In the subsequent post in our series, we will explore the architectural patterns in building streaming pipelines for real-time BI dashboards, contact center agent, ledger data, personalized real-time recommendation, log analytics, IoT data, Change Data Capture, and real-time marketing data. All these architecture patterns are integrated with Amazon Kinesis Data Streams.

Real-time streaming with Kinesis Data Streams

Amazon Kinesis Data Streams is a cloud-native, serverless streaming data service that makes it easy to capture, process, and store real-time data at any scale. With Kinesis Data Streams, you can collect and process hundreds of gigabytes of data per second from hundreds of thousands of sources, allowing you to easily write applications that process information in real-time. The collected data is available in milliseconds to allow real-time analytics use cases, such as real-time dashboards, real-time anomaly detection, and dynamic pricing. By default, the data within the Kinesis Data Stream is stored for 24 hours with an option to increase the data retention to 365 days. If customers want to process the same data in real-time with multiple applications, then they can use the Enhanced Fan-Out (EFO) feature. Prior to this feature, every application consuming data from the stream shared the 2MB/second/shard output. By configuring stream consumers to use enhanced fan-out, each data consumer receives dedicated 2MB/second pipe of read throughput per shard to further reduce the latency in data retrieval.
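
The effect of enhanced fan-out on read throughput can be worked through with simple arithmetic. The sketch below is an idealized model that assumes shared consumers split the 2 MB/second per shard evenly; real consumers will not divide capacity this neatly. The function name is hypothetical.

```python
SHARD_READ_MB_PER_SEC = 2.0  # read throughput per shard, as stated above


def per_consumer_read_mb_per_sec(shards: int, consumers: int, enhanced_fan_out: bool) -> float:
    """Idealized read throughput available to each consumer of a stream."""
    if shards < 1 or consumers < 1:
        raise ValueError("shards and consumers must be at least 1")
    total = shards * SHARD_READ_MB_PER_SEC
    # With enhanced fan-out, every consumer gets its own 2 MB/second per shard.
    # Without it, all consumers share the same per-shard output.
    return total if enhanced_fan_out else total / consumers
```

For a stream with 4 shards and 4 consuming applications, this model gives each shared consumer 2 MB/second in total, compared with 8 MB/second each when enhanced fan-out is enabled.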

Kinesis Data Streams achieves high availability and durability by synchronously replicating the streamed data across three Availability Zones in an AWS Region, and gives you the option to retain data for up to 365 days. For security, Kinesis Data Streams provides server-side encryption so you can meet strict data management requirements by encrypting your data at rest, and Amazon Virtual Private Cloud (VPC) interface endpoints to keep traffic between your Amazon VPC and Kinesis Data Streams private.

Kinesis Data Streams has native integrations with other AWS services such as AWS Glue and Amazon EventBridge to build real-time streaming applications on AWS. Refer to Amazon Kinesis Data Streams integrations for additional details.

Modern data streaming architecture with Kinesis Data Streams

A modern streaming data architecture with Kinesis Data Streams can be designed as a stack of five logical layers; each layer is composed of multiple purpose-built components that address specific requirements, as illustrated in the following diagram:

The architecture consists of the following key components:

  • Streaming sources – Your source of streaming data includes data sources like clickstream data, sensors, social media, Internet of Things (IoT) devices, log files generated by using your web and mobile applications, and mobile devices that generate semi-structured and unstructured data as continuous streams at high velocity.
  • Stream ingestion – The stream ingestion layer is responsible for ingesting data into the stream storage layer. It provides the ability to collect data from tens of thousands of data sources and ingest in real time. You can use the Kinesis SDK for ingesting streaming data through APIs, the Kinesis Producer Library for building high-performance and long-running streaming producers, or a Kinesis agent for collecting a set of files and ingesting them into Kinesis Data Streams. In addition, you can use many prebuilt integrations such as AWS Database Migration Service (AWS DMS), Amazon DynamoDB, and AWS IoT Core to ingest data in a no-code fashion. You can also ingest data from third-party platforms such as Apache Spark and Apache Kafka Connect.
  • Stream storage – Kinesis Data Streams offer two modes to support the data throughput: On-Demand and Provisioned. On-Demand mode, now the default choice, can elastically scale to absorb variable throughputs, so that customers do not need to worry about capacity management and pay by data throughput. The On-Demand mode automatically scales up 2x the stream capacity over its historic maximum data ingestion to provide sufficient capacity for unexpected spikes in data ingestion. Alternatively, customers who want granular control over stream resources can use the Provisioned mode and proactively scale up and down the number of Shards to meet their throughput requirements. Additionally, Kinesis Data Streams can store streaming data up to 24 hours by default, but can extend to 7 days or 365 days depending upon use cases. Multiple applications can consume the same stream.
  • Stream processing – The stream processing layer is responsible for transforming data into a consumable state through data validation, cleanup, normalization, transformation, and enrichment. The streaming records are read in the order they are produced, allowing for real-time analytics, building event-driven applications or streaming ETL (extract, transform, and load). You can use Amazon Managed Service for Apache Flink for complex stream data processing, AWS Lambda for stateless stream data processing, and AWS Glue & Amazon EMR for near-real-time compute. You can also build customized consumer applications with Kinesis Consumer Library, which will take care of many complex tasks associated with distributed computing.
  • Destination – The destination layer is a purpose-built destination that depends on your use case. You can stream data directly to Amazon Redshift for data warehousing and Amazon EventBridge for building event-driven applications. You can also use Amazon Kinesis Data Firehose for streaming integration, where you can perform light stream processing with AWS Lambda and then deliver the processed stream into destinations like an Amazon S3 data lake, OpenSearch Service for operational analytics, a Redshift data warehouse, NoSQL databases like Amazon DynamoDB, and relational databases like Amazon RDS to consume real-time streams into business applications. The destination can be an event-driven application for real-time dashboards, automatic decisions based on processed streaming data, real-time alerting, and more.
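
As an illustration of the stream ingestion layer, the following sketch shows one way a producer could send events through the SDK using the PutRecords API, which accepts up to 500 records per request. The helper names, stream name, and event shape are hypothetical, and the client is injected so the batching logic can be tested offline; in AWS you would pass `boto3.client("kinesis")`.

```python
import json
from typing import Any, Dict, Iterable, List

MAX_RECORDS_PER_CALL = 500  # PutRecords limit on records per request


def to_kinesis_records(events: Iterable[Dict[str, Any]], key_field: str) -> List[Dict[str, Any]]:
    """Serialize events as JSON and use one field as the partition key."""
    return [
        {"Data": json.dumps(event).encode("utf-8"), "PartitionKey": str(event[key_field])}
        for event in events
    ]


def put_in_batches(kinesis: Any, stream_name: str, records: List[Dict[str, Any]]) -> int:
    """Send records in chunks of at most 500; return how many records failed."""
    failed = 0
    for start in range(0, len(records), MAX_RECORDS_PER_CALL):
        chunk = records[start:start + MAX_RECORDS_PER_CALL]
        response = kinesis.put_records(StreamName=stream_name, Records=chunk)
        # PutRecords can partially fail, so the count must be checked per call.
        failed += response.get("FailedRecordCount", 0)
    return failed
```

A nonzero return value means some records were rejected, for example because of throttling, and should be retried. For long-running, high-throughput producers, the Kinesis Producer Library mentioned above handles batching and retries for you.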

Real-time analytics architecture for time series

Time series data is a sequence of data points recorded over a time interval for measuring events that change over time. Examples are stock prices over time, webpage clickstreams, and device logs over time. Customers can use time series data to monitor changes over time, so that they can detect anomalies, identify patterns, and analyze how certain variables are influenced over time. Time series data is typically generated from multiple sources in high volumes, and it needs to be cost-effectively collected in near real time.

Typically, there are three primary goals that customers want to achieve in processing time-series data:

  • Gain real-time insights into system performance and detect anomalies
  • Understand end-user behavior to track trends and query/build visualizations from these insights
  • Have a durable storage solution to ingest and store both archival and frequently accessed data

With Kinesis Data Streams, customers can continuously capture terabytes of time series data from thousands of sources for cleaning, enrichment, storage, analysis, and visualization.

The following architecture pattern illustrates how real time analytics can be achieved for Time Series data with Kinesis Data Streams:

Build a serverless streaming data pipeline for time series data

The workflow steps are as follows:

  1. Data Ingestion & Storage – Kinesis Data Streams can continuously capture and store terabytes of data from thousands of sources.
  2. Stream Processing – An application created with Amazon Managed Service for Apache Flink can read the records from the data stream to detect and clean any errors in the time series data and enrich the data with specific metadata to optimize operational analytics. Using a data stream in the middle provides the advantage of using the time series data in other processes and solutions at the same time. A Lambda function is then invoked with these events, and can perform time series calculations in memory.
  3. Destinations – After cleaning and enrichment, the processed time series data can be streamed to Amazon Timestream database for real-time dashboarding and analysis, or stored in databases such as DynamoDB for end-user query. The raw data can be streamed to Amazon S3 for archiving.
  4. Visualization & Gain insights – Customers can query, visualize, and create alerts using Amazon Managed Service for Grafana. Grafana supports data sources that are storage backends for time series data. To access your data from Timestream, you need to install the Timestream plugin for Grafana. End-users can query data from the DynamoDB table with Amazon API Gateway acting as a proxy.
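
As an example of the in-memory time series calculations mentioned in step 2, the following sketch computes a rolling mean over a sliding time window. It is one possible calculation among many, not the specific logic of the referenced architecture. The function name is hypothetical, and the input is assumed to be (timestamp in seconds, value) pairs in ascending timestamp order.

```python
from collections import deque
from typing import Deque, Iterable, List, Tuple


def rolling_mean(
    points: Iterable[Tuple[float, float]], window_seconds: float
) -> List[Tuple[float, float]]:
    """Mean of the values within the last window_seconds, emitted per point."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    window: Deque[Tuple[float, float]] = deque()
    total = 0.0
    result: List[Tuple[float, float]] = []
    for ts, value in points:
        window.append((ts, value))
        total += value
        # Evict points that are window_seconds old or older.
        while window[0][0] <= ts - window_seconds:
            _, old_value = window.popleft()
            total -= old_value
        result.append((ts, total / len(window)))
    return result
```

Each output pair could then be written to Timestream for dashboarding or to DynamoDB for end-user queries.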

Refer to Near Real-Time Processing with Amazon Kinesis, Amazon Timestream, and Grafana showcasing a serverless streaming pipeline to process and store device telemetry IoT data into a time series optimized data store such as Amazon Timestream.

Enriching & replaying data in real time for event-sourcing microservices

Microservices are an architectural and organizational approach to software development where software is composed of small independent services that communicate over well-defined APIs. When building event-driven microservices, customers want to achieve (1) high scalability to handle the volume of incoming events and (2) reliable event processing that maintains system functionality in the face of failures.

Customers use microservice architecture patterns to accelerate innovation and time-to-market for new features, because they make applications easier to scale and faster to develop. However, it is challenging to enrich and replay the data in a network call to another microservice because it can impact the reliability of the application and make it difficult to debug and trace errors. To solve this problem, event-sourcing is an effective design pattern that centralizes historic records of all state changes for enrichment and replay, and decouples read from write workloads. Customers can use Kinesis Data Streams as the centralized event store for event-sourcing microservices, because Kinesis Data Streams can (1) handle gigabytes of data throughput per second per stream and stream the data in milliseconds, which meets the requirements for high scalability and near real-time latency, (2) integrate with Flink and S3 for data enrichment and archiving while remaining completely decoupled from the microservices, and (3) allow retries and asynchronous reads at a later time, because it retains data records for 24 hours by default and optionally for up to 365 days.

The following architectural pattern is a generic illustration of how Kinesis Data Streams can be used for Event-Sourcing Microservices:

The steps in the workflow are as follows:

  1. Data Ingestion and Storage – You can aggregate the input from your microservices to your Kinesis Data Streams for storage.
  2. Stream processing – Apache Flink Stateful Functions simplifies building distributed stateful event-driven applications. It can receive the events from an input Kinesis data stream and route the resulting stream to an output data stream. You can create a stateful functions cluster with Apache Flink based on your application business logic.
  3. State snapshot in Amazon S3 – You can store the state snapshot in Amazon S3 for tracking.
  4. Output streams – The output streams can be consumed through Lambda remote functions through HTTP/gRPC protocol through API Gateway.
  5. Lambda remote functions – Lambda functions can act as microservices for various application and business logic to serve business applications and mobile apps.
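The event-sourcing idea behind this workflow can be sketched in a few lines of Python: state is never stored directly, it is rebuilt by replaying the ordered event log. The `Event` type, the `deposited` and `withdrawn` event kinds, and the account-balance domain are all hypothetical and chosen only for illustration; in the architecture above, the event log would be the Kinesis data stream.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    sequence: int  # position in the event log
    kind: str      # hypothetical event kinds: "deposited" or "withdrawn"
    amount: int


def apply(balance, event):
    """Apply a single event to the current state and return the new state."""
    if event.kind == "deposited":
        return balance + event.amount
    if event.kind == "withdrawn":
        return balance - event.amount
    return balance  # unknown event kinds leave the state unchanged


def replay(events, up_to=None):
    """Rebuild state by folding events in sequence order, optionally stopping early."""
    balance = 0
    for event in sorted(events, key=lambda e: e.sequence):
        if up_to is not None and event.sequence > up_to:
            break
        balance = apply(balance, event)
    return balance


log = [Event(1, "deposited", 100), Event(2, "withdrawn", 30), Event(3, "deposited", 5)]
current_balance = replay(log)             # state after all events
balance_at_two = replay(log, up_to=2)     # state as of sequence number 2
```

Replaying up to an earlier sequence number is what makes debugging and reprocessing possible without calling back into the producing microservice.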

To learn how other customers built their event-based microservices with Kinesis Data Streams, refer to the following:

Key considerations and best practices

The following are considerations and best practices to keep in mind:

  • Data discovery should be your first step in building modern data streaming applications. You must define the business value and then identify your streaming data sources and user personas to achieve the desired business outcomes.
  • Choose your streaming data ingestion tool based on your streaming data source. For example, you can use the Kinesis SDK for ingesting streaming data through APIs, the Kinesis Producer Library for building high-performance and long-running streaming producers, a Kinesis agent for collecting a set of files and ingesting them into Kinesis Data Streams, AWS DMS for CDC streaming use cases, and AWS IoT Core for ingesting IoT device data into Kinesis Data Streams. You can ingest streaming data directly into Amazon Redshift to build low-latency streaming applications. You can also use third-party libraries like Apache Spark and Apache Kafka to ingest streaming data into Kinesis Data Streams.
  • Choose your streaming data processing services based on your specific use case and business requirements. For example, you can use Amazon Managed Service for Apache Flink for advanced streaming use cases with multiple streaming destinations and complex stateful stream processing, or if you want to monitor business metrics in real time (such as every hour). Lambda is good for event-based and stateless processing. You can use Amazon EMR for streaming data processing to use your favorite open source big data frameworks. AWS Glue is good for near-real-time streaming data processing for use cases such as streaming ETL.
  • Kinesis Data Streams on-demand mode charges by usage and automatically scales up resource capacity, so it’s good for spiky streaming workloads and hands-free maintenance. Provisioned mode charges by capacity and requires proactive capacity management, so it’s good for predictable streaming workloads.
  • You can use the Kinesis Shard Calculator to calculate the number of shards needed for provisioned mode. You don’t need to be concerned about shards with on-demand mode.
  • When granting permissions, you decide who is getting what permissions to which Kinesis Data Streams resources. You enable specific actions that you want to allow on those resources. Therefore, you should grant only the permissions that are required to perform a task. You can also encrypt the data at rest by using a KMS customer managed key (CMK).
  • You can update the retention period via the Kinesis Data Streams console or by using the IncreaseStreamRetentionPeriod and the DecreaseStreamRetentionPeriod operations based on your specific use cases.
  • Kinesis Data Streams supports resharding. The recommended API for this function is UpdateShardCount, which allows you to modify the number of shards in your stream to adapt to changes in the rate of data flow through the stream. The resharding APIs (Split and Merge) are typically used to handle hot shards.
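For provisioned mode, a rough shard estimate can be derived from the per-shard limits of Kinesis Data Streams (1 MB per second or 1,000 records per second for writes, and 2 MB per second for shared-throughput reads). The following Python sketch is a back-of-the-envelope helper under those limits, not a replacement for the shard calculator; the function name and the example traffic numbers are assumptions.

```python
import math

# Per-shard limits for provisioned mode with shared-throughput consumers.
WRITE_MB_PER_SHARD = 1.0
WRITE_RECORDS_PER_SHARD = 1000
READ_MB_PER_SHARD = 2.0


def estimate_shards(write_mb_per_s, records_per_s, read_mb_per_s):
    """Return the smallest shard count that satisfies all three per-shard limits."""
    return max(
        1,
        math.ceil(write_mb_per_s / WRITE_MB_PER_SHARD),
        math.ceil(records_per_s / WRITE_RECORDS_PER_SHARD),
        math.ceil(read_mb_per_s / READ_MB_PER_SHARD),
    )


# Hypothetical workload: 4.5 MB/s in, 2,000 records/s, 9 MB/s read in total.
shards_needed = estimate_shards(4.5, 2000, 9.0)
```

The resulting number is the kind of value you would pass as the target shard count when calling UpdateShardCount. Leave headroom for traffic spikes, because the estimate assumes records are evenly distributed across partition keys.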

Conclusion

This post demonstrated various architectural patterns for building low-latency streaming applications with Kinesis Data Streams. You can build your own low-latency streaming applications with Kinesis Data Streams using the information in this post.

For detailed architectural patterns, refer to the following resources:

If you want to build a data vision and strategy, check out the AWS Data-Driven Everything (D2E) program.


About the Authors

Raghavarao Sodabathina is a Principal Solutions Architect at AWS, focusing on Data Analytics, AI/ML, and cloud security. He engages with customers to create innovative solutions that address customer business problems and to accelerate the adoption of AWS services. In his spare time, Raghavarao enjoys spending time with his family, reading books, and watching movies.

Hang Zuo is a Senior Product Manager on the Amazon Kinesis Data Streams team at Amazon Web Services. He is passionate about developing intuitive product experiences that solve complex customer problems and enable customers to achieve their business goals.

Shwetha Radhakrishnan is a Solutions Architect for AWS with a focus in Data Analytics. She has been building solutions that drive cloud adoption and help organizations make data-driven decisions within the public sector. Outside of work, she loves dancing, spending time with friends and family, and traveling.

Brittany Ly is a Solutions Architect at AWS. She is focused on helping enterprise customers with their cloud adoption and modernization journey and has an interest in the security and analytics field. Outside of work, she loves to spend time with her dog and play pickleball.

Building event-driven architectures with IoT sensor data

Post Syndicated from Raghavarao Sodabathina original https://aws.amazon.com/blogs/architecture/building-event-driven-architectures-with-iot-sensor-data/

The Internet of Things (IoT) brings sensors, cloud computing, analytics, and people together to improve productivity and efficiency. It empowers customers with the intelligence they need to build new services and business models, improve products and services over time, understand their customers’ needs to provide better services, and improve customer experiences. Business operations become more efficient by making intelligent decisions more quickly and over time develop a data-driven discipline leading to revenue growth and greater operational efficiency.

In this post, we showcase how to build an event-driven architecture by using AWS IoT services and AWS purpose-built data services. We also discuss key considerations and best practices while building event-driven application architectures with IoT sensor data.

Deriving insights from IoT sensor data

Organizations create value by making decisions from their IoT sensor data in near real time. Some common use cases and solutions that fit under event-driven architecture using IoT sensor data include:

  • Medical device data collection for personalized patient health monitoring, adverse event prediction, and avoidance.
  • Industrial IoT use cases to monitor equipment quality and determine actions like adjusting machine settings, using different sources of raw materials, or performing additional worker training to improve the quality of the factory output.
  • Connected vehicle use cases, such as voice interaction, navigation, location-based services, remote vehicle diagnostics, predictive maintenance, media streaming, and vehicle safety, that are based on in-vehicle computing and near real-time predictive analytics in the cloud.
  • Sustainability and waste reduction solutions, which provide access to dashboards, monitoring systems, data collection, and summarization tools that use machine learning (ML) algorithms to meet sustainability goals. Meeting sustainability goals is paramount for customers in the travel and hospitality industries.

Event-driven reference architecture with IoT sensor data

Figure 1 illustrates how to architect an event-driven architecture with IoT sensor data for near real-time predictive analytics and recommendations.

Building event-driven architecture with IoT sensor data

Figure 1. Building event-driven architecture with IoT sensor data

Architecture flow:

  1. Data originates in IoT devices such as medical devices, car sensors, and industrial IoT sensors. This telemetry data is collected using AWS IoT Greengrass, an open-source IoT edge runtime and cloud service that helps your devices collect and analyze data closer to where the data is generated. When an event arrives, AWS IoT Greengrass reacts autonomously to local events, filters and aggregates device data, then communicates securely with the cloud and other local devices in your network to send the data.
  2. Event data is ingested into the cloud using edge-to-cloud interface services such as AWS IoT Core, a managed cloud platform that connects, manages, and scales devices easily and securely. AWS IoT Core interacts with cloud applications and other devices. You can also use AWS IoT SiteWise, a managed service that helps you collect, model, analyze, and visualize data from industrial equipment at scale.
  3. AWS IoT Core can directly stream ingested data into Amazon Kinesis Data Streams. The ingested data gets transformed and analyzed in near real time using Amazon Kinesis Data Analytics with Apache Flink and Apache Beam frameworks. Stream data can further be enriched using lookup data hosted in a data warehouse such as Amazon Redshift. Amazon Kinesis Data Analytics can persist SQL results to Amazon Redshift after the customer’s integration and stream aggregation (for example, one minute or five minutes). The results in Amazon Redshift can be used for further downstream business intelligence (BI) reporting services, such as Amazon QuickSight.
  4. Amazon Kinesis Data Analytics can also write to an AWS Lambda function, which can invoke Amazon SageMaker models. Amazon SageMaker is an end-to-end service for machine learning.
  5. Once the ML model is trained and deployed in SageMaker, inferences are invoked in a micro batch using AWS Lambda. Inferenced data is sent to Amazon OpenSearch Service to create personalized monitoring dashboards using Amazon OpenSearch Service dashboards. The transformed IoT sensor data can be stored in Amazon DynamoDB. Customers can use AWS AppSync to provide near real-time data queries to API services for downstream applications. These enterprise applications can be mobile apps or business applications to track and monitor the IoT sensor data in near real time. Amazon Kinesis Data Analytics can write to an Amazon Kinesis Data Firehose stream, which is a fully managed service for delivering near real-time streaming data to destinations like Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Amazon OpenSearch Service, Splunk, and any custom HTTP endpoints or endpoints owned by supported third-party service providers, including Datadog, Dynatrace, LogicMonitor, MongoDB, New Relic, and Sumo Logic.

    In this example, data from Amazon Kinesis Data Analytics is written to Amazon Kinesis Data Firehose, which micro-batch streams data into an Amazon S3 data lake. The Amazon S3 data lake stores telemetry data for future batch analytics.
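The stream aggregation mentioned in step 3 (for example, one-minute or five-minute windows) can be illustrated with a small tumbling-window sketch. This is plain Python rather than a Flink or Beam job, and the `(device_id, epoch_seconds, value)` tuple layout is an assumption made for the example.

```python
from collections import defaultdict


def tumbling_window_average(readings, window_seconds=60):
    """Average sensor values per device within fixed, non-overlapping time windows.

    `readings` is an iterable of (device_id, epoch_seconds, value) tuples,
    a hypothetical layout chosen for this sketch.
    """
    buckets = defaultdict(list)
    for device_id, epoch_seconds, value in readings:
        # Align each reading to the start of the window it falls into.
        window_start = epoch_seconds - (epoch_seconds % window_seconds)
        buckets[(device_id, window_start)].append(value)
    return {key: sum(values) / len(values) for key, values in buckets.items()}


sample = [("d1", 0, 10.0), ("d1", 30, 20.0), ("d1", 61, 40.0), ("d2", 59, 5.0)]
averages = tumbling_window_average(sample, window_seconds=60)
```

A streaming engine performs the same grouping continuously and also handles late or out-of-order events, which this batch-style sketch does not attempt.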

Key considerations and best practices

Keep the following best practices in mind:

  • Define the business value from IoT sensor data through interactive discovery sessions with various stakeholders within your organization.
  • Identify the type of IoT sensor data you want to collect and analyze for predictive analytics.
  • Choose the right tools for the job, depending upon your business use case and your data consumers. Please refer to step 5 earlier in this post, where different purpose-built data services were used based on user personas.
  • Consider the event-driven architecture as three key components: event producers, event routers, and event consumers. A producer publishes an event to the router, which filters and pushes the events to consumers. Producer and consumer services are decoupled, which allows them to be scaled, updated, and deployed independently.
  • In this architecture, IoT sensors are event producers. AWS IoT Greengrass, AWS IoT Core, Amazon Kinesis Data Streams, and Amazon Kinesis Data Analytics work together as the router from which multiple consumers can consume IoT sensor-generated data. These consumers include Amazon S3 data lakes for telemetry data analysis, Amazon OpenSearch Service for personalized dashboards, and Amazon DynamoDB or AWS AppSync for the downstream enterprise application’s consumption.
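The producer, router, and consumer roles described above can be sketched with an in-memory router in Python. The `EventRouter` class, the event fields, and the 80-degree threshold are hypothetical; in the reference architecture the routing role is played by managed AWS services rather than application code.

```python
class EventRouter:
    """Minimal in-memory router: filters each event and pushes it to matching consumers."""

    def __init__(self):
        self._subscriptions = []

    def subscribe(self, predicate, consumer):
        """Register a consumer callable that receives events matching the predicate."""
        self._subscriptions.append((predicate, consumer))

    def publish(self, event):
        """Deliver an event to every matching consumer and return the delivery count."""
        delivered = 0
        for predicate, consumer in self._subscriptions:
            if predicate(event):
                consumer(event)
                delivered += 1
        return delivered


data_lake, alerts = [], []
router = EventRouter()
router.subscribe(lambda e: True, data_lake.append)                 # archive everything
router.subscribe(lambda e: e["temperature"] > 80, alerts.append)   # only hot readings

# The producer knows nothing about the consumers; it only publishes to the router.
router.publish({"device": "sensor-1", "temperature": 72})
router.publish({"device": "sensor-2", "temperature": 95})
```

Because producers and consumers only share the router, either side can be added, removed, or scaled without changing the other.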

Conclusion

In this post, we demonstrated how to build an event-driven architecture with IoT sensor data using AWS IoT services and AWS purpose-built data services. You can now build your own event-driven applications using this post with your IoT sensor data and integrate with your business applications as needed.

Further reading

Architecting near real-time personalized recommendations with Amazon Personalize

Post Syndicated from Raghavarao Sodabathina original https://aws.amazon.com/blogs/architecture/architecting-near-real-time-personalized-recommendations-with-amazon-personalize/

Delivering personalized customer experiences enables organizations to improve business outcomes such as acquiring and retaining customers, increasing engagement, driving efficiencies, and improving discoverability. Developing an in-house personalization solution can take a lot of time, which increases the time it takes for your business to launch new features and user experiences.

In this post, we show you how to architect near real-time personalized recommendations using Amazon Personalize and AWS purpose-built data services. We also discuss key considerations and best practices while building near real-time personalized recommendations.

Building personalized recommendations with Amazon Personalize

Amazon Personalize makes it easy for developers to build applications capable of delivering a wide array of personalization experiences, including specific product recommendations, personalized product re-ranking, and customized direct marketing.

Amazon Personalize provisions the necessary infrastructure and manages the entire machine learning (ML) pipeline, including processing the data, identifying features, using the most appropriate algorithms, and training, optimizing, and hosting the models. You receive results through an Application Programming Interface (API) and pay only for what you use, with no minimum fees or upfront commitments.

Figure 1 illustrates the comparison of Amazon Personalize with the ML lifecycle.

Machine learning lifecycle vs. Amazon Personalize

Figure 1. Machine learning lifecycle vs. Amazon Personalize

First, provide the user and items data to Amazon Personalize. In general, there are three steps for building near real-time recommendations with Amazon Personalize:

  1. Data preparation: Preparing data is one of the prerequisites for building accurate ML models and analytics, and it is the most time-consuming part of an ML project. There are three types of data you use for modeling on Amazon Personalize:
    • An Interactions data set captures the activity of your users, also known as events. Examples include items your users click on, purchase, or watch. The events you choose to send are dependent on your business domain. This data set has the strongest signal for personalization, and is the only mandatory data set.
    • An Items data set includes details about your items, such as price point, category information, and other essential information from your catalog. This data set is optional, but very useful for scenarios such as recommending new items.
    • A Users data set includes details about the users, such as their location, age, and other details.
  2. Train the model with Amazon Personalize: Amazon Personalize provides recipes, based on common use cases for training models. A recipe is an Amazon Personalize algorithm prepared for a given use case. Refer to Amazon Personalize recipes for more details. The four types of recipes are:
    • USER_PERSONALIZATION: Recommends items for a user from a catalog. This is often included on a landing page.
    • RELATED_ITEM: Suggests items similar to a selected item on a detail page.
    • PERSONALIZED_RANKING: Re-ranks a list of items for a user within a category or within search results.
    • USER_SEGMENTATION: Generates segments of users based on item input data. You can use this to create a targeted marketing campaign for particular products by brand.
  3. Get near real-time recommendations: Once your model is trained, a private personalization model is hosted for you. You can then provide recommendations for your users through a private API.
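As an illustration of the data preparation step, the following Python sketch builds the Avro-style JSON schema for an Interactions data set. The `USER_ID`, `ITEM_ID`, and `TIMESTAMP` fields are the required ones; the `EVENT_TYPE` field and the helper name `interactions_schema` are assumptions added for the example.

```python
import json

# Fields that an Interactions schema must contain.
REQUIRED_FIELDS = [
    {"name": "USER_ID", "type": "string"},
    {"name": "ITEM_ID", "type": "string"},
    {"name": "TIMESTAMP", "type": "long"},
]


def interactions_schema(extra_fields=()):
    """Return the Interactions schema as a JSON string, with optional extra fields."""
    schema = {
        "type": "record",
        "name": "Interactions",
        "namespace": "com.amazonaws.personalize.schema",
        "fields": REQUIRED_FIELDS + list(extra_fields),
        "version": "1.0",
    }
    return json.dumps(schema)


# EVENT_TYPE is an optional field, included here for illustration.
schema_json = interactions_schema([{"name": "EVENT_TYPE", "type": "string"}])
```

With the AWS SDK for Python, a string like this would typically be passed as the `schema` argument of the Amazon Personalize `create_schema` call; that call is omitted here so the sketch runs without AWS credentials.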

Figure 2 illustrates a high-level overview of Amazon Personalize:

Figure 2. Building recommendations with Amazon Personalize

Near real-time personalized recommendations reference architecture

Figure 3 illustrates how to architect near real-time personalized recommendations using Amazon Personalize and AWS purpose-built data services.

Reference architecture for near real-time recommendations

Figure 3. Near real-time recommendations reference architecture

Architecture flow:

  1. Data preparation: Start by creating a dataset group, schemas, and datasets representing your items, interactions, and user data.
  2. Train the model: After importing your data, select the recipe matching your use case, and then create a solution to train a model by creating a solution version.
    Once your solution version is ready, you can create a campaign for your solution version. You can create a campaign for every solution version that you want to use for near real-time recommendations.
    In this example architecture, we’re just showing a single solution version and campaign. If you were building out multiple personalization use cases with different recipes, you could create multiple solution versions and campaigns from the same datasets.
  3. Get near real-time recommendations: Once you have a campaign, you can integrate calls to the campaign in your application. This is where calls to the GetRecommendations or GetPersonalizedRanking APIs are made to request near real-time recommendations from Amazon Personalize.
    • The approach you take to integrate recommendations into your application varies based on your architecture but it typically involves encapsulating recommendations in a microservice or AWS Lambda function that is called by your website or mobile application through a RESTful or GraphQL API interface.
    • Near real-time recommendations support the ability to adapt to each user’s evolving interests. This is done by creating an event tracker in Amazon Personalize.
    • An event tracker provides an endpoint that allows you to stream interactions that occur in your application back to Amazon Personalize in near real-time. You do this by using the PutEvents API.
    • Again, the architectural details on how you integrate PutEvents into your application vary, but it typically involves collecting events using a JavaScript library in your website or a native library in your mobile apps, and making API calls to stream them to your backend. AWS provides the AWS Amplify framework that can be integrated into your web and mobile apps to handle this for you.
    • In this example architecture, you can build an event collection pipeline using Amazon API Gateway, Amazon Kinesis Data Streams, and Lambda to receive and forward interactions to Amazon Personalize.
    • The event tracker performs two primary functions. First, it persists all streamed interactions so they are incorporated into future retraining of your model. Second, it lets Amazon Personalize adjust recommendations in near real time, which is also how Amazon Personalize cold starts new users. When a new user visits your site, Amazon Personalize recommends popular items. After you stream in an event or two, Amazon Personalize immediately starts adjusting recommendations.
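The two runtime calls in this flow, PutEvents and GetRecommendations, can be sketched as request builders in Python. The helper names, the tracking ID, the campaign ARN, and the `click` event type are placeholders. The keyword names follow the request parameters of those two APIs as exposed by the AWS SDK for Python.

```python
from datetime import datetime, timezone


def build_put_events_request(tracking_id, user_id, session_id, item_id,
                             event_type, sent_at=None):
    """Build keyword arguments for a PutEvents call carrying a single interaction."""
    return {
        "trackingId": tracking_id,   # identifies the event tracker
        "userId": user_id,
        "sessionId": session_id,
        "eventList": [{
            "eventType": event_type,
            "itemId": item_id,
            "sentAt": sent_at or datetime.now(timezone.utc),
        }],
    }


def build_get_recommendations_request(campaign_arn, user_id, num_results=10):
    """Build keyword arguments for a GetRecommendations call against a campaign."""
    return {"campaignArn": campaign_arn, "userId": user_id, "numResults": num_results}


# Placeholder identifiers; replace them with values from your own account.
put_request = build_put_events_request(
    "hypothetical-tracking-id", "user-42", "session-1", "item-7", "click",
    sent_at=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
get_request = build_get_recommendations_request(
    "arn:aws:personalize:us-west-2:111122223333:campaign/hypothetical", "user-42", 5
)
```

In a Lambda function these dictionaries would be unpacked into the SDK clients, for example `boto3.client("personalize-events").put_events(**put_request)` and `boto3.client("personalize-runtime").get_recommendations(**get_request)`. Keeping request construction separate from the network call makes the integration easier to unit test.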

Key considerations and best practices

  1. For all use cases, your interactions data must contain at least 1,000 interaction records from users interacting with items in your catalog, and at least 25 unique user IDs with at least two interactions each. These interactions can come from bulk imports, streamed events, or both.
  2. Metadata fields (user or item) can be used for training, filters, or both.
  3. Amazon Personalize supports the encryption of your imported data. You can specify a role allowing Amazon Personalize to use an AWS Key Management Service (AWS KMS) key to decrypt your data, or use the Amazon Simple Storage Service (Amazon S3) AES-256 server-side default encryption.
  4. You can re-train Amazon Personalize deployments based on how much interaction data you generate on a daily basis. A good rule is to re-train your models once every week or two as needed.
  5. You can apply business rules for personalized recommendations using filters. Refer to Filtering recommendations and user segments for more details.
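Before importing data, you can check the minimum requirements from the first consideration with a few lines of Python. This is a minimal sketch that assumes interactions are available as `(user_id, item_id)` pairs; the function name and return shape are illustrative.

```python
from collections import Counter

MIN_INTERACTIONS = 1000
MIN_USERS_WITH_TWO_INTERACTIONS = 25


def check_interactions(interactions):
    """Check (user_id, item_id) pairs against the minimum data requirements."""
    interactions = list(interactions)
    per_user = Counter(user_id for user_id, _item_id in interactions)
    qualifying_users = sum(1 for count in per_user.values() if count >= 2)
    return {
        "total": len(interactions),
        "qualifying_users": qualifying_users,
        "meets_minimum": (
            len(interactions) >= MIN_INTERACTIONS
            and qualifying_users >= MIN_USERS_WITH_TWO_INTERACTIONS
        ),
    }


# Synthetic data: 25 users with 40 interactions each.
synthetic = [(f"user-{u}", f"item-{i}") for u in range(25) for i in range(40)]
report = check_interactions(synthetic)
```

Running a check like this early avoids waiting on a training job that fails because the data set is too small.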

Conclusion

In this post, we showed you how to build near real-time personalized recommendations using Amazon Personalize and AWS purpose-built data services. With the information in this post, you can now build your own personalized recommendations for your applications.

Read more and get started on building personalized recommendations on AWS:

Building SAML federation for Amazon OpenSearch Dashboards with Ping Identity

Post Syndicated from Raghavarao Sodabathina original https://aws.amazon.com/blogs/architecture/building-saml-federation-for-amazon-opensearch-dashboards-with-ping-identity/

Amazon OpenSearch is an open search and log analytics service, powered by the Apache Lucene search library.

In this blog post, we provide step-by-step guidance for SP-initiated SSO by showing how to set up a trial Ping Identity account. We’ll show how to build users and groups within your organization’s directory and enable SSO in OpenSearch Dashboards.

SAML authentication for OpenSearch Dashboards lets you use third-party identity providers to log in, rather than authenticating through Amazon Cognito or the internal user database. To use SAML authentication, you must enable fine-grained access control.

Ping Identity is an AWS Competency Partner, and its PingOne Cloud Platform is a multi-tenant Identity-as-a-Service (IDaaS) platform. Ping Identity supports both service provider (SP)-initiated and identity provider (IdP)-initiated SSO.

Overview of Ping Identity SAML authenticated solution

Figure 1 shows a sample architecture of a generic integrated solution between Ping Identity and OpenSearch Dashboards over SAML authentication.

SAML transactions between Amazon OpenSearch and Ping Identity

Figure 1. SAML transactions between Amazon OpenSearch and Ping Identity

The sign-in flow is as follows:

  1. User opens browser window and navigates to Amazon OpenSearch Dashboards
  2. Amazon OpenSearch generates SAML authentication request
  3. Amazon OpenSearch redirects request back to browser
  4. Browser redirects to Ping Identity URL
  5. Ping Identity parses SAML request, authenticates user, and generates SAML response
  6. Ping Identity returns encoded SAML response to browser
  7. Browser sends SAML response back to Amazon OpenSearch Assertion Consumer Service (ACS) URL
  8. ACS verifies SAML response
  9. User logs into Amazon OpenSearch domain
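Steps 2 through 5 of the sign-in flow rely on passing the SAML request through the browser. The following Python sketch shows how a request is typically packed into a URL under the SAML HTTP-Redirect binding (raw DEFLATE, then Base64, then URL encoding). Whether your domain uses this binding, the IdP URL, and the sample request XML are assumptions for illustration. A real AuthnRequest carries more attributes than this one.

```python
import base64
import zlib
from urllib.parse import parse_qs, urlencode, urlparse


def encode_redirect_request(authn_request_xml, idp_sso_url, relay_state=None):
    """Pack a SAML request into a redirect URL: raw DEFLATE, Base64, URL-encode."""
    compressor = zlib.compressobj(wbits=-15)  # negative wbits means no zlib header
    deflated = compressor.compress(authn_request_xml.encode("utf-8")) + compressor.flush()
    params = {"SAMLRequest": base64.b64encode(deflated).decode("ascii")}
    if relay_state:
        params["RelayState"] = relay_state
    return f"{idp_sso_url}?{urlencode(params)}"


def decode_redirect_request(url):
    """Reverse the encoding, which is useful when inspecting a captured redirect."""
    query = parse_qs(urlparse(url).query)
    deflated = base64.b64decode(query["SAMLRequest"][0])
    return zlib.decompress(deflated, wbits=-15).decode("utf-8")


# Hypothetical request and IdP endpoint.
request_xml = '<samlp:AuthnRequest xmlns:samlp="urn:oasis:names:tc:SAML:2.0:protocol" ID="_example"/>'
redirect_url = encode_redirect_request(request_xml, "https://idp.example.com/sso", relay_state="abc")
```

The decoding helper is handy for troubleshooting: paste a redirect URL captured in the browser and read the request that the service provider generated.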

Prerequisites

For this walkthrough, you should have the following prerequisites:

  1. An AWS account
  2. A virtual private cloud (VPC)-based Amazon OpenSearch domain with fine-grained access control enabled
  3. Ping Identity account with user and a group
  4. A browser with network connectivity to Ping Identity, Amazon OpenSearch domain, and Amazon OpenSearch Dashboards.

The steps in this post are structured into the following sections:

  1. Identity provider (Ping Identity) setup
  2. Prepare Amazon OpenSearch for SAML configuration
  3. Identity provider (Ping Identity) SAML configuration
  4. Finish Amazon OpenSearch for SAML configuration
  5. Validation
  6. Cleanup

Identity provider (Ping Identity) setup

Step 1: Sign up for a Ping Identity account

  • Sign up for a Ping Identity account, then click on the Sign up button to complete your account setup.
  • If you already have an account with Ping Identity, log in to your Ping Identity account.

Step 2: Create Population in Ping Identity

  • Choose Identities in the left menu and click Populations to proceed.
  • Click on the blue + button next to Populations, enter the name as IT, then click on the Save button (see Figure 2).
Creating population in Ping Identity

Figure 2. Creating population in Ping Identity

Step 3: Create a group in Ping Identity

  • Choose Groups from the left menu and click on the blue + button next to Groups. For this example, we will create a group called opensearch for Kibana access. Click on the Save button to complete the group creation.

Step 4: Create users in Ping Identity

  • Choose Users in left menu, then click the + Add User button.
  • Provide GIVEN NAME, FAMILY NAME, and EMAIL ADDRESS, and for Population choose the population you created in Step 2. Choose your own USERNAME. Click on the SAVE button to create your user.
  • Add more users as needed.

Step 5: Assign role and group to users

  • Click on Identities/users in the left menu, and click on Users. Then click on the edit button for a particular user, as shown in Figure 3.
Assigning roles and groups to users in Ping Identity

Figure 3. Assigning roles and groups to users in Ping Identity

  • Click on the Edit button, click on + Add Role button, and click on the edit button to assign a role to the user.
  • For this example, choose Environment Admin, as shown in Figure 4. You can choose different roles depending on your use case.
Assigning roles to users in Ping Identity

Figure 4. Assigning roles to users in Ping Identity

  • For this example, assign administrator responsibilities to our users. Click on Show Environments, and drag Administrators into the ADDED RESPONSIBILITIES section. Then click on the Add Role button.
  • Add Group to users. Go to the Groups tab, search for the opensearch group created in Step 3. Click on the + button next to opensearch to add into group memberships.

Prepare Amazon OpenSearch for SAML configuration

Once the Amazon OpenSearch domain is up and running, we can proceed with configuration.

  • Under Actions, choose Edit security configuration, as shown in Figure 5.
Enabling Amazon OpenSearch security configuration for SAML

Figure 5. Enabling Amazon OpenSearch security configuration for SAML

  • Under SAML authentication for OpenSearch Dashboards/Kibana, select Enable SAML authentication check box (Figure 6). When we enable SAML, it will create different URLs required for configuring SAML with your identity provider.
Amazon OpenSearch URLs for SAML configuration

Figure 6. Amazon OpenSearch URLs for SAML configuration

We will be using the Service Provider entity ID and SP-initiated SSO URL as highlighted in Figure 6 for Ping Identity SAML configuration. We will complete the rest of the Amazon OpenSearch SAML configuration after the Ping Identity SAML configuration.

Ping Identity SAML configuration

Go back to PingIdentity.com, and navigate to Connections on the left menu. Then select Applications, and click on Application +.

  • For this example, we are creating an application called “Kibana”
  • Select WEB APP as APPLICATION TYPE and CHOOSE CONNECTION TYPE as SAML, and click on Configure button to proceed as shown in Figure 7.
Configuring a new application in Ping Identity

Figure 7. Configuring a new application in Ping Identity

  • On the “Create App Profile” page, click on the Next button, and choose the “Manually Enter” option for PROVIDE APP METADATA. Enter the following under Configure SAML Connection section
    • ACS URL: https://vpc-XXXXX-XXXXX.us-west-2.es.amazonaws.com/_dashboards/_opendistro/_security/saml/acs (SP-initiated SSO URL)
    • Choose Sign Assertion & Response under SIGNING KEY
    • ENTITY ID: https://vpc-XXXXX-XXXXX.us-west-2.es.amazonaws.com (Service provider entity ID)
    • ASSERTION VALIDITY DURATION (IN SECONDS) as 3600
    • Choose default options, then click on the Save and Continue button as shown in Figure 8
Configuring SAML connection in Ping Identity

Figure 8. Configuring SAML connection in Ping Identity

  • Enter the following under Configure Attribute Mapping, then click on Save and Close.
    • Set User ID to default
    • Click on +ADD ATTRIBUTE button to add following SAML attributes
      • OUTGOING VALUE: Group Names, SAML ATTRIBUTE: saml_group
      • OUTGOING VALUE: Username, SAML ATTRIBUTE: saml_username
  • Select the Policies tab and click on edit icon on the right.
  • Add the Single_Factor policy to the application, then click on Save.
  • Select the Access tab, add the opensearch group to the application, then click on Save to complete SAML configuration.
  • Finally, go to the Configuration tab, click on the Download Metadata button to download the Ping Identity metadata for the Amazon OpenSearch SAML configuration. Enable opensearch SAML application (Figure 9).
Downloading metadata in Ping Identity

Figure 9. Downloading metadata in Ping Identity
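Before importing the downloaded metadata file, it can help to confirm what it contains. The following Python sketch reads the entity ID and the single sign-on endpoints from a SAML metadata document using only the standard library. The sample XML is a hypothetical, heavily trimmed stand-in for the real Ping Identity file, which also contains signing certificates.

```python
import xml.etree.ElementTree as ET

MD_NS = {"md": "urn:oasis:names:tc:SAML:2.0:metadata"}

# Hypothetical, trimmed metadata; a real file also includes KeyDescriptor elements.
SAMPLE_METADATA = """<md:EntityDescriptor xmlns:md="urn:oasis:names:tc:SAML:2.0:metadata"
    entityID="https://idp.example.com/hypothetical-entity">
  <md:IDPSSODescriptor protocolSupportEnumeration="urn:oasis:names:tc:SAML:2.0:protocol">
    <md:SingleSignOnService
        Binding="urn:oasis:names:tc:SAML:2.0:bindings:HTTP-Redirect"
        Location="https://idp.example.com/sso"/>
  </md:IDPSSODescriptor>
</md:EntityDescriptor>"""


def summarize_idp_metadata(metadata_xml):
    """Return the IdP entity ID and a mapping of SSO binding to endpoint URL."""
    root = ET.fromstring(metadata_xml)
    services = root.findall("md:IDPSSODescriptor/md:SingleSignOnService", MD_NS)
    return {
        "entity_id": root.get("entityID"),
        "sso_urls": {svc.get("Binding"): svc.get("Location") for svc in services},
    }


summary = summarize_idp_metadata(SAMPLE_METADATA)
```

The entity ID printed here should match the IdP entity ID that the OpenSearch console shows after you import the file.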

Amazon OpenSearch SAML configuration

  • Switch back to Amazon OpenSearch domain:
    • Navigate to the Amazon OpenSearch console.
    • Click on Actions, then click on Modify Security configuration.
    • Select the Enable SAML authentication check box.
  • Under Import IdP metadata section:
    • Metadata from IdP: Import the Ping Identity identity provider metadata from the downloaded XML file, shown in Figure 10.
    • SAML master backend role: opensearch (Ping Identity group). Provide SAML backend role/group SAML assertion key for group SSO into Kibana.
Configuring Amazon OpenSearch SAML parameters

Figure 10. Configuring Amazon OpenSearch SAML parameters

  • Under Optional SAML settings:
    • Leave the Subject Key as saml_subject from Ping Identity SAML application attribute name.
    • Role key should be saml_group. You can view a sample assertion during the configuration process with tools like SAML-tracer. This can help you examine and troubleshoot the contents of real assertions.
    • Session time to live (mins): 60
  • Click on the Submit button to complete Amazon OpenSearch SAML configuration for OpenSearch Dashboards. We have successfully completed SAML configuration and are now ready for testing.
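
The console settings above can also be expressed as an API payload. The following is a minimal sketch, assuming you script the change with boto3 rather than the console: it only builds the `SAMLOptions` structure accepted by the OpenSearch Service `update_domain_config` call under `AdvancedSecurityOptions`. The metadata string, the IdP entity ID, and the domain name in the comment are hypothetical placeholders, not values from this walkthrough.

```python
def build_saml_options(metadata_xml, idp_entity_id, backend_role,
                       subject_key="", roles_key="", session_minutes=60):
    """Build the SAMLOptions structure for AdvancedSecurityOptions."""
    options = {
        "Enabled": True,
        "Idp": {"MetadataContent": metadata_xml, "EntityId": idp_entity_id},
        "MasterBackendRole": backend_role,
        "SessionTimeoutMinutes": session_minutes,
    }
    # Optional keys are only sent when set, mirroring "Optional SAML settings".
    if subject_key:
        options["SubjectKey"] = subject_key
    if roles_key:
        options["RolesKey"] = roles_key
    return options


# Values from the steps above; the metadata and entity ID are placeholders.
saml_options = build_saml_options(
    metadata_xml="<EntityDescriptor>placeholder</EntityDescriptor>",
    idp_entity_id="https://idp.example.com/entity",
    backend_role="opensearch",
    subject_key="saml_subject",
    roles_key="saml_group",
    session_minutes=60,
)
print(saml_options)

# With boto3 installed and credentials configured, the payload could be applied as:
#   import boto3
#   boto3.client("opensearch").update_domain_config(
#       DomainName="my-domain",  # hypothetical domain name
#       AdvancedSecurityOptions={"SAMLOptions": saml_options},
#   )
```

Keeping the payload builder separate from the API call makes it easy to review the settings before applying them to a domain.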

Validating Access with Ping Identity Users

  • The OpenSearch Dashboards URL can be found in the Overview tab within “General Information” in the Amazon OpenSearch console (Figure 11). The first access to the OpenSearch Dashboards URL redirects you to the Ping Identity login screen.

Figure 11. Validating Ping Identity users access with Amazon OpenSearch

  • If your OpenSearch domain is hosted within a private VPC, you will not be able to access OpenSearch Dashboards over public internet. But you can still use SAML as long as your browser can communicate with both your OpenSearch cluster and your identity provider.
  • You can create a Mac or Windows EC2 instance within the same VPC and access Amazon OpenSearch Dashboards from an EC2 instance’s web browser to validate your SAML configuration. Or you can access your Amazon OpenSearch Dashboards through Site-to-Site VPN if you are trying to access it from your on-premises environment.
  • Now copy and paste the OpenSearch Dashboards URL in your browser, and enter user credentials.
  • After successful login, you will be redirected into the OpenSearch Dashboards home page. Explore our sample data and visualizations in OpenSearch Dashboards, as shown in Figure 12.

Figure 12. SAML authenticated Amazon OpenSearch Dashboards

  • You have successfully federated Amazon OpenSearch Dashboards with Ping Identity as an identity provider. You can connect OpenSearch Dashboards by using your Ping Identity credentials.

Cleaning up

After you test out this solution, remember to delete all the resources you created to avoid incurring future charges. Refer to these links:

Conclusion

In this blog post, we have demonstrated how to set up Ping Identity as an identity provider over SAML authentication for Amazon OpenSearch Dashboards access. With this solution, you now have an OpenSearch Dashboard that uses Ping Identity as the custom identity provider for your users. This reduces the customer login process to one set of credentials and improves employee productivity.

Get started by checking the Amazon OpenSearch Developer Guide, which provides guidance on how to build applications using Amazon OpenSearch for your operational analytics.

Building SAML federation for Amazon OpenSearch Service with Okta

Post Syndicated from Raghavarao Sodabathina original https://aws.amazon.com/blogs/architecture/building-saml-federation-for-amazon-opensearch-dashboards-with-okta/

Amazon OpenSearch Service is a fully managed open search and analytics service powered by the Apache Lucene search library. Security Assertion Markup Language (SAML)-based federation for OpenSearch Dashboards lets you use your existing identity provider (IdP) like Okta to provide single sign-on (SSO) for OpenSearch Dashboards on OpenSearch Service domains.

This post shows step-by-step guidance to enable SP-initiated single sign-on (SSO) into OpenSearch Dashboards using Okta.

To use this feature, you must enable fine-grained access control. Rather than authenticating through Amazon Cognito or the internal user database, SAML authentication for OpenSearch Dashboards lets you use third-party identity providers to log in to OpenSearch Dashboards. SAML authentication for OpenSearch Dashboards is only for accessing OpenSearch Dashboards through a web browser.

Overview of Okta SAML authenticated solution

Figure 1 depicts a sample architecture of a generic, integrated solution between Okta and OpenSearch Dashboards over SAML authentication.

Figure 1. SAML transactions between Amazon OpenSearch Service and Okta

The initial sign-in flow is as follows:

  1. User opens browser window and navigates to OpenSearch Dashboards
  2. OpenSearch Service generates SAML authentication request
  3. OpenSearch Service redirects request back to browser
  4. Browser redirects to Okta URL
  5. Okta parses SAML request, authenticates user, and generates SAML response
  6. Okta returns encoded SAML response to browser
  7. Browser sends SAML response back to OpenSearch Service Assertion Consumer Services (ACS) URL
  8. ACS verifies SAML response
  9. User logs into OpenSearch Service domain
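
Steps 2 through 4 of this flow can be illustrated with a short sketch. OpenSearch Service generates the real request itself, so this is only an illustration of what a SAML authentication request looks like when it is carried in a redirect URL, assuming the SAML HTTP-Redirect binding (raw DEFLATE, then Base64, then URL encoding). The IdP URL and the function names are hypothetical.

```python
import base64
import urllib.parse
import zlib
from datetime import datetime, timezone
from uuid import uuid4


def build_redirect_url(idp_sso_url, sp_entity_id, acs_url):
    """Encode a minimal AuthnRequest into an IdP redirect URL."""
    issue_instant = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    authn_request = (
        '<samlp:AuthnRequest xmlns:samlp="urn:oasis:names:tc:SAML:2.0:protocol" '
        'xmlns:saml="urn:oasis:names:tc:SAML:2.0:assertion" '
        f'ID="_{uuid4().hex}" Version="2.0" IssueInstant="{issue_instant}" '
        f'Destination="{idp_sso_url}" AssertionConsumerServiceURL="{acs_url}">'
        f"<saml:Issuer>{sp_entity_id}</saml:Issuer>"
        "</samlp:AuthnRequest>"
    )
    # wbits=-15 produces a raw DEFLATE stream without the zlib header.
    compressor = zlib.compressobj(9, zlib.DEFLATED, -15)
    deflated = compressor.compress(authn_request.encode("utf-8")) + compressor.flush()
    encoded = base64.b64encode(deflated).decode("ascii")
    return idp_sso_url + "?" + urllib.parse.urlencode({"SAMLRequest": encoded})


def decode_saml_request(url):
    """Reverse the encoding, as the IdP does before parsing the request."""
    query = urllib.parse.parse_qs(urllib.parse.urlparse(url).query)
    deflated = base64.b64decode(query["SAMLRequest"][0])
    return zlib.decompress(deflated, -15).decode("utf-8")


url = build_redirect_url(
    "https://idp.example.com/sso/saml",  # hypothetical IdP sign-on URL
    "https://vpc-XXXXX-XXXXX.us-west-2.es.amazonaws.com",
    "https://vpc-XXXXX-XXXXX.us-west-2.es.amazonaws.com/_dashboards/_opendistro/_security/saml/acs",
)
print(decode_saml_request(url))
```

Decoding a captured `SAMLRequest` parameter in this way is a quick check that the issuer and ACS URL match the values configured in the IdP.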

Prerequisites

For this walkthrough, you should have the following prerequisites:

  1. An AWS account
  2. A virtual private cloud (VPC)-based OpenSearch Service domain with fine-grained access control enabled
  3. Okta account with user and a group
  4. A browser with network connectivity to Okta, OpenSearch Service domain, and OpenSearch Dashboards.

The steps in this post are structured into the following sections:

  1. Identity provider (Okta) setup
  2. Prepare OpenSearch Service for SAML configuration
  3. Identity provider (Okta) SAML configuration
  4. Finish OpenSearch Service for SAML configuration
  5. Validation
  6. Cleanup

Identity provider (Okta) setup

Step 1: Sign up for an Okta account

  • Sign up for an Okta account, then click on the Sign up button to complete your account setup.
  • If you already have an account with Okta, log in to your Okta account.

Step 2: Create Groups in Okta

  • Choose Directory in the left menu and click Groups to proceed.
  • Click on Add Group and enter the name opensearch. Then click on the Save button (Figure 2).

Figure 2. Creating a group in Okta

Step 3: Create users in Okta

  • Choose People in the left menu under the Directory section and click the +Add Person button.
  • Provide First name, Last name, username (email ID), and primary email. Then select set by admin from the Password dropdown, and choose first time password. Click on the Save button to create your user.
  • Add more users as needed.

Step 4: Assign Groups to users 

  • Choose Groups from the left menu, then click on the opensearch group created in Step 2. Click on the Assign People button to add users to the opensearch group. Next, either click on an individual user under Person & Username, or use the Add All button to add all existing users to the opensearch group. Click on the Save button to finish adding users to your group.

Prepare OpenSearch Service for SAML configuration

Once the OpenSearch Service domain is up and running, we can proceed with configuration.

  • Navigate to the OpenSearch Service console
  • Under Actions, choose Edit security configuration as shown in Figure 3

Figure 3. Enabling Amazon OpenSearch Service security configuration for SAML

  • Under SAML authentication for OpenSearch Dashboards/Kibana, select the Enable SAML authentication check box (Figure 4). Enabling SAML generates the URLs required to configure SAML with your identity provider.

Figure 4. Amazon OpenSearch Service URLs for SAML configuration

We will be using the Service Provider entity ID and SP-initiated SSO URL (highlighted in Figure 4) for Okta SAML configuration. The OpenSearch Dashboards login flow can take one of two forms:

  • Service provider (SP) initiated: You navigate to your OpenSearch Dashboard (for example, https://my-domain.us-east-1.es.amazonaws.com/_dashboards), which redirects you to the login screen. After you log in, the identity provider redirects you to OpenSearch Dashboards.
  • Identity provider (IdP) initiated: You navigate to your identity provider, log in, and choose OpenSearch Dashboards from an application directory.

We will complete the rest of the OpenSearch Service SAML configuration after the Okta SAML configuration.

Okta SAML configuration

  • Go back to Okta.com, and choose Applications from the left menu. Click on Applications, then click on Create App Integration and choose SAML 2.0. Click on the Next button to proceed, as shown in Figure 5.
  • For this example, we are creating an application called “OpenSearch Dashboard”.
  • Select Platform as Web, and select Sign on method as SAML 2.0. Click on the Create button to proceed.

Figure 5. Creating a SAML app integration in Okta

  • Enter the App name as OpenSearch, use default options, and click on the Next button to proceed.
  • Enter the following under the SAML Settings section, as shown in Figure 6. Click on the Next button to proceed.
    • Single Sign on URL = https://vpc-XXXXX-XXXXX.us-west-2.es.amazonaws.com/_dashboards/_opendistro/_security/saml/acs (SP-initiated SSO URL)
    • Audience URI(SP Entity ID) = https://vpc-XXXXX-XXXXX.us-west-2.es.amazonaws.com (Service Provider entity ID)
    • Default RelayState = leave it blank
    • Name ID format = Select EmailAddress from drop down
    • Application username = Select Okta username from dropdown
    • Update application username on = leave it set to default
  • Enter the following under Attribute Statements (optional) section.
    • Name = http://schemas.xmlsoap.org/ws/2005/05/identity/claims/emailaddress
    • Name format = Select URI Reference from dropdown
    • Value = user.email
  • Enter the following under the Group Attribute Statements (optional) section.
    • Name = http://schemas.xmlsoap.org/claims/Group
    • Name format = Select URI Reference from dropdown
    • Filter = Select Matches regex from dropdown and enter value as .*open.* to match the group created in previous steps for OpenSearch Dashboards access.

Figure 6. SAML configuration in Okta
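
The group filter `.*open.*` is broader than it may look. As a rough sketch, assuming the filter must match the whole group name and is case-sensitive (Okta evaluates the expression with its own regex engine, so Python's `re` module is only an approximation here), the following shows which of a user's groups would be sent in the assertion. The group names other than opensearch are hypothetical.

```python
import re

# The filter value entered under Group Attribute Statements.
GROUP_FILTER = re.compile(r".*open.*")


def groups_in_assertion(user_groups):
    """Return the groups whose full name matches the filter."""
    return [name for name in user_groups if GROUP_FILTER.fullmatch(name)]


matched = groups_in_assertion(["opensearch", "finance", "open-data", "admins"])
print(matched)
```

Because any group name containing "open" matches, a tighter pattern such as `opensearch` may be preferable when other groups in the directory share that substring.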

  • Select I’m a software vendor. I’d like to integrate my app with Okta under the Help Okta Support understand how you configured this application section.
  • Click on the Finish button to complete the Okta SAML application configuration.
  • Choose the Sign On menu. Right-click on the Identity Provider metadata hyperlink to download the Okta identity provider metadata as okta.xml. You will use this file for the SAML configuration in OpenSearch Service (Figure 7).

Figure 7. Downloading Okta identity provider metadata for SAML configuration

  • Choose the Assignments menu and click on Assign > Assign to Groups.
  • Select the opensearch group, click on Assign, and click on the Done button to complete the Group assignment, as shown in Figure 8.

Figure 8. Assigning groups to the app in Okta

  • Switch back to the OpenSearch Service domain
  • Under the Import IdP metadata section:
    • Metadata from IdP: Import the Okta identity provider metadata from the downloaded XML file
    • SAML master backend role: opensearch (Okta group). Provide the SAML backend role/group SAML assertion key for group SSO into OpenSearch Dashboards.
  • Under Optional SAML settings:
    • Leave Subject Key blank
    • Role key should be http://schemas.xmlsoap.org/claims/Group. You can view a sample assertion during the configuration process with tools like SAML-tracer. This can help you examine and troubleshoot the contents of real assertions.
    • Session time to live (mins): 60
  • Click on the Save changes button (Figure 9) to complete OpenSearch Service SAML configuration for OpenSearch Dashboards. We have successfully completed SAML configuration, and now we are ready for testing.

Figure 9. Configuring Amazon OpenSearch Service SAML parameters
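
Before importing the downloaded metadata file, it can help to confirm what it contains. The following is a minimal sketch that reads the IdP entity ID and sign-on endpoints from SAML 2.0 metadata using Python's standard library. The sample document is a hypothetical, trimmed stand-in for okta.xml; a real file also carries the signing certificate.

```python
import xml.etree.ElementTree as ET

MD_NS = {"md": "urn:oasis:names:tc:SAML:2.0:metadata"}


def summarize_idp_metadata(metadata_xml):
    """Return the entity ID and the sign-on URL for each binding."""
    root = ET.fromstring(metadata_xml)
    sso_locations = {
        service.get("Binding"): service.get("Location")
        for service in root.findall("md:IDPSSODescriptor/md:SingleSignOnService", MD_NS)
    }
    return {"entity_id": root.get("entityID"), "sso_locations": sso_locations}


# Hypothetical, shortened metadata document for illustration only.
SAMPLE_METADATA = """<md:EntityDescriptor xmlns:md="urn:oasis:names:tc:SAML:2.0:metadata" entityID="http://www.okta.com/EXAMPLE">
  <md:IDPSSODescriptor protocolSupportEnumeration="urn:oasis:names:tc:SAML:2.0:protocol">
    <md:SingleSignOnService Binding="urn:oasis:names:tc:SAML:2.0:bindings:HTTP-Redirect" Location="https://example.okta.com/app/example/sso/saml"/>
  </md:IDPSSODescriptor>
</md:EntityDescriptor>"""

summary = summarize_idp_metadata(SAMPLE_METADATA)
print(summary["entity_id"])
print(summary["sso_locations"])
```

The entity ID printed here is the value that the OpenSearch Service console shows as the IdP entity ID after the metadata is imported.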

Validating access with Okta users

  • Access the OpenSearch Dashboards endpoint from the previously created OpenSearch Service cluster. The OpenSearch Dashboards URL can be found in General information within “My Domains” of the OpenSearch Service console, as shown in Figure 10. The first access to OpenSearch Dashboards URL redirects you to the Okta login screen.

Figure 10. Validating Okta user access with Amazon OpenSearch Service

  • Now copy and paste the OpenSearch Dashboards URL in your browser, and enter the user credentials.
  • If your OpenSearch Service domain is hosted within a private VPC, you will not be able to access OpenSearch Dashboards over the public internet. But you can still use SAML as long as your browser can communicate with both your OpenSearch Service cluster and your identity provider.
  • You can create a Mac or Windows EC2 instance within the same VPC and access OpenSearch Dashboards from the EC2 instance’s web browser to validate your SAML configuration. Or you can access OpenSearch Dashboards through Site-to-Site VPN from your on-premises environment.
  • After successful login, you will be redirected into the OpenSearch Dashboards home page. Here, you can explore our sample data and visualizations in OpenSearch Dashboards (Figure 11).

Figure 11. SAML authenticated OpenSearch Dashboards

  • Now, you have successfully federated OpenSearch Dashboards with Okta as an identity provider. You can connect OpenSearch Dashboards by using your Okta credentials.

Cleaning up

After you test out this solution, remember to delete all the resources you created, to avoid incurring future charges. Refer to these links:

Conclusion

In this blog post, we have demonstrated how to set up Okta as an identity provider over SAML authentication for OpenSearch Dashboards access. Get started by checking the Amazon OpenSearch Service Developer Guide, which provides guidance on how to build applications using OpenSearch Service.

Building SAML federation for Amazon OpenSearch Dashboards with Auth0

Post Syndicated from Raghavarao Sodabathina original https://aws.amazon.com/blogs/architecture/building-saml-federation-for-amazon-opensearch-dashboards-with-auth0/

Amazon OpenSearch is a fully managed, distributed, open search and analytics service powered by the Apache Lucene search library. OpenSearch is derived from Elasticsearch 7.10.2, and is used for real-time application monitoring, log analytics, and website search. It’s ideal for use cases that require fast access and response for large volumes of data. OpenSearch Dashboards is derived from Kibana 7.10.2, and is used for visual data exploration. Security Assertion Markup Language (SAML)-based federation for OpenSearch Dashboards lets you use your existing identity provider (IdP), such as Auth0, to provide single sign-on (SSO) for OpenSearch Dashboards on Amazon OpenSearch domains. It also gives you fine-grained access control, and the ability to search your data and build visualizations. Amazon OpenSearch supports providers that use the SAML 2.0 standard, such as Auth0, Okta, Keycloak, Active Directory Federation Services (AD FS), and Ping Identity (PingID).

In this post, we provide step-by-step guidance to show you how to set up a trial Auth0 account. We’ll demonstrate how to build users and groups within your organization’s directory, and enable SP-initiated single sign-on (SSO) into OpenSearch Dashboards.

To use this feature, you must enable fine-grained access control. Rather than authenticating through Amazon Cognito or an internal user database, SAML authentication for OpenSearch Dashboards lets you use third-party identity providers to log in to the OpenSearch Dashboards. SAML authentication for OpenSearch Dashboards is only for accessing the OpenSearch Dashboards through a web browser. Your SAML credentials do not let you make direct HTTP requests to OpenSearch or OpenSearch Dashboards APIs.

Auth0 is an AWS Competency Partner and popular Identity-as-a-Service (IDaaS) solution. It supports both service provider (SP)-initiated and identity provider (IdP)-initiated SSO. For SP-initiated SSO, when you sign into the OpenSearch Dashboards login page it sends an authorization request to Auth0. Once it authenticates your identity, you are redirected to OpenSearch Dashboards. In IdP-initiated SSO, you log in to the Auth0 SSO page, and choose OpenSearch Dashboards to open the application.

Overview of Auth0 SAML authenticated solution

Figure 1 depicts a sample architecture of a generic, integrated solution between Auth0 and OpenSearch Dashboards over SAML authentication.

Figure 1. A high-level view of a SAML transaction between Amazon OpenSearch and Auth0

The sign-in flow is as follows:

  1. User opens browser window and navigates to Amazon OpenSearch Dashboards
  2. Amazon OpenSearch generates SAML authentication request
  3. Amazon OpenSearch redirects request back to browser
  4. Browser redirects to Auth0 URL
  5. Auth0 parses SAML request, authenticates user, and generates SAML response
  6. Auth0 returns encoded SAML response to browser
  7. Browser sends SAML response back to Amazon OpenSearch Assertion Consumer Service (ACS) URL
  8. ACS verifies SAML response
  9. User logs into Amazon OpenSearch domain
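
Steps 6 and 7 of this flow can be inspected from the browser side. The sketch below, which assumes the HTTP-POST binding (the `SAMLResponse` form field is Base64-encoded XML), decodes a response and reads its destination and status. It does not validate the signature, which is what the ACS does in step 8. The sample response is hypothetical and omits the assertion.

```python
import base64
import xml.etree.ElementTree as ET

SAMLP = "urn:oasis:names:tc:SAML:2.0:protocol"


def inspect_saml_response(form_value):
    """Decode a Base64 SAMLResponse value and summarize it (no signature check)."""
    root = ET.fromstring(base64.b64decode(form_value))
    status = root.find(f"{{{SAMLP}}}Status/{{{SAMLP}}}StatusCode")
    return {
        "destination": root.get("Destination"),
        "status": status.get("Value") if status is not None else None,
    }


# Hypothetical response, trimmed to the fields read above.
SAMPLE_RESPONSE = (
    '<samlp:Response xmlns:samlp="urn:oasis:names:tc:SAML:2.0:protocol" '
    'Destination="https://vpc-XXXXX-XXXXX.us-east-1.es.amazonaws.com/_dashboards/_opendistro/_security/saml/acs">'
    '<samlp:Status><samlp:StatusCode Value="urn:oasis:names:tc:SAML:2.0:status:Success"/></samlp:Status>'
    "</samlp:Response>"
)
posted_value = base64.b64encode(SAMPLE_RESPONSE.encode("utf-8")).decode("ascii")

response_summary = inspect_saml_response(posted_value)
print(response_summary)
```

If the destination in a real response does not match the ACS URL shown by the OpenSearch console, the sign-in is likely to fail at the verification step.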

Prerequisites

For this walkthrough, you should have the following prerequisites:

  1. An AWS account
  2. A virtual private cloud (VPC) based Amazon OpenSearch domain with fine-grained access control enabled
  3. An Auth0 account with user and a group
  4. A browser with network connectivity to Auth0, Amazon OpenSearch domain, and Amazon OpenSearch Dashboards.

The steps in this post are structured into the following sections:

  1. Identity provider (Auth0) setup
  2. Prepare Amazon OpenSearch for SAML configuration
  3. Identity provider (Auth0) SAML configuration
  4. Finish Amazon OpenSearch for SAML configuration
  5. Validation
  6. Cleanup

Identity provider (Auth0) setup

Step 1: Sign up for an Auth0 account

  • Sign up for an Auth0 account, then click on the Sign up button to complete your account setup.
  • If you already have an account with Auth0, log in to your Auth0 account.

Step 2: Create users in Auth0

  • Choose User Management in the left menu and click Users, then click on the +Create User button.
  • Provide an email, password, and connection for your user. Click on the Create button to create your user.
  • Add more users to your Auth0 account.

Step 3: Install Auth0 Extension to create a group and assign users to the group

  • Click on Extensions in the left menu and search for “Auth0 Authorization”. Click on Auth0 Authorization to install the extension, shown in Figure 2.

Figure 2. Installing Auth0 Authorization extension

  • Use all default options and click on the Install button to install the extension.
  • Click on the Auth0 Authorization extension and choose the Accept button to provide access to your Auth0 account.
  • The Auth0 Authorization extension must be configured. Click on Go to Configuration (Figure 3).

Figure 3. Configuring the Auth0 Authorization extension

  • Rotate your API keys and check Groups, Roles, and Permissions to provide authorization to the extension. Then click on PUBLISH RULE to complete the configuration (Figure 4).

Figure 4. Providing the permissions to Auth0 Authorization extension

Step 4: Create a group in Auth0

  • Choose Groups from the left menu and click on the Create your first Group button. For this example, we will create a group called opensearch for OpenSearch Dashboards access.
  • Add your users to opensearch by clicking on the ADD MEMBERS button, then click on the CONFIRM button to complete your group assignment (Figure 5).

Figure 5. Adding users to Auth0 Group

Step 5: Create an Auth0 Application

  • Choose Applications from the left menu. Click on the +Create Application button.
  • For this example, we are creating an application called “opensearch”.
  • Select Single Page Web Applications, then click on the CREATE button to proceed.
  • Click on the Addons tab on the opensearch application (Figure 6).

Figure 6. Creating an Auth0 SAML application

  • Click on the SAML2 WEB APP, then select settings to provide SAML URLs from Amazon OpenSearch. We will configure these details after preparing the Amazon OpenSearch cluster for SAML.

Prepare Amazon OpenSearch for SAML configuration

Once the Amazon OpenSearch domain is up and running, we can proceed with configuration.

  • Under Actions, choose Edit security configuration (Figure 7).

Figure 7. Enabling Amazon OpenSearch security configuration for SAML

  • Under SAML authentication for OpenSearch Dashboards/Kibana, select the Enable SAML authentication check box (Figure 8). Enabling SAML generates the URLs required to configure SAML with your identity provider.

Figure 8. Amazon OpenSearch URLs for SAML configuration

We will be using the Service Provider entity ID and SP-initiated SSO URL (highlighted in Figure 8) for Auth0 SAML configuration. We will complete the rest of the Amazon OpenSearch SAML configuration after the Auth0 SAML configuration.

Auth0 SAML configuration

Go back to Auth0.com, and navigate to Applications from the left menu. Then select the opensearch application that you created as a part of the Auth0 setup.

  • Click on the Addons tab on the application opensearch.
  • Click on the SAML2 WEB APP, then select Settings to provide SAML URLs from Amazon OpenSearch, as shown in Figure 9:
    • Application Callback URL = https://vpc-XXXXX-XXXXX.us-east-1.es.amazonaws.com/_dashboards/_opendistro/_security/saml/acs (SP-initiated SSO URL)
    • "audience": "https://vpc-XXXXX-XXXXX.us-east-1.es.amazonaws.com" (Service provider entity ID)
    • "destination": "https://vpc-XXXXX-XXXXX.us-east-1.es.amazonaws.com/_plugin/kibana/_opendistro/_security/saml/acs" (SP-initiated SSO URL)
    • Mappings and other configurations shown in Figure 9

{
  "audience": "https://vpc-XXXXX-XXXXX.us-east-1.es.amazonaws.com",
  "destination": "https://vpc-XXXXX-XXXXX.us-east-1.es.amazonaws.com/_plugin/kibana/_opendistro/_security/saml/acs",
  "mappings": {
    "email": "http://schemas.xmlsoap.org/ws/2005/05/identity/claims/emailaddress",
    "name": "http://schemas.xmlsoap.org/ws/2005/05/identity/claims/name",
    "groups": "http://schemas.xmlsoap.org/claims/Group"
  },
  "createUpnClaim": false,
  "passthroughClaimsWithNoMapping": false,
  "mapUnknownClaimsAsIs": false,
  "mapIdentities": false,
  "nameIdentifierFormat": "urn:oasis:names:tc:SAML:1.1:nameid-format:emailAddress",
  "nameIdentifierProbes": [
    "http://schemas.xmlsoap.org/ws/2005/05/identity/claims/emailaddress"
  ]
}

Figure 9. Configuring Auth0 SAML parameters

  • Click on Enable to save the SAML configurations.
  • Go to the Usage tab, and click on the Download button to download the Identity Provider Metadata (Figure 10).

Figure 10. Downloading Auth0 identity provider metadata for SAML configuration
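
The SAML2 Web App settings in this section can also be generated from the domain endpoint, which avoids copy-and-paste mismatches between the audience and destination values. This is a minimal sketch, not an Auth0 tool: the function name is hypothetical, and the ACS path is passed in as a parameter so that it can be copied from the SP-initiated SSO URL that the OpenSearch console displays for your domain.

```python
import json

CLAIMS = "http://schemas.xmlsoap.org/ws/2005/05/identity/claims/"


def build_auth0_saml_settings(domain_endpoint, acs_path):
    """Build the settings JSON for the Auth0 SAML2 Web App addon."""
    return {
        "audience": domain_endpoint,
        "destination": domain_endpoint + acs_path,
        "mappings": {
            "email": CLAIMS + "emailaddress",
            "name": CLAIMS + "name",
            "groups": "http://schemas.xmlsoap.org/claims/Group",
        },
        "createUpnClaim": False,
        "passthroughClaimsWithNoMapping": False,
        "mapUnknownClaimsAsIs": False,
        "mapIdentities": False,
        "nameIdentifierFormat": "urn:oasis:names:tc:SAML:1.1:nameid-format:emailAddress",
        "nameIdentifierProbes": [CLAIMS + "emailaddress"],
    }


# The path here is the one used for the Application Callback URL above;
# use whatever path your console shows for the SP-initiated SSO URL.
settings = build_auth0_saml_settings(
    "https://vpc-XXXXX-XXXXX.us-east-1.es.amazonaws.com",
    "/_dashboards/_opendistro/_security/saml/acs",
)
print(json.dumps(settings, indent=2))
```

The printed JSON can be pasted into the Settings box of the SAML2 Web App addon.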

Amazon OpenSearch SAML configuration

  • Switch back to Amazon OpenSearch domain:
    • Navigate to Amazon OpenSearch console
    • Click on Actions, then click on Modify Security configuration
    • Select Enable SAML authentication check box
  • Under Import IdP metadata section (Figure 11):
    • Metadata from IdP: Import the Auth0 identity provider metadata from downloaded XML file
    • SAML master backend role: opensearch (Auth0 group). Provide a SAML backend role/group SAML assertion key for group SSO into OpenSearch Dashboards

Figure 11. Configuring Amazon OpenSearch SAML parameters

  • Under Optional SAML settings (Figure 12):
    • Leave Subject Key as blank, as Auth0 provides NameIdentifier
    • Role key should be http://schemas.xmlsoap.org/claims/Group. Auth0 lets you view a sample assertion during the configuration process by clicking on the DEBUG button on SAML2 WebApp. Tools like SAML-tracer can help you examine and troubleshoot the contents of real assertions.
    • Session time to live (mins): 60

Figure 12. Configuring Amazon OpenSearch optional SAML parameters
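
The role key tells OpenSearch which assertion attribute holds the user's groups. As an illustration of that lookup, the following sketch reads the attribute named by the role key from a SAML assertion and checks it against the master backend role. The sample assertion is hypothetical and unsigned; a real one, as captured with SAML-tracer, carries more elements.

```python
import xml.etree.ElementTree as ET

SAML_NS = {"saml": "urn:oasis:names:tc:SAML:2.0:assertion"}
ROLE_KEY = "http://schemas.xmlsoap.org/claims/Group"


def roles_from_assertion(assertion_xml, role_key=ROLE_KEY):
    """Collect the values of the attribute whose Name equals the role key."""
    root = ET.fromstring(assertion_xml)
    roles = []
    for attribute in root.iter("{urn:oasis:names:tc:SAML:2.0:assertion}Attribute"):
        if attribute.get("Name") == role_key:
            roles.extend(v.text for v in attribute.findall("saml:AttributeValue", SAML_NS))
    return roles


# Hypothetical assertion fragment for illustration only.
SAMPLE_ASSERTION = """<saml:Assertion xmlns:saml="urn:oasis:names:tc:SAML:2.0:assertion">
  <saml:AttributeStatement>
    <saml:Attribute Name="http://schemas.xmlsoap.org/ws/2005/05/identity/claims/emailaddress">
      <saml:AttributeValue>user@example.com</saml:AttributeValue>
    </saml:Attribute>
    <saml:Attribute Name="http://schemas.xmlsoap.org/claims/Group">
      <saml:AttributeValue>opensearch</saml:AttributeValue>
      <saml:AttributeValue>analysts</saml:AttributeValue>
    </saml:Attribute>
  </saml:AttributeStatement>
</saml:Assertion>"""

roles = roles_from_assertion(SAMPLE_ASSERTION)
print(roles, "opensearch" in roles)
```

If the list comes back empty for a real assertion, the role key in the OpenSearch configuration probably does not match the attribute name that the IdP sends.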

Click on the Save changes button to complete Amazon OpenSearch SAML configuration for OpenSearch Dashboards. We have successfully completed SAML configuration and are now ready for testing.

Validating access with Auth0 users

  • Access OpenSearch Dashboards from the previously created OpenSearch cluster. The OpenSearch Dashboards URL can be found as shown in Figure 13. The first access to the OpenSearch Dashboards URL redirects you to the Auth0 login screen.

Figure 13. Validating Auth0 users access with Amazon OpenSearch

  • Now copy and paste the OpenSearch Dashboards URL in your browser, and enter the user credentials.
  • If your OpenSearch domain is hosted within a private VPC, you will not be able to access OpenSearch Dashboards over the public internet. But you can still use SAML as long as your browser can communicate with both your OpenSearch cluster and your identity provider.
  • You can create a Mac or Windows EC2 instance within the same VPC. This way you can access Amazon OpenSearch Dashboards from your EC2 instance’s web browser to validate your SAML configuration. You can also access Amazon OpenSearch Dashboards through Site-to-Site VPN from an on-premises environment.
  • After successful login, you will be redirected into the OpenSearch Dashboards home page. Explore our sample data and visualizations in OpenSearch Dashboards, as shown in Figure 14.

Figure 14. SAML authenticated Amazon OpenSearch Dashboards

  • You now have successfully federated Amazon OpenSearch Dashboards with Auth0 as an identity provider. You can connect OpenSearch Dashboards by using your Auth0 credentials.

Cleaning up

After you test out this solution, remember to delete all the resources you created to avoid incurring future charges. Refer to these links:

Conclusion

In this blog post, we have demonstrated how to set up Auth0 as an identity provider over SAML authentication for Amazon OpenSearch Dashboards access. With this solution, you now have an OpenSearch Dashboard that uses Auth0 as the custom identity provider for your users. This reduces the customer login process to one set of credentials and improves employee productivity.

Get started by checking the Amazon OpenSearch Developer Guide, which provides guidance on how to build applications using Amazon OpenSearch for your operational analytics.

How Parametric Built Audit Surveillance using AWS Data Lake Architecture

Post Syndicated from Raghavarao Sodabathina original https://aws.amazon.com/blogs/architecture/how-parametric-built-audit-surveillance-using-aws-data-lake-architecture/

Parametric Portfolio Associates (Parametric), a wholly owned subsidiary of Morgan Stanley, is a registered investment adviser. Parametric provides investment advisory services to individual and institutional investors around the world. Parametric manages over 100,000 client portfolios with assets under management exceeding $400B (as of 9/30/21).

As a registered investment adviser, Parametric is subject to numerous regulatory requirements. The Parametric Compliance team conducts regular reviews on the firm’s portfolio management activities. To accomplish this, the organization needs both active and archived audit data to be readily available.

Parametric’s on-premises data lake solution was based on an MS-SQL server. They used an Apache Hadoop platform for their data storage, data management, and analytics. Significant gaps existed with the on-premises solution, which complicated audit processes. They were spending a large amount of effort on system maintenance, operational management, and software version upgrades. This required expensive consulting services and made it difficult to keep maintenance windows up to date. It also limited their agility and their ability to derive more insights and value from their data. In an environment of rapid growth, adoption of more sophisticated analytics tools and processes was slow to evolve.

In this blog post, we will show how Parametric implemented their Audit Surveillance Data Lake on AWS with purpose-built, fully managed analytics services. With this solution, Parametric was able to respond to various audit requests within hours rather than days or weeks. The result is a system with a cost savings of 5x, assuming no data growth. Additionally, the new system can seamlessly support 10x data growth.

Audit surveillance platform

The Parametric data management office (DMO) was previously running their data workloads using an on-premises data lake, which ran on the Hortonworks data platform of Apache Hadoop. This platform wasn’t up to date, and Parametric’s hardware was reaching end-of-life. Parametric was faced with a decision to either reinvest in their on-premises infrastructure or modernize their infrastructure using a modern data analytics platform on AWS. After doing a detailed cost/benefit analysis, the DMO calculated a 5x cost savings by using AWS. They decided to move forward and modernize with AWS due to these cost benefits, in addition to elasticity and security features.

The Parametric compliance team asked the DMO to provide an enterprise data service to consume data from a data lake. This data was destined for downstream applications and ad-hoc data querying capabilities. It was accessed via standard JDBC tools and user-friendly business intelligence dashboards. The goal was to ensure that seven years of audit data would be readily available.

The DMO team worked with AWS to conceptualize an audit surveillance data platform architecture and help accelerate the implementation. They attended a series of AWS Immersion Days focusing on AWS fundamentals, data lakes, DevOps, Amazon EMR, and serverless architectures. They later took part in a four-day AWS Data Lab with AWS SMEs to create a data lake. The first use case in this Lab was creating the Audit Surveillance system on AWS.

Audit surveillance architecture on AWS

The following diagram shows the Audit Surveillance data lake architecture on AWS by using AWS purpose-built analytics services.

Figure 1. Audit Surveillance data lake architecture diagram

Architecture flow

  1. User personas: As a first step, the DMO team identified three user personas for the Audit Surveillance system on AWS.
    • Data service compliance users who would like to consume audit surveillance data from the data lake into their respective applications through an enterprise data service.
    • Business users who would like to create business intelligence dashboards using a BI tool to audit data for compliance needs.
    • Compliance IT users who would like to perform ad-hoc queries on the data lake to perform analytics using an interactive query tool.
  2. Data ingestion: Data is ingested into Amazon Simple Storage Service (Amazon S3) from different on-premises data sources by using AWS Lake Formation blueprints. AWS Lake Formation provides workflows that define the data source and schedule to import data into the data lake. A workflow is a container for AWS Glue crawlers, jobs, and triggers that are used to orchestrate the process to load and update the data lake.
  3. Data storage: Parametric used Amazon S3 as the data store for the Audit Surveillance data lake, because it is designed for 11 nines of durability and 99.99% availability. The existing Hadoop storage was replaced with Amazon S3. The DMO team created a drop zone (raw), an analytics zone (transformed), and curated (enriched) storage layers for their data lake on AWS.
  4. Data cataloging: AWS Glue Data Catalog was the central catalog used to store and manage metadata for all datasets hosted in the Audit Surveillance data lake. The existing Hadoop metadata store was replaced with AWS Glue Data Catalog. AWS services such as AWS Glue, Amazon EMR, and Amazon Athena, natively integrate with AWS Glue Data Catalog.
  5. Data processing: Amazon EMR and AWS Glue process the raw data and place it into the analytics zone (transformed) and curated zone (enriched) S3 buckets. Amazon EMR was used for big data processing and AWS Glue for standard ETL processes. AWS Lambda and AWS Step Functions were used to initiate monitoring and ETL processes.
  6. Data consumption: After Audit Surveillance data was transformed and enriched, the data was consumed by various personas within the firm as follows:
    • AWS Lambda and Amazon API Gateway were used to support consumption for data service compliance users.
    • Amazon QuickSight was used to create business intelligence dashboards for compliance business users.
    • Amazon Athena was used to query transformed and enriched data for compliance IT users.
  7. Security: AWS Key Management Service (KMS) customer managed keys were used for encryption at rest, and TLS for encryption in transit. Access to the encryption keys is controlled using AWS Identity and Access Management (IAM) and is monitored through detailed audit trails in AWS CloudTrail. Amazon CloudWatch was used for monitoring, and thresholds were created to determine when to send alerts.
  8. Governance: AWS IAM roles were attached to compliance users, and the administrator used these roles to grant access. Access was given only to approved users or programs that had been authenticated and authorized through AWS SSO. Access is logged, and permissions can be granted or denied by the administrator. AWS Lake Formation is used for fine-grained access control, granting or revoking permissions at the database, table, or column level.
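
The column-level control described in the governance step can be sketched as a request to the Lake Formation GrantPermissions API. This is a minimal illustration, not Parametric's actual configuration: the role ARN, database, table, and column names are hypothetical, and the function only builds the request so it can be inspected before anything is sent.

```python
import json


def build_column_grant(principal_arn, database, table, columns):
    """Build keyword arguments for a Lake Formation column-level SELECT grant."""
    if not columns:
        raise ValueError("at least one column is required for a column-level grant")
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


# Hypothetical role, database, table, and column names.
grant = build_column_grant(
    "arn:aws:iam::111122223333:role/ComplianceAnalyst",
    "audit_curated",
    "trade_events",
    ["trade_id", "trade_date", "status"],
)
print(json.dumps(grant, indent=2))

# With AWS credentials configured, the request could be sent with boto3:
#   import boto3
#   boto3.client("lakeformation").grant_permissions(**grant)
```

Keeping the request as plain data makes it easy to review or store alongside other access definitions before it is applied.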

Conclusion

The Parametric DMO team successfully replaced their on-premises Audit Surveillance Data Lake. They now have a modern, flexible, highly available, and scalable data platform on AWS, with purpose-built analytics services.

This change resulted in a 5x cost savings and provides room for 10x data growth. Internal and external audit requests now get fast responses (hours rather than days or weeks). This migration has given the company access to a wider breadth of AWS analytics services, which offers greater flexibility and options.

Maintaining the on-premises data lake would have required significant investment in hardware upgrades, annual licensing, and vendor consulting fees for upgrades. Parametric’s decision to migrate their on-premises data lake has yielded proven cost benefits. It has also introduced new functions, services, and capabilities that were previously unavailable to the Parametric DMO.

You may also achieve similar efficiencies and increase scalability by migrating on-premises data platforms into AWS. Read more and get started on building Data Lakes on AWS.

Amazon SES configuration for an external SMTP provider with Auth0

Post Syndicated from Raghavarao Sodabathina original https://aws.amazon.com/blogs/messaging-and-targeting/amazon-ses-configuration-for-an-external-smtp-provider-with-auth0/

Many organizations are using an external identity provider to manage user identities. With an identity provider (IdP), customers can manage their user identities outside of AWS and give these external user identities permissions to use AWS resources in customer AWS accounts. The most common requirement when setting up an external identity provider is sending outgoing emails, such as verification e-mails using a link or code, welcome e-mails, MFA enrollment, password changes and blocked account e-mails. This said, most external identity providers’ existing e-mail infrastructure is limited to testing e-mails only and customers need to set up an external SMTP provider for outgoing e-mails.

Managing and running e-mail servers on-premises or deploying an EC2 instance dedicated to running an SMTP server is costly and complex. Customers have to manage operational issues such as hardware, software installation, configuration, patching, and backups.

In this blog post, we will provide step-by-step guidance showing how you can set up Amazon SES as an external SMTP provider with Auth0 to take advantage of Amazon SES capabilities like sending email securely, globally, and at scale.

Amazon Simple Email Service (SES) is a cost-effective, flexible, and scalable email service that enables developers to send email from within any application. You can configure Amazon SES quickly to support several email use cases, including transactional, marketing, or mass email communications.

Auth0 is an identity provider that provides a flexible, drop-in solution to add authentication and authorization services (Identity as a Service, or IDaaS) to customer applications. Auth0’s built-in email infrastructure should be used for testing emails only. Auth0 allows you to configure your own SMTP email provider so you can more completely manage, monitor, and troubleshoot your email communications.

Overview of solution

In this blog post, we’ll show you how to perform the following steps to complete the integration between Amazon SES and Auth0:

  • Amazon SES setup for sending emails with SMTP credentials and API credentials
  • Auth0 setup to configure Amazon SES as an external SMTP provider
  • Testing the configuration

The following diagram shows the architecture of the solution.

Prerequisites

Amazon SES Setup

As a first step, you must configure a “Sandbox” account within Amazon SES and verify a sender email address for initial testing. Once all the setup steps are successful, you can move this account into production, after which Amazon SES will accept all emails. For more details on this topic, see the Amazon SES documentation.

1. Log in to the Amazon SES console and choose the Verify a New Email Address button.

2. Once the verification is completed, the Verification Status column will change to green.

3. You need to create SMTP credentials, which Auth0 will use for sending emails. To create the credentials, choose SMTP settings from the left menu and then choose the Create My SMTP Credentials button.

Please note down the Server Name as it will be required during Auth0 setup.

4. Enter a meaningful username like autho-ses-user and choose the Create button at the bottom right of the page.

5. The SMTP user name and password are shown on the screen. You can also download the SMTP credentials as a CSV file, as shown below.

Please note down the SMTP user name and SMTP password, as they will be required during Auth0 setup.

6. You need the Access key ID and Secret access key of the SES IAM user autho-ses-user created in step 4 to configure Amazon SES with API credentials in Auth0.

  • Navigate to the AWS IAM console and choose Users in the left menu.
  • Choose the autho-ses-user IAM user, and then choose Security credentials.

  • Choose the Create access key button to create a new Access key ID and Secret access key. They are shown on the screen, and you can also download them as a CSV file, as shown below.

Please note down the Access key ID and Secret access key, as they will be required during Auth0 setup.
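
For the API credentials to work, the IAM user must be allowed to send email through Amazon SES. The following is a minimal sketch of such an identity policy, built as a Python dictionary and printed as JSON. The choice of actions (ses:SendEmail and ses:SendRawEmail) and the wildcard resource are assumptions for illustration; check the policy that Amazon SES attached to your user and tighten the resource to your verified identity where you can.

```python
import json


def build_ses_send_policy(resource="*"):
    """Return an IAM policy document that allows sending email through SES."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Assumed set of actions; adjust to what your integration calls.
                "Action": ["ses:SendEmail", "ses:SendRawEmail"],
                "Resource": resource,
            }
        ],
    }


policy_json = json.dumps(build_ses_send_policy(), indent=2)
print(policy_json)
```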

Auth0 Setup

To ensure that emails can be sent from Auth0 to your Amazon SES SMTP, you need to configure Amazon SES details into Auth0. There are two ways you can use Amazon SES credentials with Auth0, one with SMTP and the other with API credentials.

1. Navigate to the Auth0 Dashboard, select Branding, and then select Email Provider from the left menu. Enable the Use my own email provider toggle as shown below.

2. Let us start with Auth0 configuration with Amazon SES SMTP credentials.

  • Choose the SMTP Provider option as shown below.

  • Provide the following SMTP Provider settings, and then choose Save to complete the setup.
    • From: Your from email address.
    • Host: Your Amazon SES server name as noted in step 3 of Amazon SES setup. For this example, it is email-smtp.us-west-1.amazonaws.com
    • Port: 465
    • User Name: Your Amazon SES SMTP user name as created in step 4 of Amazon SES setup.
    • Password: Your Amazon SES SMTP password as shown in step 5 of Amazon SES setup.

  • Choose the Send test email button to test the Auth0 configuration with Amazon SES SMTP credentials.
  • You can look at the Auth0 logs to validate your test as shown below.

  • If you have configured it successfully, you should receive an email from Auth0 as shown below.
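
If the test email does not arrive, it can help to check the SMTP settings outside of Auth0. The sketch below uses Python's standard smtplib with the host and port from this example. The sender and recipient addresses and the credential placeholders are hypothetical. While the account is in the sandbox, both addresses must be verified in Amazon SES.

```python
import smtplib
import ssl
from email.message import EmailMessage

SES_SMTP_HOST = "email-smtp.us-west-1.amazonaws.com"  # server name from this example
SES_SMTP_PORT = 465  # TLS is negotiated as soon as the connection opens


def build_test_message(sender, recipient):
    """Create a plain-text test message."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Amazon SES SMTP test"
    msg.set_content("If you can read this, the SMTP credentials work.")
    return msg


def send_via_ses(msg, smtp_username, smtp_password):
    """Send the message through the SES SMTP endpoint over TLS."""
    context = ssl.create_default_context()
    with smtplib.SMTP_SSL(SES_SMTP_HOST, SES_SMTP_PORT, context=context) as server:
        server.login(smtp_username, smtp_password)
        server.send_message(msg)


# Hypothetical addresses; replace with your verified identities.
message = build_test_message("sender@example.com", "recipient@example.com")
print(message["Subject"])

# Uncomment and supply the SMTP credentials from the Amazon SES setup:
# send_via_ses(message, "SMTP_USERNAME", "SMTP_PASSWORD")
```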

3. Now, complete Auth0 configuration with Amazon SES API credentials.

  • Choose Amazon SES as shown below.

  • Provide the following Amazon SES settings, and then choose Save to complete the setup.
    • From: Your from email address.
    • Key Id: Your autho-ses-user IAM user’s Access key ID as created in step 6 of Amazon SES setup.
    • Secret access key: Your autho-ses-user IAM user’s Secret access key as created in step 6 of Amazon SES setup.
    • Region: For this example, choose us-west-1.

  • Choose the Send test email button to test the Auth0 configuration with Amazon SES API credentials.
  • You can look at the Auth0 logs. If you have configured it successfully, you should receive an email from Auth0, as illustrated in the section on Auth0 configuration with Amazon SES SMTP credentials.
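
The API credentials can also be checked outside of Auth0. The following sketch builds a request for the Amazon SES SendEmail operation and, optionally, sends it with boto3. The addresses are hypothetical, boto3 is assumed to be installed only if you call the send function, and the Region matches this example.

```python
def build_send_email_request(sender, recipient, subject, body_text):
    """Build keyword arguments for the SES SendEmail operation."""
    return {
        "Source": sender,
        "Destination": {"ToAddresses": [recipient]},
        "Message": {
            "Subject": {"Data": subject, "Charset": "UTF-8"},
            "Body": {"Text": {"Data": body_text, "Charset": "UTF-8"}},
        },
    }


def send_with_api_credentials(request, access_key_id, secret_access_key,
                              region="us-west-1"):
    """Send the request with boto3 and return the SES message ID."""
    import boto3  # imported here so the builder above works without boto3

    client = boto3.client(
        "ses",
        region_name=region,
        aws_access_key_id=access_key_id,
        aws_secret_access_key=secret_access_key,
    )
    return client.send_email(**request)["MessageId"]


# Hypothetical addresses; replace with your verified identities.
request = build_send_email_request(
    "sender@example.com",
    "recipient@example.com",
    "Amazon SES API test",
    "If you can read this, the API credentials work.",
)
print(request["Message"]["Subject"]["Data"])
```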

Conclusion

In this blog post, we have demonstrated how to set up Amazon SES as an external SMTP email provider with Auth0, because Auth0’s built-in email infrastructure is limited to testing emails. We have also demonstrated how quickly and easily you can set up Amazon SES with SMTP credentials and API credentials. With this solution, you can set up your own Amazon SES account as an email provider for Auth0. You can also get a head start by checking the Amazon SES Developer Guide, which explains how Amazon SES provides an easy, cost-effective way for you to send and receive email using your own email addresses and domains.

About the authors

Raghavarao Sodabathina

Raghavarao Sodabathina is an Enterprise Solutions Architect at AWS. His areas of focus are Data Analytics, AI/ML, and the Serverless Platform. He engages with customers to create innovative solutions that address customer business problems and accelerate the adoption of AWS services. In his spare time, Raghavarao enjoys spending time with his family, reading books, and watching movies.

Pawan Matta

Pawan Matta is a Boston-based Gametech Solutions Architect for AWS. He enjoys working closely with customers and supporting their digital native business. His core areas of focus are management and governance and cost optimization. In his free time, Pawan loves watching cricket and playing video games with friends.

How to Accelerate Building a Lake House Architecture with AWS Glue

Post Syndicated from Raghavarao Sodabathina original https://aws.amazon.com/blogs/architecture/how-to-accelerate-building-a-lake-house-architecture-with-aws-glue/

Customers are building databases, data warehouses, and data lake solutions in isolation from each other, each having its own separate data ingestion, storage, management, and governance layers. Often these disjointed efforts to build separate data stores end up creating data silos, data integration complexities, excessive data movement, and data consistency issues. These issues are preventing customers from getting deeper insights. To overcome these issues and easily move data around, a Lake House approach on AWS was introduced.

In this blog post, we illustrate the AWS Glue integration components that you can use to accelerate building a Lake House architecture on AWS. We will also discuss how to derive persona-centric insights from your Lake House using AWS Glue.

Components of the AWS Glue integration system

AWS Glue is a serverless data integration service that facilitates the discovery, preparation, and combination of data. It can be used for analytics, machine learning, and application development. AWS Glue provides all of the capabilities needed for data integration. So you can start analyzing your data and putting it to use in minutes, rather than months.

The following diagram illustrates the various components of the AWS Glue integration system.

Figure 1. AWS Glue integration components

Connect – AWS Glue allows you to connect to various data sources anywhere

Glue connector: AWS Glue provides built-in support for the most commonly used data stores. You can use Amazon Redshift, Amazon RDS, Amazon Aurora, Microsoft SQL Server, MySQL, MongoDB, or PostgreSQL using JDBC connections. AWS Glue also allows you to use custom JDBC drivers in your extract, transform, and load (ETL) jobs. For data stores that are not natively supported such as SaaS applications, you can use connectors. You can also subscribe to several connectors offered in the AWS Marketplace.

Glue crawlers: You can use a crawler to populate the AWS Glue Data Catalog with tables. A crawler can crawl multiple data stores in a single pass. Upon completion, the crawler creates or updates one or more tables in your Data Catalog. Extract, transform, and load (ETL) jobs that you define in AWS Glue use these Data Catalog tables as sources and targets.
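
As a rough illustration of a crawler that covers several data stores in one pass, the sketch below builds a request for the AWS Glue CreateCrawler API with both Amazon S3 and JDBC targets. The crawler name, role ARN, database, paths, and connection name are all hypothetical.

```python
def build_crawler_request(name, role_arn, database, s3_paths, jdbc_targets=()):
    """Build keyword arguments for the AWS Glue CreateCrawler operation.

    jdbc_targets is a sequence of (connection_name, path) pairs.
    """
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {
            "S3Targets": [{"Path": path} for path in s3_paths],
            "JdbcTargets": [
                {"ConnectionName": connection, "Path": path}
                for connection, path in jdbc_targets
            ],
        },
    }


# Hypothetical names and paths.
crawler = build_crawler_request(
    name="lakehouse-crawler",
    role_arn="arn:aws:iam::111122223333:role/GlueCrawlerRole",
    database="lakehouse_raw",
    s3_paths=["s3://example-raw-bucket/orders/", "s3://example-raw-bucket/customers/"],
    jdbc_targets=[("orders-postgres", "sales/public/%")],
)
print(len(crawler["Targets"]["S3Targets"]), "S3 targets")

# With AWS credentials configured, the crawler could be created and started with boto3:
#   import boto3
#   glue = boto3.client("glue")
#   glue.create_crawler(**crawler)
#   glue.start_crawler(Name=crawler["Name"])
```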

Catalog – AWS Glue simplifies data discovery and governance

Glue Data Catalog: The Data Catalog serves as the central metadata catalog for the entire data landscape.

Glue Schema Registry: The AWS Glue Schema Registry allows you to centrally discover, control, and evolve data stream schemas. With AWS Glue Schema Registry, you can manage and enforce schemas on your data streaming applications.

Data quality – AWS Glue helps you author and monitor data quality rules

Glue DataBrew: AWS Glue DataBrew allows data scientists and data analysts to clean and normalize data. You can use a visual interface, reducing the time it takes to prepare data by up to 80%. With Glue DataBrew, you can visualize, clean, and normalize data directly from your data lake, data warehouses, and databases.

Curate data: You can use either Glue development endpoint or AWS Glue Studio to curate your data.

AWS Glue development endpoint is an environment that you can use to develop and test your AWS Glue scripts. You can choose either Amazon SageMaker notebook or Apache Zeppelin notebook as an environment.

AWS Glue Studio is a new visual interface for AWS Glue that supports extract-transform-and-load (ETL) developers. You can author, run, and monitor AWS Glue ETL jobs. You can now use a visual interface to compose jobs that move and transform data, and run them on AWS Glue.

AWS Data Exchange makes it easy for AWS customers to securely exchange and use third-party data in AWS. This is for data providers who want to structure their data across multiple datasets or enrich their products with additional data. You can publish additional datasets to your products using the AWS Data Exchange.

Deequ is an open-source data quality library developed internally at Amazon. It provides multiple features such as automatic constraint suggestions and verification, metrics computation, and data profiling.
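
To make the idea of constraint verification concrete, here is a small plain-Python sketch of two common checks, completeness and uniqueness. It does not use Deequ or its API; the function names, the sample rows, and the thresholds are made up for illustration.

```python
def completeness(rows, column):
    """Return the fraction of rows in which the column has a non-null value."""
    if not rows:
        return 0.0
    non_null = sum(1 for row in rows if row.get(column) is not None)
    return non_null / len(rows)


def is_unique(rows, column):
    """Return True if no non-null value of the column appears more than once."""
    values = [row[column] for row in rows if row.get(column) is not None]
    return len(values) == len(set(values))


def verify(rows, checks):
    """Run each named check against the rows and collect pass/fail results."""
    return {name: check(rows) for name, check in checks.items()}


# Hypothetical sample data with one missing amount and one repeated order_id.
rows = [
    {"order_id": 1, "amount": 20.0},
    {"order_id": 2, "amount": None},
    {"order_id": 3, "amount": 12.5},
    {"order_id": 3, "amount": 7.0},
]
checks = {
    "order_id is complete": lambda r: completeness(r, "order_id") == 1.0,
    "order_id is unique": lambda r: is_unique(r, "order_id"),
    "amount is at least 75% complete": lambda r: completeness(r, "amount") >= 0.75,
}
results = verify(rows, checks)
for name, passed in results.items():
    print(f"{name}: {'passed' if passed else 'FAILED'}")
```

Here the uniqueness check fails because order_id 3 appears twice, while the two completeness checks pass.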

Build a Lake House architecture faster, using AWS Glue

Figure 2 illustrates how you can build a Lake House using AWS Glue components.

Figure 2. Building Lake House architectures with AWS Glue

The architecture flow follows these general steps:

  1. Glue crawlers scan the data from various data sources and populate the Data Catalog for your Lake House.
  2. The Data Catalog serves as the central metadata catalog for the entire data landscape.
  3. Once data is cataloged, fine-grained access control is applied to the tables through AWS Lake Formation.
  4. Curate your data with business and data quality rules by using Glue Studio, Glue development endpoints, or Glue DataBrew. Place transformed data in a curated Amazon S3 bucket for purpose-built analytics downstream.
  5. Facilitate data movement with AWS Glue to and from your data lake, databases, and data warehouse by using Glue connections. Use AWS Glue Elastic Views to replicate the data across the Lake House.

Derive persona-centric insights from your Lake House using AWS Glue

Many organizations want to gather observations from increasingly larger volumes of acquired data. These insights help them make data-driven decisions with speed and agility. They must use a central data lake, a ring of purpose-built data services, and data warehouses based on persona or job function.

Figure 3 illustrates the Lake House inside-out data movement with AWS Glue DataBrew, Amazon Athena, Amazon Redshift, and Amazon QuickSight to perform persona-centric data analytics.

Figure 3. Lake House persona-centric data analytics using AWS Glue

This shows how Lake House components serve various personas in an organization:

  1. Data ingestion: Data is ingested to Amazon Simple Storage Service (S3) from different sources.
  2. Data processing: Data curators and data scientists use DataBrew to validate, clean, and enrich the data. Amazon Athena is also used to run improvised queries to analyze the data in the lake. The transformation is shared with data engineers to set up batch processing.
  3. Batch data processing: Data engineers or developers set up batch jobs in AWS Glue and AWS Glue DataBrew. Jobs can be initiated by an event, or can be scheduled to run periodically.
  4. Data analytics: Data and business analysts can now analyze prepared datasets in Amazon Redshift or in Amazon S3 using Athena.
  5. Data visualizations: Business analysts can create visuals in QuickSight. Data curators can enrich data from multiple sources. Admins can enforce security and data governance. Developers can embed QuickSight dashboards in applications.

Conclusion

Using a Lake House architecture will help you get persona-centric insights quickly from all of your data based on user role or job function. In this blog post, we described several AWS Glue components and AWS purpose-built services that you can use to build Lake House architectures on AWS. We also presented a persona-centric Lake House analytics architecture using AWS Glue, to help you derive insights from your Lake House.

Read more and get started on building Lake House Architectures on AWS.

Analyze Fraud Transactions using Amazon Fraud Detector and Amazon Athena

Post Syndicated from Raghavarao Sodabathina original https://aws.amazon.com/blogs/architecture/analyze-fraud-transactions-using-amazon-fraud-detector-and-amazon-athena/

Organizations with online businesses have to be on guard constantly for fraudulent activity, such as fake accounts or payments made with stolen credit cards. One way they try to identify fraudsters is by using fraud detection applications. Some of these applications use machine learning (ML).

A common challenge with ML is the need for a large, labeled dataset to create ML models to detect fraud. You will also need the skill set and infrastructure to build, train, deploy, and scale your ML model.

In this post, I discuss how to perform fraud detection on a batch of many events using Amazon Fraud Detector. Amazon Fraud Detector is a fully managed service that can identify potentially fraudulent online activities. These can be situations such as the creation of fake accounts or online payment fraud. Unlike general-purpose ML packages, Amazon Fraud Detector is designed specifically to detect fraud. You can analyze fraud transaction prediction results by using Amazon Athena and Amazon QuickSight. I will explain how to review fraud using Amazon Fraud Detector and Amazon SageMaker built-in algorithms.

Batch fraud prediction use cases

You can use a batch predictions job in Amazon Fraud Detector to get predictions for a set of events that do not require real-time scoring. You may want to generate fraud predictions for a batch of events. These might be payment fraud, account take over or compromise, and free tier misuse while performing an offline proof-of-concept. You can also use batch predictions to evaluate the risk of events on an hourly, daily, or weekly basis depending upon your business need.

Batch fraud insights using Amazon Fraud Detector

Organizations such as ecommerce companies and credit card companies use ML to detect fraud. Some of the most common types of fraud include email account compromise (personal or business), new account fraud, and non-payment or non-delivery (which includes compromised card numbers).

Amazon Fraud Detector automates the time-consuming and expensive steps to build, train, and deploy an ML model for fraud detection. Amazon Fraud Detector customizes each model it creates to your dataset, making the accuracy of models higher than current one-size-fits-all ML solutions. And because you pay only for what you use, you can avoid large upfront expenses.

If you want to analyze fraud transactions after the fact, you can perform batch fraud predictions using Amazon Fraud Detector. Then you can store fraud prediction results in an Amazon S3 bucket. Amazon Athena helps you analyze the fraud prediction results. You can create fraud prediction visualization dashboards using Amazon QuickSight.
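
As an illustration of the kind of SQL you might run over prediction results, the sketch below summarizes outcomes and average model scores. It uses Python's built-in sqlite3 module purely as a local stand-in so the query can be tried without AWS access. In practice the query would run in Amazon Athena against a table defined over the output bucket. The table name, column names, and rows are hypothetical and do not reflect the exact output schema of Amazon Fraud Detector.

```python
import sqlite3

# Aggregate query written in standard SQL.
QUERY = """
SELECT outcome,
       COUNT(*) AS events,
       ROUND(AVG(model_score), 1) AS avg_score
FROM fraud_predictions
GROUP BY outcome
ORDER BY events DESC, outcome
"""

# Local, in-memory stand-in for the Athena table (hypothetical schema and rows).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fraud_predictions (event_id TEXT, outcome TEXT, model_score REAL)"
)
conn.executemany(
    "INSERT INTO fraud_predictions VALUES (?, ?, ?)",
    [
        ("e1", "approve", 120.0),
        ("e2", "approve", 80.0),
        ("e3", "review", 700.0),
        ("e4", "block", 950.0),
    ],
)

summary = conn.execute(QUERY).fetchall()
for outcome, events, avg_score in summary:
    print(outcome, events, avg_score)
```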

The following diagram illustrates how to perform fraud predictions for a batch of events and analyze them using Amazon Athena.

Figure 1. Example architecture for analyzing fraud transactions using Amazon Fraud Detector and Amazon Athena

The architecture flow follows these general steps:

  1. Create and publish a detector. First create and publish a detector using Amazon Fraud Detector. It should contain your fraud prediction model and rules. For additional details, see Get started (console).
  2. Create an input Amazon S3 bucket and upload your CSV file. Prepare a CSV file that contains the events you want to evaluate. Then upload your CSV file into the input S3 bucket. In this file, include a column for each variable in the event type associated with your detector. In addition, include columns for EVENT_ID, ENTITY_ID, EVENT_TIMESTAMP, ENTITY_TYPE. Refer to Amazon Fraud Detector batch input and output files for more details. Read Create a variable for additional information on Amazon Fraud Detector variable data types and formatting.
  3. Create an output Amazon S3 bucket. Create an output Amazon S3 bucket to store your Amazon Fraud Detector prediction results.
  4. Perform a batch prediction. You can use a batch predictions job in Amazon Fraud Detector to get predictions for a set of events that do not require real-time scoring. Read more here about Batch predictions.
  5. Review your prediction results. Review your results in the CSV file that is generated and stored in the Amazon S3 output bucket.
  6. Analyze your fraud prediction results.
    • After creating a Data Catalog by using AWS Glue, you can use Amazon Athena to analyze your fraud prediction results with standard SQL.
    • You can develop user-friendly dashboards to analyze fraud prediction results using Amazon QuickSight by creating new datasets with Amazon Athena as your data source.
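
The input file from step 2 can be sketched with Python's csv module. The four metadata columns come from the step above. The event variables (email_address and ip_address), the sample values, and the ISO 8601 UTC timestamp format are assumptions for illustration, so check them against the event type associated with your detector.

```python
import csv
import io
from datetime import datetime, timezone

# Metadata columns named in step 2.
REQUIRED_COLUMNS = ["EVENT_ID", "ENTITY_ID", "EVENT_TIMESTAMP", "ENTITY_TYPE"]


def build_batch_csv(events, variable_names):
    """Render events as CSV text with metadata columns followed by variables."""
    header = REQUIRED_COLUMNS + list(variable_names)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=header, lineterminator="\n")
    writer.writeheader()
    for event in events:
        missing = [column for column in header if column not in event]
        if missing:
            raise ValueError(f"event is missing columns: {missing}")
        writer.writerow({column: event[column] for column in header})
    return buffer.getvalue()


# Hypothetical event with two example variables.
timestamp = datetime(2021, 6, 1, 12, 30, tzinfo=timezone.utc).strftime(
    "%Y-%m-%dT%H:%M:%SZ"
)
events = [
    {
        "EVENT_ID": "evt-001",
        "ENTITY_ID": "cust-42",
        "EVENT_TIMESTAMP": timestamp,
        "ENTITY_TYPE": "customer",
        "email_address": "user@example.com",
        "ip_address": "192.0.2.10",
    }
]
csv_text = build_batch_csv(events, ["email_address", "ip_address"])
print(csv_text)
```

Validating that every row carries all expected columns before upload avoids a failed batch job later.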

Fraud detection using Amazon SageMaker

The Amazon Web Services (AWS) Solutions Implementation, Fraud Detection Using Machine Learning, enables you to run automated transaction processing. This can be on an example dataset or your own dataset. The included ML model detects potentially fraudulent activity and flags that activity for review. The following diagram presents the architecture you can automatically deploy using the solution’s implementation guide and accompanying AWS CloudFormation template.

SageMaker provides several built-in machine learning algorithms that you can use for a variety of problem types. This solution leverages the built-in Random Cut Forest algorithm for unsupervised learning and the built-in XGBoost algorithm for supervised learning. In the SageMaker Developer Guide, you can see how Random Cut Forest and XGBoost algorithms work.

Figure 2. Fraud detection using machine learning architecture on AWS

This architecture can be segmented into three phases.

  1. Develop a fraud prediction machine learning model. The AWS CloudFormation template deploys an example dataset of credit card transactions contained in an Amazon Simple Storage Service (Amazon S3) bucket. Different ML models are trained on the dataset in an Amazon SageMaker notebook instance.
  2. Perform fraud prediction. The solution also deploys an AWS Lambda function that processes transactions from the example dataset. It invokes the two SageMaker endpoints that assign anomaly scores and classification scores to incoming data points. An Amazon API Gateway REST API initiates predictions using signed HTTP requests. An Amazon Kinesis Data Firehose delivery stream loads the processed transactions into another Amazon S3 bucket for storage. The solution also provides an example of how to invoke the prediction REST API as part of the Amazon SageMaker notebook.
  3. Analyze fraud transactions. Once the transactions have been loaded into Amazon S3, you can use analytics tools and services for visualization, reporting, ad-hoc queries, and more detailed analysis.
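
The second phase, where one function calls two endpoints, can be sketched as follows. This is not the solution's actual Lambda code. The endpoint names are hypothetical, and the response formats (JSON scores from Random Cut Forest and a plain-text probability from XGBoost) are assumptions that should be checked against your deployed models. A stub stands in for the SageMaker runtime client so the sketch runs without AWS access.

```python
import io
import json

RCF_ENDPOINT = "fraud-rcf-endpoint"      # hypothetical endpoint name
XGB_ENDPOINT = "fraud-xgboost-endpoint"  # hypothetical endpoint name


def score_transaction(runtime, features):
    """Return an anomaly score and a fraud probability for one transaction."""
    payload = ",".join(str(value) for value in features)

    # Unsupervised model: assumed to return {"scores": [{"score": ...}]}.
    rcf = runtime.invoke_endpoint(
        EndpointName=RCF_ENDPOINT,
        ContentType="text/csv",
        Accept="application/json",
        Body=payload,
    )
    anomaly_score = json.loads(rcf["Body"].read())["scores"][0]["score"]

    # Supervised model: assumed to return the probability as plain text.
    xgb = runtime.invoke_endpoint(
        EndpointName=XGB_ENDPOINT,
        ContentType="text/csv",
        Body=payload,
    )
    fraud_probability = float(xgb["Body"].read().decode("utf-8").split(",")[0])

    return {"anomaly_score": anomaly_score, "fraud_probability": fraud_probability}


class StubRuntime:
    """Stand-in for boto3.client('sagemaker-runtime') with canned responses."""

    def invoke_endpoint(self, EndpointName, ContentType, Body, Accept=None):
        if EndpointName == RCF_ENDPOINT:
            return {"Body": io.BytesIO(b'{"scores": [{"score": 1.8}]}')}
        return {"Body": io.BytesIO(b"0.93")}


scores = score_transaction(StubRuntime(), [0.5, 120.0, 3])
print(scores)
```

Passing the client in as an argument keeps the scoring logic testable. In a deployed function you would pass a real SageMaker runtime client instead of the stub.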

By default, the solution is configured to process transactions from the example dataset. To use your own dataset, you must modify the solution. For more information, see Customization.

Conclusion

In this post, we showed you how to analyze fraud transactions using Amazon Fraud Detector and Amazon Athena. You can build fraud insights using Amazon Fraud Detector and the Amazon SageMaker built-in algorithms Random Cut Forest and XGBoost. With the information in this post, you can build your own fraud insights models on AWS. You’ll be able to detect fraud faster. Finally, you’ll be able to address a variety of fraud types, such as new account fraud, online transaction fraud, and fake reviews, among others.

Read more and get started on building fraud detection models on AWS.

Architecting Persona-centric Data Platform with On-premises Data Sources

Post Syndicated from Raghavarao Sodabathina original https://aws.amazon.com/blogs/architecture/architecting-persona-centric-data-platform-with-on-premises-data-sources/

Many organizations are moving their data from silos and aggregating it in one location. Collecting this data in a data lake enables you to perform analytics and machine learning on that data. You can store your data in purpose-built data stores, like a data warehouse, to get quick results for complex queries on structured data.

In this post, we show how to architect a persona-centric data platform with on-premises data sources by using AWS purpose-built analytics services and Apache NiFi. We will also discuss Lake House architecture on AWS, which is the next evolution from data warehouse and data lake-based solutions.

Data movement services

AWS provides a wide variety of services to bring data into a data lake:

You may want to bring on-premises data into the AWS Cloud to take advantage of AWS purpose-built analytics services, derive insights, and make timely business decisions. Apache NiFi is an open source tool that enables you to move and process data using a graphical user interface.

For this use case and solution architecture, we use Apache NiFi to ingest data into Amazon S3 and AWS purpose-built analytics services, based on user personas.

Building persona-centric data platform on AWS

When you are building a persona-centric data platform for analytics and machine learning, you must first identify your user personas. Who will be using your platform? Then choose the appropriate purpose-built analytics services. Envision a data platform analytics architecture as a stack of seven layers:

  1. User personas: Identify your user personas for data engineering, analytics, and machine learning
  2. Data ingestion layer: Bring data into your data platform and capture a data lineage lifecycle view while ingesting data into your storage layer
  3. Storage layer: Store your structured and unstructured data
  4. Cataloging layer: Store your business and technical metadata about datasets from the storage layer
  5. Processing layer: Create data processing pipelines
  6. Consumption layer: Enable your user personas for purpose-built analytics
  7. Security and Governance: Protect your data across the layers

Reference architecture

The following diagram illustrates how to architect a persona-centric data platform with on-premises data sources by using AWS purpose-built analytics services and Apache NiFi.

Figure 1. Example architecture for persona-centric data platform with on-premises data sources

Architecture flow:

    1. Identify user personas: You must first identify user personas to derive insights from your data platform. Let’s start with identifying your users:
      • Enterprise data service users who would like to consume data from your data lake into their respective applications.
      • Business users who would like to create business intelligence dashboards by using your data lake datasets.
      • IT users who would like to query data from your data lake by using traditional SQL queries.
      • Data scientists who would like to run machine learning algorithms to derive recommendations.
      • Enterprise data warehouse users who would like to run complex SQL queries on your data warehouse datasets.
    2. Data ingestion layer: Apache NiFi scans the on-premises data stores and ingests the data into your data lake (Amazon S3). Apache NiFi can also transform the data in transit. It supports both Extract, Transform, Load (ETL) and Extract, Load, Transform (ELT) data transformations. Apache NiFi also supports a data lineage lifecycle view while ingesting data into Amazon S3.
    3. Storage layer: For your data lake storage, we recommend using Amazon S3 to build a data lake. It has unmatched 11 nines of durability and 99.99% availability. You can also create raw, transformed, and enriched storage layers depending upon your use case.
    4. Cataloging layer: AWS Lake Formation provides the central catalog to store and manage metadata for all datasets hosted in the data lake by AWS Glue Data Catalog. AWS services such as AWS Glue, Amazon EMR, and Amazon Athena natively integrate with Lake Formation. They automate discovering and registering dataset metadata into the Lake Formation catalog.
    5. Processing layer: Amazon EMR processes your raw data and places it into a new S3 bucket. Use AWS Glue DataBrew and AWS Glue to process the data as needed.
    6. Consumption layer or persona-centric analytics: Once data is transformed:
      • AWS Lambda and Amazon API Gateway will allow you to develop data services for enterprise data service users
      • You can develop user-friendly dashboards for your business users using Amazon QuickSight
      • Use Amazon Athena to query transformed data for your IT users
      • Your data scientists can utilize AWS Glue DataBrew to clean and normalize the data and Amazon SageMaker for machine learning models
      • Your enterprise data warehouse users can use Amazon Redshift to derive business intelligence
    7. Security and governance layer: AWS IAM provides user-, group-, and role-level identity, in addition to the ability to configure coarse-grained access control for resources managed by AWS services in all layers. AWS Lake Formation provides fine-grained access controls, and you can grant or revoke permissions at the database, table, or column level.

Lake House architecture on AWS

The vast majority of data lakes are built on Amazon S3. At the same time, customers are leveraging purpose-built analytics stores that are optimized for specific use cases. Customers want the freedom to move data between their centralized data lakes and the surrounding purpose-built analytics stores. And they want to get insights with speed and agility in a seamless, secure, and compliant manner. We call this modern approach to analytics the Lake House architecture.

Figure 2. Lake House architecture on AWS

Refer to the whitepaper Derive Insights from AWS Lake house for various design patterns to derive persona-centric analytics by using the AWS Lake House approach. Check out the blog post Build a Lake House Architecture on AWS for a Lake House reference architecture on AWS.

Conclusion

In this post, we show you how to build a persona-centric data platform on AWS with a seven-layered approach. This uses Apache NiFi as a data ingestion tool and AWS purpose-built analytics services for persona-centric analytics and machine learning. We have also shown how to build persona-centric analytics by using the AWS Lake House approach.

With the information in this post, you can now build your own data platform on AWS to gain faster and deeper insights from your data. AWS provides you the broadest and deepest portfolio of purpose-built analytics and machine learning services to support your business needs.

Read more and get started on building a data platform on AWS:

Fine-tuning blue/green deployments on application load balancer

Post Syndicated from Raghavarao Sodabathina original https://aws.amazon.com/blogs/devops/blue-green-deployments-with-application-load-balancer/

In a traditional approach to application deployment, you typically fix a failed deployment by redeploying an older, stable version of the application. Redeployment in traditional data centers is typically done on the same set of resources due to the cost and effort of provisioning additional resources. Applying the principles of agility, scalability, and automation capabilities of AWS can shift the paradigm of application deployment. This enables a better deployment technique called blue/green deployment.

Blue/green deployments provide near-zero downtime release and rollback capabilities. The fundamental idea behind blue/green deployment is to shift traffic between two identical environments that are running different versions of your application. The blue environment represents the current application version serving production traffic. In parallel, the green environment is staged running a newer version of your application. After the green environment is ready and tested, production traffic is redirected from blue to green. If any problems are identified, you can roll back by reverting traffic to the blue environment.

Canary deployments are a pattern for the slow rollout of a new version of an existing application. A canary deployment rolls out the new version incrementally, making it visible to new users gradually. As you gain confidence in the deployment, you can deploy it to replace the current version in its entirety.

AWS provides several options to help you automate and streamline your blue/green and canary deployments. One such approach is the Application Load Balancer weighted target group feature. In this post, we cover the concepts of target group stickiness, load balancer stickiness, and connection draining, and how they influence traffic shifting for canary and blue/green deployments when using the Application Load Balancer weighted target group feature.

Application Load Balancer weighted target groups

A target group is used to route requests to one or more registered targets like Amazon Elastic Compute Cloud (Amazon EC2) instances, fixed IP addresses, or AWS Lambda functions, among others. When creating a load balancer, you create one or more listeners and configure listener rules to direct the traffic to a target group.

Application Load Balancers now support weighted target group routing. With this feature, you can add more than one target group to the forward action of a listener rule, and specify a weight for each group. For example, when you define a rule as having two target groups with weights of 9 and 1, the load balancer routes 90% of the traffic to the first target group and 10% to the other target group. You can create and configure your weighted target groups by using the AWS Management Console, the AWS CLI, or an AWS SDK.

For more information, see How do I set up weighted target groups for my Application Load Balancer?
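
The weight arithmetic described above can be checked with a short shell helper. This is only an illustrative sketch: the function name weight_share is our own and not part of the AWS CLI, and it uses integer division, so fractional shares are rounded down.

```shell
#!/bin/sh
# weight_share FIRST_WEIGHT [OTHER_WEIGHT...]
# Hypothetical helper: prints the integer percentage of traffic that the
# first weight receives when all listed weights share one forward action.
weight_share() {
    first=$1
    total=0
    for w in "$@"; do
        total=$((total + w))
    done
    if [ "$total" -eq 0 ]; then
        echo "error: cannot compute a share when the total weight is 0" >&2
        return 1
    fi
    # Integer division: fractional percentages are rounded down.
    echo $((first * 100 / total))
}

weight_share 9 1    # prints 90
weight_share 1 9    # prints 10
```

Because only the ratio matters in this calculation, weights of 9 and 1 give the same split as weights of 90 and 10, which is the form used in the CLI examples in this post.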

Target group level stickiness

You can set target group stickiness to make sure clients are served from a specific target group for a configurable duration, which ensures a consistent experience. Target group stickiness is different from the existing load balancer stickiness (also known as sticky sessions). Sticky sessions make sure that requests from a client are always routed to a particular target within a target group. Target group stickiness only ensures that requests are sent to a particular target group.

You can enable target group level stickiness using the AWS Command Line Interface (AWS CLI) with the TargetGroupStickinessConfig parameter, as shown in the following CLI command:

# Route 90% of traffic to blue and 10% to green, with 120 seconds of target group stickiness
aws elbv2 modify-listener \
    --listener-arn "<LISTENER ARN>" \
    --default-actions \
    '[{
       "Type": "forward",
       "Order": 1,
       "ForwardConfig": {
          "TargetGroups": [
             {"TargetGroupArn": "<Blue Target Group ARN>", "Weight": 90},
             {"TargetGroupArn": "<Green Target Group ARN>", "Weight": 10}
          ],
          "TargetGroupStickinessConfig": {
             "Enabled": true,
             "DurationSeconds": 120
          }
       }
    }]'

In the next sections, we will see how to fine-tune the weighted target group configuration to achieve effective canary deployments and blue/green deployments.

Canary deployments with Application Load Balancer weighted target group

The canary deployment pattern allows you to roll out a new version of your application to a subset of users before making it widely available. This can be helpful in validating the stability of a new version of the application or performing A/B testing.

For this use case, you want to perform canary deployment for your application and test it by driving only 10% of the incoming traffic to your new version for 12 hours. You need to create two weighted target groups for your Application Load Balancer and use target group stickiness set to a duration of 12 hours. When target group stickiness is enabled, the requests from a client are sent to the same target group for the specified time duration.

Figure 1: Blue and green target groups with weights 90 and 10 for canary deployment

We can define a rule as having two target groups, blue and green, with weights of 90 and 10, respectively, and enable target group level stickiness with a duration of 12 hours (43,200 seconds). Figure 1 summarizes this configuration. See the following CLI command:

# Canary: send 10% of traffic to green and keep clients on their target group for 12 hours
aws elbv2 modify-listener \
    --listener-arn "<LISTENER ARN>" \
    --default-actions \
    '[{
       "Type": "forward",
       "Order": 1,
       "ForwardConfig": {
          "TargetGroups": [
             {"TargetGroupArn": "<Blue Target Group ARN>", "Weight": 90},
             {"TargetGroupArn": "<Green Target Group ARN>", "Weight": 10}
          ],
          "TargetGroupStickinessConfig": {
             "Enabled": true,
             "DurationSeconds": 43200
          }
       }
    }]'
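
DurationSeconds is expressed in seconds, so a 12-hour canary window becomes 43,200. The following sketch converts hours to seconds and rejects values outside the 1 second to 7 days (604,800 seconds) range that the Elastic Load Balancing API documents for target group stickiness. The function name stickiness_seconds is hypothetical and chosen only for this illustration.

```shell
#!/bin/sh
# stickiness_seconds HOURS
# Hypothetical helper: converts whole hours to a DurationSeconds value and
# checks it against the 1..604800 second range for target group stickiness.
stickiness_seconds() {
    hours=$1
    seconds=$((hours * 3600))
    if [ "$seconds" -lt 1 ] || [ "$seconds" -gt 604800 ]; then
        echo "error: duration must be between 1 and 604800 seconds" >&2
        return 1
    fi
    echo "$seconds"
}

stickiness_seconds 12    # prints 43200
```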

At this point, users with existing sessions continue to be sent to the blue target group running version 1, and 10% of new users without a session are sent to the green target group running version 2, where they remain for up to 12 hours, as illustrated in the following diagram.

Figure 2: Blue-green deployment architecture with 90% blue traffic and 10% green traffic.

When you’re confident that the new version is stable and performing well, you can update the weights for your blue and green target groups to 0% and 100%, respectively, so that all traffic shifts to your green target group. You may still see some traffic flowing to the blue target group from existing users with active sessions whose target group stickiness duration (12 hours in this case) has not expired.

Recommendation: As illustrated above, the target group stickiness duration still influences the traffic shift between the blue and green target groups. We therefore recommend reducing the target group stickiness duration from 12 hours to 5 minutes or less, depending on your use case, so that existing users routed to the blue target group also transition fully to the green target group as early as possible. Some of our customers use a target group stickiness duration of 5 minutes to shift their traffic to the green target group after successful canary testing.

The recommended stickiness value may vary across application types. For example, for a typical 3-tier front-end deployment, a lower target group stickiness value is desirable. However, for a middle-tier deployment, the target group stickiness duration may need to be higher to account for longer transactions.
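
The promotion step described above (weights of 0 and 100 with a shortened stickiness duration) can be sketched as a small payload builder. This is a minimal illustration, not a definitive implementation: the function name forward_action and the variables LISTENER_ARN, BLUE_TG_ARN, and GREEN_TG_ARN are assumptions made for this example, while the JSON fields are the same ones used in the modify-listener commands in this post.

```shell
#!/bin/sh
# forward_action BLUE_ARN BLUE_WEIGHT GREEN_ARN GREEN_WEIGHT [STICKY_SECONDS]
# Hypothetical helper: prints the JSON for --default-actions. When
# STICKY_SECONDS is omitted or 0, target group stickiness is disabled.
forward_action() {
    blue_arn=$1
    blue_weight=$2
    green_arn=$3
    green_weight=$4
    sticky_seconds=${5:-0}
    if [ "$sticky_seconds" -gt 0 ]; then
        stickiness=$(printf '{"Enabled":true,"DurationSeconds":%d}' "$sticky_seconds")
    else
        stickiness='{"Enabled":false}'
    fi
    printf '[{"Type":"forward","Order":1,"ForwardConfig":{"TargetGroups":[{"TargetGroupArn":"%s","Weight":%d},{"TargetGroupArn":"%s","Weight":%d}],"TargetGroupStickinessConfig":%s}}]\n' \
        "$blue_arn" "$blue_weight" "$green_arn" "$green_weight" "$stickiness"
}

# Print the payload that promotes green to 100% with 5 minutes of stickiness.
forward_action "<Blue Target Group ARN>" 0 "<Green Target Group ARN>" 100 300

# To apply it (requires the AWS CLI and credentials; variable names are placeholders):
# aws elbv2 modify-listener \
#     --listener-arn "$LISTENER_ARN" \
#     --default-actions "$(forward_action "$BLUE_TG_ARN" 0 "$GREEN_TG_ARN" 100 300)"
```

Generating the payload in one place keeps the canary step (90/10 with 43,200 seconds) and the promotion step (0/100 with 300 seconds) consistent, and avoids hand-editing JSON between the two commands.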

Blue/green deployments with Application Load Balancer weighted target group

For this use case, you want to perform a blue/green deployment for your application to provide near-zero downtime release and rollback capabilities. You can create two weighted target groups called blue and green with the following weights applied as an initial configuration.

Figure 3: Blue/green deployment configuration with blue target group 100% and green target group 0%

When you’re ready to perform the deployment, you can change the weights for the blue and green target groups to 0% and 100%, respectively, to shift the traffic completely to the newer version of your application.

Figure 4: Blue/green deployment configuration with blue target group 0% and green target group 100%

When you’re performing blue/green deployment using weighted target groups, the recommendation is to not enable target group level stickiness so that traffic shifts immediately from the blue target group to the green target group. See the following CLI command:

# Blue/green cutover: send all traffic to green, with no target group stickiness
aws elbv2 modify-listener \
    --listener-arn "<LISTENER ARN>" \
    --default-actions \
    '[{
       "Type": "forward",
       "Order": 1,
       "ForwardConfig": {
          "TargetGroups": [
             {"TargetGroupArn": "<Blue Target Group>", "Weight": 0},
             {"TargetGroupArn": "<Green Target Group>", "Weight": 100}
          ]
       }
    }]'

The following diagram shows the updated architecture.

Figure 5: Blue-green deployment architecture with 0% blue traffic and 100% green traffic
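
Because both target groups stay attached to the listener, rolling back is the same operation with the weights reversed. The sketch below generates the forward action for either direction without stickiness. The function name bluegreen_action and the mode names cutover and rollback are hypothetical, introduced only for this example.

```shell
#!/bin/sh
# bluegreen_action MODE BLUE_ARN GREEN_ARN
# Hypothetical helper: MODE is "cutover" (all traffic to green) or
# "rollback" (all traffic back to blue). Prints the --default-actions JSON.
bluegreen_action() {
    mode=$1
    blue_arn=$2
    green_arn=$3
    case "$mode" in
        cutover)
            blue_weight=0
            green_weight=100
            ;;
        rollback)
            blue_weight=100
            green_weight=0
            ;;
        *)
            echo "error: mode must be cutover or rollback" >&2
            return 1
            ;;
    esac
    printf '[{"Type":"forward","Order":1,"ForwardConfig":{"TargetGroups":[{"TargetGroupArn":"%s","Weight":%d},{"TargetGroupArn":"%s","Weight":%d}]}}]\n' \
        "$blue_arn" "$blue_weight" "$green_arn" "$green_weight"
}

# Print the payload that reverts all traffic to the blue target group.
bluegreen_action rollback "<Blue Target Group>" "<Green Target Group>"
```

The printed JSON can be passed to aws elbv2 modify-listener through --default-actions, in the same way as the hand-written payloads in this post.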

If you need to enable target group level stickiness, you can ensure that all traffic transitions from the blue target group to the green target group by keeping the target group level stickiness duration as low as possible (5 minutes or less).

In the following code, the target group level stickiness is enabled for a duration of 5 minutes and traffic is completely shifted from the blue target group to the green target group:

# Blue/green cutover with 5 minutes (300 seconds) of target group stickiness
aws elbv2 modify-listener \
    --listener-arn "<LISTENER ARN>" \
    --default-actions \
    '[{
       "Type": "forward",
       "Order": 1,
       "ForwardConfig": {
          "TargetGroups": [
             {"TargetGroupArn": "<Blue Target Group>", "Weight": 0},
             {"TargetGroupArn": "<Green Target Group>", "Weight": 100}
          ],
          "TargetGroupStickinessConfig": {
             "Enabled": true,
             "DurationSeconds": 300
          }
       }
    }]'

Existing users with stickiness to the blue target group continue to be routed to it until the 5-minute duration elapses from their last request.

Recommendation: As illustrated above, the target group stickiness duration still influences the traffic shift between the blue and green target groups. We therefore recommend reducing the target group stickiness duration from 5 minutes to 1 minute or less, depending on your use case, so that all users transition to the green target group as early as possible.

As noted above, the recommended stickiness value may vary across application types.

Connection draining

To provide near-zero downtime releases with blue/green deployments, you want to avoid breaking open network connections while taking an instance out of service, updating its software, or replacing it with a fresh instance that contains updated software. In the above use cases, you can ensure a graceful transition between the blue and green target groups by enabling the connection draining feature, which Application Load Balancer exposes as the deregistration delay attribute of a target group. You can configure it from the AWS Management Console, the AWS CLI, or by calling the ModifyTargetGroupAttributes action in the Elastic Load Balancing API, with a timeout of up to 1 hour. The right timeout depends on your application profile. If your application is stateless, such as a website that your customers browse, the lowest practical timeout is preferable. For applications that are transaction heavy or that use connection-oriented sessions such as WebSockets, we recommend a relatively high connection draining duration, because cutting connections short adversely affects the customer experience.
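
For Application Load Balancer target groups, the draining timeout is set through the deregistration_delay.timeout_seconds target group attribute, which accepts 0 to 3,600 seconds. The sketch below validates a timeout and prints the attribute in the Key=...,Value=... form that aws elbv2 modify-target-group-attributes expects. The function name draining_attribute and the variable BLUE_TG_ARN are assumptions made for this example.

```shell
#!/bin/sh
# draining_attribute SECONDS
# Hypothetical helper: validates the timeout (0..3600 seconds) and prints
# the attribute string for `aws elbv2 modify-target-group-attributes`.
draining_attribute() {
    seconds=$1
    if [ "$seconds" -lt 0 ] || [ "$seconds" -gt 3600 ]; then
        echo "error: deregistration delay must be between 0 and 3600 seconds" >&2
        return 1
    fi
    printf 'Key=deregistration_delay.timeout_seconds,Value=%d\n' "$seconds"
}

# Print the attribute for a short, 30-second drain (a stateless web tier).
draining_attribute 30

# To apply it (requires the AWS CLI and credentials; BLUE_TG_ARN is a placeholder):
# aws elbv2 modify-target-group-attributes \
#     --target-group-arn "$BLUE_TG_ARN" \
#     --attributes "$(draining_attribute 30)"
```

The 30-second value is only an example for a stateless workload. A WebSocket or transaction-heavy tier would typically use a larger value, up to the 3,600-second maximum.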

Load balancer stickiness

In addition to the target group level stickiness, Application Load Balancer also supports load balancer level stickiness. When a load balancer first receives a request from a client, it routes the request to a target, generates a cookie named AWSALB that encodes information about the selected target, encrypts the cookie, and includes the cookie in the response to the client. The client should include the cookie that it receives in subsequent requests to the load balancer. When the load balancer receives a request from a client that contains the cookie, if sticky sessions are enabled for the target group and the request goes to the same target group, the load balancer detects the cookie and routes the request to the same target. If the cookie is present but can’t be decoded, or if it refers to a target that was deregistered or is unhealthy, the load balancer selects a new target and updates the cookie with information about the new target.

You can enable Application Load Balancer stickiness using the AWS CLI or the console. You can specify a value between 1 second and 7 days.
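
With the AWS CLI, load balancer stickiness is configured through target group attributes: stickiness.enabled, stickiness.type (lb_cookie for the load balancer generated cookie), and stickiness.lb_cookie.duration_seconds. The following sketch builds those attributes for a given duration. The function name lb_stickiness_attributes and the variable TG_ARN are hypothetical and used only for illustration.

```shell
#!/bin/sh
# lb_stickiness_attributes SECONDS
# Hypothetical helper: validates the duration (1 second to 7 days) and prints
# the three space-separated attributes that enable lb_cookie stickiness.
lb_stickiness_attributes() {
    seconds=$1
    if [ "$seconds" -lt 1 ] || [ "$seconds" -gt 604800 ]; then
        echo "error: duration must be between 1 and 604800 seconds" >&2
        return 1
    fi
    printf 'Key=stickiness.enabled,Value=true Key=stickiness.type,Value=lb_cookie Key=stickiness.lb_cookie.duration_seconds,Value=%d\n' "$seconds"
}

# Print the attributes for a 1-hour sticky session.
lb_stickiness_attributes 3600

# To apply them (requires the AWS CLI and credentials; TG_ARN is a placeholder).
# The command substitution is left unquoted on purpose so that the shell
# splits the output into three separate --attributes arguments:
# aws elbv2 modify-target-group-attributes \
#     --target-group-arn "$TG_ARN" \
#     --attributes $(lb_stickiness_attributes 3600)
```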

In the context of blue/green and canary deployments, the load balancer stickiness has no influence on the traffic shifting behavior using the weighted target groups because target group stickiness takes precedence over load balancer stickiness.

Conclusion

In this post, we showed how to perform canary and blue/green deployments with Application Load Balancer’s weighted target group feature and how target group level stickiness impacts your canary and blue/green deployments. We also demonstrated how quickly you can enable ELB connection draining to provide near-zero downtime release with blue/green deployment. We hope that you find these recommendations helpful when you build a blue/green deployment with Application Load Balancer. You can reach out to AWS Solutions Architects and AWS Support teams for further assistance.

 

Raghavarao Sodabathina is an Enterprise Solutions Architect at AWS, focusing on Data Analytics, AI/ML, and Serverless Platform. He engages with customers to create innovative solutions that address customer business problems and accelerate the adoption of AWS services. In his spare time, Raghavarao enjoys spending time with his family, reading books, and watching movies.

Siva Rajamani is a Boston-based Enterprise Solutions Architect for AWS. He enjoys working closely with customers, supporting their digital transformation and AWS adoption journey. His core areas of focus are Serverless, Application Integration, and Security. Outside of work, he enjoys outdoor activities and watching documentaries.

TAGS: blue-green deployments