Tag Archives: Amazon Elasticsearch Service

Intelligently Search Media Assets with Amazon Rekognition and Amazon ES

Post Syndicated from Sridhar Chevendra original https://aws.amazon.com/blogs/architecture/intelligently-search-media-assets-with-amazon-rekognition-and-amazon-es/

Media assets have become increasingly important to industries like media and entertainment, manufacturing, education, social media applications, and retail. This is largely due to innovations in digital marketing, mobile, and ecommerce.

Successfully locating a digital asset like a video, graphic, or image reduces costs related to reproducing or re-shooting. An efficient search engine is critical to quickly delivering something like the latest fashion trends. This in turn increases customer satisfaction, builds brand loyalty, and helps increase businesses’ online footprints, ultimately contributing towards revenue.

This blog post shows you how to build automated indexing and search functions using AWS serverless managed artificial intelligence (AI)/machine learning (ML) services. This architecture provides high scalability, reduces operational overhead, and scales out/in automatically based on the demand, with a flexible pay-as-you-go pricing model.

Automatic tagging and rich metadata with Amazon ES

Asset libraries for images and videos are growing exponentially. With Amazon Elasticsearch Service (Amazon ES), this media is indexed and organized, which is important for efficient search and quick retrieval.

Adding correct metadata to digital assets based on enterprise standard taxonomy will help you narrow down search results. This includes information like media formats, but also richer metadata like location, event details, and so forth. With Amazon Rekognition, an advanced ML service, you do not need to tag and index these media assets. This automatic tagging and organization frees you up to gain insights like sentiment analysis from social media.

Figure 1 is tagged using Amazon Rekognition. You can see how rich metadata (Apparel, T-Shirt, Person, and Pills) is extracted automatically. Without Amazon Rekognition, you would have to manually add tags and categorize the image. This means you could only do a keyword search on what’s manually tagged. If the image was not tagged, then you likely wouldn’t be able to find it in a search.

Figure 1. An image tagged automatically with Amazon Rekognition

Figure 1. An image tagged automatically with Amazon Rekognition

Data ingestion, organization, and storage with Amazon S3

As shown in Figure 2, use Amazon Simple Storage Service (Amazon S3) to store your static assets. It provides high availability and scalability, along with unlimited storage. When you choose Amazon S3 as your content repository, multiple data providers are configured for data ingestion for future consumption by downstream applications. In addition to providing storage, Amazon S3 lets you organize data into prefixes based on the event type and captures S3 object mutations through S3 event notifications.

Figure 2. Solution overview diagram

Figure 2. Solution overview diagram

S3 event notifications are invoked for a specific prefix, suffix, or combination of both. They integrate with Amazon Simple Queue Service (Amazon SQS), Amazon Simple Notification Service (Amazon SNS), and AWS Lambda as targets. (Refer to the Amazon S3 Event Notifications user guide for best practices). S3 event notification targets vary across use cases. For media assets, Amazon SQS is used to decouple the new data objects ingested into S3 buckets and downstream services. Amazon SQS provides flexibility over the data processing based on resource availability.

Data processing with Amazon Rekognition

Once media assets are ingested into Amazon S3, they are ready to be processed. Amazon Rekognition determines the entities within each asset. Amazon Rekognition then extracts the entities in JSON format and assigns a confidence score.

If the confidence score is below the defined threshold, use Amazon Augmented AI (A2I) for further review. A2I is an ML service that helps you build the workflows required for human review of ML predictions.

Amazon Rekognition also supports custom modeling to help identify entities within the images for specific business needs. For instance, a campaign may need images of products worn by a brand ambassador at a marketing event. Then they may need to further narrow their search down by the individual’s name or age demographic.

Using our solution, a Lambda function invokes Amazon Rekognition to extract the entities from the ingested assets. Lambda continuously polls the SQS queue for any new messages. Once a message is available, the Lambda function invokes the Amazon Rekognition endpoint to extract the relevant entities.

The following is a sample output from detect_labels API call in Amazon Rekognition and the transformed output that will be updated to downstream search engine:

{'Labels': [{'Name': 'Clothing', 'Confidence': 99.98137664794922, 'Instances': [], 'Parents': []}, {'Name': 'Apparel', 'Confidence': 99.98137664794922,'Instances': [], 'Parents': []}, {'Name': 'Shirt', 'Confidence': 97.00833129882812, 'Instances': [], 'Parents': [{'Name': 'Clothing'}]}, {'Name': 'T-Shirt', 'Confidence': 76.36670684814453, 'Instances': [{'BoundingBox': {'Width': 0.7963646650314331, 'Height': 0.6813027262687683, 'Left':
0.09593021124601364, 'Top': 0.1719706505537033}, 'Confidence': 53.39663314819336}], 'Parents': [{'Name': 'Clothing'}]}], 'LabelModelVersion': '2.0', 'ResponseMetadata': {'RequestId': '3a561e82-badc-4ba0-aa77-39a13f1bb3a6', 'HTTPStatusCode': 200, 'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1', 'date': 'Mon, 17 May 2021 18:32:27 GMT', 'x-amzn-requestid': '3a561e82-badc-4ba0-aa77-39a13f1bb3a6','content-length': '542', 'connection': 'keep-alive'}, 'RetryAttempts': 0}}

As shown, the Lambda function submits an API call to Amazon Rekognition, where a T-shirt image in .jpeg format is provided as the input. Based on your confidence score threshold preference, Amazon Rekognition will prompt you to initiate a human review using Amazon A2I. It will also prompt you to use Amazon Rekognition Custom Labels to train the custom models. Lambda then identifies and arranges the labels and updates the specified index.

Indexing with Amazon ES

Amazon ES is a managed search engine service that provides enterprise-grade search engine capability for applications. In our solution, assets are searched based on entities that are used as metadata to update the index. Amazon ES is hosted as a public endpoint or a VPC endpoint for secure access within the specified AWS account.

Labels are identified and marked as tags, which are assigned to .jpeg formatted images. The following sample output shows the query on one of the tags issued on an Amazon ES cluster.


curl-XGET https://<ElasticSearch Endpoint>/<_IndexName>/_search?q=T-Shirt



In addition to photos, Amazon Rekognition also detects the labels on videos. It can recognize labels and identify characters and entities. These are then added to Amazon ES to enhance search capability. This allows users to skip to specific parts of a video for quick searchability. For instance, a marketer may need images of cashmere sweaters from a fashion show that was streamed and recorded.

Once the raw video clip is identified, it is then converted using Amazon Elastic Transcoder to play back on mobile devices, tablets, web browsers, and connected televisions. Elastic Transcoder is a highly scalable and cost-effective media transcoding service in the cloud. Segmented output renditions are created for delivery using the multiple protocols to compatible devices.


This blog describes AWS services that can be applied to diverse set of use cases for tagging and efficient search of images and videos. You can build automated indexing and search using AWS serverless managed AI/ML services. They provide high scalability, reduce operational overhead, and scale out/in automatically based on the demand, with a flexible pay-as-you-go pricing model.

To get started, use these references to create your own sample architectures:

Configure SAML single sign-on for Kibana with AD FS on Amazon Elasticsearch Service

Post Syndicated from Sajeev Attiyil Bhaskaran original https://aws.amazon.com/blogs/security/configure-saml-single-sign-on-for-kibana-with-ad-fs-on-amazon-elasticsearch-service/

It’s a common use case for customers to integrate identity providers (IdPs) with Amazon Elasticsearch Service (Amazon ES) to achieve single sign-on (SSO) with Kibana. This integration makes it possible for users to leverage their existing identity credentials and offers administrators a single source of truth for user and permissions management. In this blog post, we’ll discuss how you can configure Security Assertion Markup Language (SAML) authentication for Kibana by using Amazon ES and Microsoft Active Directory Federation Services (AD FS).

Amazon ES now natively supports SSO authentication that uses the SAML protocol. With SAML authentication for Kibana, users can integrate directly with their existing third-party IdPs, such as Okta, Ping Identity, OneLogin, Auth0, AD FS, AWS Single Sign-on, and Azure Active Directory. SAML authentication for Kibana is powered by Open Distro for Elasticsearch, an Apache 2.0-licensed distribution of Elasticsearch, and is available to all Amazon ES customers who have enabled fine-grained access controls.

When you set up SAML authentication with Kibana, you can configure authentication that uses either service provider (SP)-initiated SSO or IdP-initiated SSO. The SP-initiated SSO flow occurs when a user directly accesses any SAML-configured Kibana endpoint, at which time Amazon ES redirects the user to their IdP for authentication, followed by a redirect back to Amazon ES after successful authentication. An IdP-initiated SSO flow typically occurs when a user chooses a link that first initiates the sign-in flow at the IdP, skipping the redirect between Amazon ES and the IdP. This blog post will focus on the SAML SP-initiated SSO flow.


To complete this walkthrough, you must have the following:

Solution overview

For the solution presented in this post, you use your existing AD FS as an IdP for the user’s authentication. The SAML federation uses a claim-based authentication model in which user attributes (in this case stored in Active Directory) are passed from the IdP (AD FS) to the SP (Kibana).

Let’s walk through how a user would use the SAML protocol to access Amazon ES Kibana (the SP) while using AD FS as the IdP. In Figure 1, the user authentication request comes from an on-premises network, which is connected to Amazon VPC through a VPN connection—in this case, this could also be over AWS Direct Connect. The Amazon ES domain and AD FS are created in the same VPC.

Figure 1: A high-level view of a SAML transaction between Amazon ES and AD FS

Figure 1: A high-level view of a SAML transaction between Amazon ES and AD FS

The initial sign-in flow is as follows:

  1. Open a browser on the on-premises computer and navigate to the Kibana endpoint for your Amazon ES domain in the VPC.
  2. Amazon ES generates a SAML authentication request for the user and redirects it back to the browser.
  3. The browser redirects the SAML authentication request to AD FS.
  4. AD FS parses the SAML request and prompts user to enter credentials.
    1. User enters credentials and AD FS authenticates the user with Active Directory.
    2. After successful authentication, AD FS generates a SAML response and returns the encoded SAML response to the browser. The SAML response contains the destination (the Assertion Consumer Service (ACS) URL), the authentication response issuer (the AD FS entity ID URL), the digital signature, and the claim (which user is authenticated with AD FS, the user’s NameID, the group, the attribute used in SAML assertions, and so on).
  5. The browser sends the SAML response to the Kibana ACS URL, and then Kibana redirects to Amazon ES.
  6. Amazon ES validates the SAML response. If all the validations pass, you are redirected to the Kibana front page. Authorization is performed by Kibana based on the role mapped to the user. The role mapping is performed based on attributes of the SAML assertion being consumed by Kibana and Amazon ES.

Deploy the solution

Now let’s walk through the steps to set up SAML authentication for Kibana single sign-on by using Amazon ES and Microsoft AD FS.

Enable SAML for Amazon Elasticsearch Service

The first step in the configuration setup process is to enable SAML authentication in the Amazon ES domain.

To enable SAML for Amazon ES

  1. Sign in to the Amazon ES console and choose any existing Amazon ES domain that meets the criteria described in the Prerequisites section of this post.
  2. Under Actions, select Modify Authentication.
  3. Select the Enable SAML authentication check box.
    Figure 2: Enable SAML authentication

    Figure 2: Enable SAML authentication

    When you enable SAML, it automatically creates and displays the different URLs that are required to configure SAML support in your IdP.

    Figure 3: URLs for configuring the IdP

    Figure 3: URLs for configuring the IdP

  4. Look under Configure your Identity Provider (IdP), and note down the URL values for Service provider entity ID and SP-initiated SSO URL.

Set up and configure AD FS

During the SAML authentication process, the browser receives the SAML assertion token from AD FS and forwards it to the SP. In order to pass the claims to the Amazon ES domain, AD FS (the claims provider) and the Amazon ES domain (the relying party) have to establish a trust between them. Then you define the rules for what type of claims AD FS needs to send to the Amazon ES domain. The Amazon ES domain authorizes the user with internal security roles or backend roles, according to the claims in the token.

To configure Amazon ES as a relying party in AD FS

  1. Sign in to the AD FS server. In Server Manager, choose Tools, and then choose AD FS Management.
  2. In the AD FS management console, open the context (right-click) menu for Relying Party Trust, and then choose Add Relying Party Trust.

    Figure 4: Set up a relying party trust

    Figure 4: Set up a relying party trust

  3. In the Add Relying Party Trust Wizard, select Claims aware, and then choose Start.

    Figure 5: Create a claims aware application

    Figure 5: Create a claims aware application

  4. On the Select Data Source page, choose Enter data about the relying party manually, and then choose Next.

    Figure 6: Enter data about the relying party manually

    Figure 6: Enter data about the relying party manually

  5. On the Specify Display Name page, type in the display name of your choice for the relying party, and then choose Next. Choose Next again to move past the Configure Certificate screen. (Configuring a token encryption certificate is optional and at the time of writing, Amazon ES doesn’t support SAML token encryption.)

    Figure 7: Provide a display name for the relying party

    Figure 7: Provide a display name for the relying party

  6. On the Configure URL page, do the following steps.
    1. Choose the Enable support for the SAML 2.0 WebSSO protocol check box.
    2. In the URL field, add the SP-initiated SSO URL that you noted when you enabled SAML authentication in Amazon ES earlier.
    3. Choose Next.

      Figure 8: Enable SAML support and provide the SP-initiated SSO URL

      Figure 8: Enable SAML support and provide the SP-initiated SSO URL

  7. On the Configure Identifiers page, do the following:
      1. For Relying party trust identifier, provide the service provider entity ID that you noted when you enabled SAML authentication in Amazon ES.
      2. Choose Add, and then choose Next.


    Figure 9: Provide the service provider entity ID

    Figure 9: Provide the service provider entity ID

  8. On the Choose Access Control Policy page, choose the appropriate access for your domain. Depending on your requirements, choose one of these options:
    • Choose Permit Specific Group to restrict access to one or more groups in your Active Directory domain based on the Active Directory group.
    • Choose Permit Everyone to allow all Active Directory domain users to access Kibana.

    Note: This step only provides access for the users to authenticate into Kibana. You have not yet set up Open Distro security roles and permissions.


    Figure 10: Choose an access control policy

    Figure 10: Choose an access control policy

  9. On the Ready to Add Trust page, choose Next, and then choose Close.

Now you’ve finished adding Amazon ES as a relying party trust.

To configure claim issuance rules for the relying party during the authentication process, AD FS sends user attributes—claims—to the relying party. With claim rules, you define what claims AD FS can send to the Amazon ES domain. In the following procedure, you create two claim rules: one is to send the incoming Windows account name as the Name ID and the other is to send Active Directory groups as roles.

To configure claim issuance rules

  1. On the Relying Party Trusts page, right-click the relying party trust (in this case, AWS_ES_Kibana) and choose Edit Claim Issuance Policy.

    Figure 11: Edit the claim issuance policy

    Figure 11: Edit the claim issuance policy

  2. Configure the claim rule to send the Windows account name as the Name ID, using these steps.
    1. In the Edit Claim Issuance Policy dialog box, choose Add Rule. The Add Transform Claim Rule Wizard opens.
    2. For Rule Type, choose Transform an Incoming Claim, and then choose Next.
    3. On the Configure Rule page, enter the following information:
      • Claim rule name: NameId
      • Incoming claim type: Windows account name
      • Outgoing claim type: Name ID
      • Outgoing name ID format: Unspecified
      • Pass through all claim values: Select this option
    4. Choose Finish.


    Figure 12: Set the claim rule for Name ID

    Figure 12: Set the claim rule for Name ID

  3. Configure Active Directory groups to send as roles, using the following steps.
    1. In the Edit Claim Issuance Policy dialog box, choose Add Rule. The Add Transform Claim Rule Wizard opens.
    2. For Rule Type, choose Send LDAP Attributes as Claims, and then choose Next.
    3. On the Configure Rule page, enter or choose the following settings:
      • Claim rule name: Send-Groups-as-Roles
      • Attribute store: Active Directory
      • LDAP attribute: Token-Groups – Unqualified Names (to select the group name)
      • Outgoing claim type: Roles (the value for Roles should match the Roles Key that you will set in the Configure SAML in the Amazon ES domain step later in this process)
    4. Choose Finish

      Figure 13: Set claim rule for Active Directory groups as Roles

      Figure 13: Set claim rule for Active Directory groups as Roles

The configuration of AD FS is now complete and you can download the SAML metadata file from AD FS. The SAML metadata is in XML format and is needed to configure SAML in the Amazon ES domain. The AD FS metadata file (the IdP metadata) can be accessed from the following link (replace <AD FS FQDN> with the domain name of your AD FS server). Copy the XML and note down the value of entityID from the XML, as shown in Figure 14. You will need this information in the next steps.

https://<AD FS FQDN>/FederationMetadata/2007-06/FederationMetadata.xml


Figure 14: The value of entityID in the XML file

Figure 14: The value of entityID in the XML file

Configure SAML in the Amazon ES domain

Next, you configure SAML settings in the Amazon Elasticsearch Service console. You need to import the IdP metadata, configure the IdP entity ID, configure the backend role, and set up the Roles key.

To configure SAML setting in the Amazon ES domain

    1. Sign in to the Amazon Elasticsearch Service console. On the Actions menu, choose Modify authentication.
    2. Import the IdP metadata, using the following steps.
      1. Choose Import IdP metadata, and then choose Metadata from IdP.
      2. Paste the contents of the FederationMetadata XML file (the IdP metadata) that you copied earlier in the Add or edit metadata field. You can also choose the Import from XML file button if you have the metadata file on the local disk.

        Figure 15: The imported identity provider metadata

        Figure 15: The imported identity provider metadata

    3. Copy and paste the value of entityID from the XML file to the IdP entity ID field, if that field isn’t autofilled.
    4. For SAML manager backend role (the console may refer to this as master backend role), enter the name of the group you created in AD FS as part of the prerequisites for this post. In this walkthrough, we set the name of the group as admins, and therefore the backend role is admins.

Optionally, you can also provide the user name instead of the backend role.

  1. Set up the Roles key, using the following steps.
    1. Under Optional SAML settings, for Roles key, enter Roles. This value must match the value for Outgoing claim type, which you set when you configured claims rules earlier.

      Figure 16: Set the Roles key

      Figure 16: Set the Roles key

    2. Leave the Subject key field empty to use the NameID element of the SAML assertion for the user name. Keep the defaults for everything else, and then choose Submit.

It can take few minutes to update the SAML settings and for the domain to come back to the active state.

Congratulations! You’ve completed all the SP and IdP configurations.

Sign in to Kibana

When the domain comes back to the active state, choose the Kibana URL in the Amazon ES console. You will be redirected to the AD FS sign-in page for authentication. Provide the user name and password for any of the users in the admins group. The example in Figure 17 uses the credentials for the user [email protected], who is a member of the admins group.

Figure 17: The AD FS sign-in screen with user credentials

Figure 17: The AD FS sign-in screen with user credentials

AD FS authenticates the user and redirect the page to Kibana. If the user has at least one role mapped, you go to the Kibana home page, as shown in Figure 18. In this walkthrough, you mapped the AD FS group admins as a backend role to the manager user. Internally, the Open Distro security plugin maps the backend role admins to the security roles all_access and security_manager. Therefore, the Active Directory user in the admins group is authorized with the privileges of the manager user in the domain. For more granular access, you can create different AD FS groups and map the group names (backend roles) to internal security roles by using Role Mappings in Kibana.

Figure 18: The AD FS user user1@example.com is successfully logged in to Kibana

Figure 18: The AD FS user [email protected] is successfully logged in to Kibana

Note: At the time of writing for this blog post, if you specify the <SingleLogoutService /> details in the AD FS metadata XML, when you sign out from Kibana, Kibana will call AD FS directly and try to sign the user out. This doesn’t work currently, because AD FS expects the sign-out request to be signed with a certificate that Amazon ES doesn’t currently support. If you remove <SingleLogoutService /> from the metadata XML file, Amazon ES will use its own internal sign-out mechanism and sign the user out on the Amazon ES side. No calls will be made to AD FS for signing out.


In this post, we covered setting up SAML authentication for Kibana single sign-on by using Amazon ES and Microsoft AD FS. The integration of IdPs with your Amazon ES domain provides a powerful way to control fine-grained access to your Kibana endpoint and integrate with existing identity lifecycle processes for create/update/delete operations, which reduces the operational overhead required to manage users.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the Amazon Elasticsearch Service forum or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.


Sajeev Attiyil Bhaskaran

Sajeev works closely with AWS customers to provide them architectural and engineering assistance and guidance. He dives deep into big data technologies and streaming solutions and leads onsite and online sessions for customers to design best solutions for their use cases. Outside of work, he enjoys spending time with his wife and daughter.


Jagadeesh Pusapadi

Jagadeesh is a Senior Solutions Architect with AWS working with customers on their strategic initiatives. He helps customers build innovative solutions on the AWS Cloud by providing architectural guidance to achieve desired business outcomes.

Automate Amazon ES synonym file updates

Post Syndicated from Ashwini Rudra original https://aws.amazon.com/blogs/big-data/automate-amazon-es-synonym-file-updates/

Search engines provide the means to retrieve relevant content from a collection of content. However, this can be challenging if certain exact words aren’t entered. You need to find the right item from a catalog of products, or the correct provider from a list of service providers, for example. The most common method of specifying your query is through a text box. If you enter the wrong terms, you won’t match the right items, and won’t get the best results.

Synonyms enable better search results by matching words that all match to a single indexable term. In Amazon Elasticsearch Service (Amazon ES), you can provide synonyms for keywords that your application’s users may look for. For example, your website may provide medical practitioner searches, and your users may search for “child’s doctor” instead of “pediatrician.” Mapping the two words together enables either search term to match documents that contain the term “pediatrician.” You can achieve similar search results by using synonym files. Amazon ES custom packages allow you to upload synonym files that define the synonyms in your catalog. One best practice is to manage the synonyms in Amazon Relational Database Service (Amazon RDS). You then need to deploy the synonyms to your Amazon ES domain. You can do this with AWS Lambda and Amazon Simple Storage Service (Amazon S3).

In this post, we discuss an approach using Amazon Aurora and Lambda functions to automate updating synonym files for improved search results.

Overview of solution

Amazon ES is a fully managed service that makes it easy to deploy, secure, and run Elasticsearch cost-effectively and at scale. You can build, monitor, and troubleshoot your applications using the tools you love, at the scale you need. The service supports open-source Elasticsearch API operations, managed Kibana, integration with Logstash, and many AWS services with built-in alerting and SQL querying.

The following diagram shows the solution architecture. One Lambda function pushes files to Amazon S3, and another function distributes the updates to Amazon ES.

Walkthrough overview

For search engineers, the synonym file’s content is usually stored within a database or in a data lake. You may have data in tabular format in Amazon RDS (in this case, we use Amazon Aurora MySQL). When updates to the synonym data table occur, the change triggers a Lambda function that pushes data to Amazon S3. The S3 event triggers a second function, which pushes the synonym file from Amazon S3 to Amazon ES. This architecture automates the entire synonym file update process.

To achieve this architecture, we complete the following high-level steps:

  1. Create a stored procedure to trigger the Lambda function.
  2. Write a Lambda function to verify data changes and push them to Amazon S3.
  3. Write a Lambda function to update the synonym file in Amazon ES.
  4. Test the data flow.

We discuss each step in detail in the next sections.


Make sure you complete the following prerequisites:

  1. Configure an Amazon ES domain. We use a domain running Elasticsearch version 7.9 for this architecture.
  2. Set up an Aurora MySQL database. For more information, see Configuring your Amazon Aurora DB cluster.

Create a stored procedure to trigger a Lambda function

You can invoke a Lambda function from an Aurora MySQL database cluster using a native function or a stored procedure.

The following script creates an example synonym data table:

CREATE TABLE SynonymsTable (
Base_Term varchar(255),
Synonym_1 varchar(255),
Synonym_2 varchar(255),

You can now populate the table with sample data. To generate sample data in your table, run the following script:

INSERT INTO SynonymsTable(Base_Term, Synonym_1, Synonym_2)
VALUES ( 'danish', 'croissant', 'pastry')

Create a Lambda function

You can use two different methods to send data from Aurora to Amazon S3: a Lambda function or SELECT INTO OUTFILE S3.

To demonstrate the ease of setting up integration between multiple AWS services, we use a Lambda function that is called every time a change occurs that must be tracked in the database table. This function passes the data to Amazon S3. First create an S3 bucket where you store the synonym file using the Lambda function.

When you create your function, make sure you give the right permissions using an AWS Identity and Access Management (IAM) role for the S3 bucket. These permissions are for the Lambda execution role and S3 bucket where you store the synonyms.txt file. By default, Lambda creates an execution role with minimal permissions when you create a function on the Lambda console. The following is the Python code to create the synonyms.txt file in S3:

import boto3
import json
import botocore
from botocore.exceptions import ClientError

s3_resource = boto3.resource('s3')

filename = 'synonyms.txt'
BucketName = '<<provide your bucket name>>
local_file = '/tmp/test.txt'

def lambda_handler(event, context):
    S3_data = (("%s,%s,%s \n") %(event['Base_Term'], event['Synonym_1'], event['Synonym_2']))
    # open  a file and append new line
    except ClientError as e:
        if e.response['Error']['Code'] == "404":
            # create a new file if file does not exits 
            s3_resource.meta.client.put_object(Body=S3_data, Bucket= BucketName,Key=filename)
            # append file
    with open('/tmp/test.txt', 'a') as fd:
    s3_resource.meta.client.upload_file('/tmp/test.txt', BucketName, filename)

Note the Amazon Resource Name (ARN) of this Lambda function to use in a later step.

Give Aurora permissions to invoke a Lambda function

To give Aurora permissions to invoke your function, you must attach an IAM role with the appropriate permissions to the cluster. For more information, see Invoking a Lambda function from an Amazon Aurora DB cluster.

When you’re finished, the Aurora database has access to invoke a Lambda function.

Create a stored procedure and a trigger in Aurora

To create a new stored procedure, return to MySQL Workbench. Change the ARN in the following code to your Lambda function’s ARN before running the procedure:

CREATE PROCEDURE Syn_TO_S3 (IN SysID INT,IN Base_Term varchar(255),IN Synonym_1 varchar(255),IN Synonym_2 varchar(255)) LANGUAGE SQL
   CALL mysql.lambda_async('<<Lambda-Funtion-ARN>>,
    CONCAT('{ "SysID ": "', SysID,
    '", "Base_Term" : "', Base_Term,
    '", "Synonym_1" : "', Synonym_1,
    '", "Synonym_2" : "', Synonym_2,'"}')

When this stored procedure is called, it invokes the Lambda function you created.

Create a trigger TR_SynonymTable_CDC on the table SynonymTable. When a new record is inserted, this trigger calls the Syn_TO_S3 stored procedure. See the following code:

  AFTER INSERT ON SynonymsTable
  SELECT  NEW.SynID, NEW.Base_Term, New.Synonym_1, New.Synonym_2 
  INTO @SynID, @Base_Term, @Synonym_1, @Synonym_2;
  CALL  Syn_TO_S3(@SynID, @Base_Term, @Synonym_1, @Synonym_2);

If a new row is inserted in SynonymsTable, the Lambda function that is mentioned in the stored procedure is invoked.

Verify that data is being sent from the function to Amazon S3 successfully. You may have to insert a few records, depending on the size of your data, before new records appear in Amazon S3.

Update synonyms in Amazon ES when a new synonym file becomes available

Amazon ES lets you upload custom dictionary files (for example, stopwords and synonyms) for use with your cluster. The generic term for these types of files is packages. Before you can associate a package with your domain, you must upload it to an S3 bucket. For instructions on uploading a synonym file for the first time and associating it to an Amazon ES domain, see Uploading packages to Amazon S3 and Importing and Associating packages.

To update the synonyms (package) when a new version of the synonym file becomes available, we complete the following steps:

  1. Create a Lambda function to update the existing package.
  2. Set up an S3 event notification to trigger the function.

Create a Lambda function to update the existing package

We use a Python-based Lambda function that uses the Boto3 AWS SDK for updating the Elasticsearch package. For more information about how to create a Python-based Lambda function, see Building Lambda functions with Python. You need the following information before we start coding for the function:

  • The S3 bucket ARN where the new synonym file is written
  • The Amazon ES domain name (available on the Amazon ES console)

  • The package ID of the Elasticsearch package we’re updating (available on the Amazon ES console)

You can use the following code for the Lambda function:

import logging
import boto3
import os

# Elasticsearch client
client = boto3.client('es')
# set up logging
logger = logging.getLogger('boto3')
# fetch from Environment Variable
package_id = os.environ['PACKAGE_ID']
es_domain_nm = os.environ['ES_DOMAIN_NAME']

def lambda_handler(event, context):
    s3_bucket = event["Records"][0]["s3"]["bucket"]["name"]
    s3_key = event["Records"][0]["s3"]["object"]["key"]
    logger.info("bucket: {}, key: {}".format(s3_bucket, s3_key))
    # update package with the new Synonym file.
    up_response = client.update_package(
            'S3BucketName': s3_bucket,
            'S3Key': s3_key
        CommitMessage='New Version: ' + s3_key
    logger.info('Response from Update_Package: {}'.format(up_response))
    # check if the package update is completed
    finished = False
    while finished == False:
        # describe the package by ID
        desc_response = client.describe_packages(
                    'Name': 'PackageID',
                    'Value': [package_id]
        status = desc_response['PackageDetailsList'][0]['PackageStatus']
        logger.info('Package Status: {}'.format(status))
        # check if the Package status is back to available or not.
        if status == 'AVAILABLE':
            finished = True
            logger.info('Package status is now Available. Exiting loop.')
            finished = False
    logger.info('Package: {} update is now Complete. Proceed to Associating to ES Domain'.format(package_id))
    # once the package update is completed, re-associate with the ES domain
    # so that the new version is applied to the nodes.
    ap_response = client.associate_package(
    logger.info('Response from Associate_Package: {}'.format(ap_response))
    return {
        'statusCode': 200,
        'body': 'Custom Package Updated.'

The preceding code requires environment variables to be set to the appropriate values and the IAM execution role assigned to the Lambda function.

Set up an S3 event notification to trigger the Lambda function

Now we set up event notification (all object create events) for the S3 bucket in which the updated synonym file is uploaded. For more information about how to set up S3 event notifications with Lambda, see Using AWS Lambda with Amazon S3.

Test the solution

To test our solution, let’s consider an Elasticsearch index (es-blog-index-01) that consists of the following documents:

  • tennis shoe
  • hightop
  • croissant
  • ice cream

A synonym file is already associated with the Amazon ES domain via Amazon ES custom packages and the index (es-blog-index-01) has the synonym file in the settings (analyzer, filter, mappings). For more information about how to associate a file to an Amazon ES domain and use it with the index settings, see Importing and associating packages and Using custom packages with Elasticsearch. The synonym file contains the following data:

danish, croissant, pastry

Test 1: Search with a word present in the synonym file

For our first search, we use a word that is present in the synonym file. The following screenshot shows that searching for “danish” brings up the document croissant based on a synonym match.

Test 2: Search with a synonym not present in the synonym file

Next, we search using a synonym that’s not in the synonym file. In the following screenshot, our search for “gelato” yields no result. The word “gelato” doesn’t match with the document ice cream because no synonym mapping is present for it.

In the next test, we add synonyms for “ice cream” and perform the search again.

Test 3: Add synonyms for “ice cream” and redo the search

To add the synonyms, let’s insert a new record into our database. We can use the following SQL statement:

INSERT INTO SynonymsTable(Base_Term, Synonym_1, Synonym_2)
VALUES ('frozen custard', 'gelato', 'ice cream')

When we search with the word “gelato” again, we get the ice cream document.

This confirms that the synonym addition is applied to the Amazon ES index.

Clean up resources

To avoid ongoing charges to your AWS account, remove the resources you created:

  1. Delete the Amazon ES domain.
  2. Delete the RDS DB instance.
  3. Delete the S3 bucket.
  4. Delete the Lambda functions.


In this post, we implemented a solution using Aurora, Lambda, Amazon S3, and Amazon ES that enables you to update synonyms automatically in Amazon ES. This provides central management for synonyms and ensures your users can obtain accurate search results when synonyms are changed in your source database.

About the Authors

Ashwini Rudra is a Solutions Architect at AWS. He has more than 10 years of experience architecting Windows workloads in on-premises and cloud environments. He is also an AI/ML enthusiast. He helps AWS customers, namely major sports leagues, define their cloud-first digital innovation strategy.



Arnab Ghosh is a Solutions Architect for AWS in North America helping enterprise customers build resilient and cost-efficient architectures. He has over 13 years of experience in architecting, designing, and developing enterprise applications solving complex business problems.



Jennifer Ng is an AWS Solutions Architect working with enterprise customers to understand their business requirements and provide solutions that align with their objectives. Her background is in enterprise architecture and web infrastructure, where she has held various implementation and architect roles in the financial services industry.

Increase Amazon Elasticsearch Service performance by upgrading to Graviton2

Post Syndicated from Zachariah Elliott original https://aws.amazon.com/blogs/big-data/increase-amazon-elasticsearch-service-performance-by-upgrading-to-graviton2/

Amazon Elasticsearch Service (Amazon ES) supports multiple instance types based on your use case. In 2021, AWS announced general purpose (M6g), compute optimized (C6g), and memory optimized (R6g, R6gd) instance types for Amazon ES version 7.9 or later powered by AWS Graviton2 processors, which delivers a major leap in capabilities and better price/performance improvement over previous generation instances.

Graviton2 instances are built using custom silicon designed by Amazon. These instances are Amazon-designed hardware and software innovations that enable the delivery of efficient, flexible, and secure cloud services with isolated multi-tenancy, private networking, and fast local storage. You can launch Graviton2 instances via the Amazon ES console, the AWS Command Line Interface (AWS CLI), AWS API, AWS CloudFormation, or the AWS Cloud Development Kit (AWS CDK). You can change your existing Amazon ES instance types to Graviton2 using a blue/green deployment process, which minimizes downtime and maintains the original environment in the event of unsuccessful deployments.

In this post, we review prerequisites and considerations to upgrade your existing Amazon ES instances to Graviton2 with minimal downtime.

Why move to Graviton2?

The following are some of the reasons you should move to Graviton2:

  • You can enjoy up to 38% improvement in indexing throughput compared to the corresponding x86-based counterparts
  • The Graviton2 instance family provides up to 50% reduction in indexing latency, and up to 30% improvement in query performance when compared to the current generation (M5, C5, R5)
  • Amazon ES Graviton2 instances provide up to 44% price/performance improvement over previous generation instances
  • Graviton2 instances include support for all recently launched features like encryption at rest and in flight, role-based access control, cross-cluster search, Auto-Tune, Trace Analytics, Kibana Reporting, and UltraWarm

Solution overview

For this post, let’s consider a use case in which we have an Amazon ES cluster running version 7.4 with three data nodes and two primary nodes.

As a general best practice, we recommend testing the process in a non-production environment followed by validation tests to make sure everything is configured and operating as per your expectations before making changes to the production environment. We also recommend creating a snapshot of your cluster before performing upgrades or modifying the instance type to minimize the risk of data loss.

In this post, we walk you through the following steps:

  1. Upgrade the Amazon ES cluster (if needed):
    1. Determine if the current cluster version meets the minimum required version (7.9 or later) for moving to Graviton2.
    2. Upgrade the Amazon ES domain to the required minimum version.
  2. Modify the instance type of your cluster nodes.
  3. Confirm that your applications work correctly with the upgraded cluster.
  4. Roll back to the previous instance types if compatibility issues are discovered.

Upgrade Amazon ES versions

To take advantage of Graviton2-based Amazon ES instances, your cluster must be running Amazon ES version 7.9 and above and service software R20210331 or later (as of this post). For the latest updates of this information, see Supported instance types in Amazon Elasticsearch Service. For upgrade considerations, compatibilities, and instructions, see Upgrading Elasticsearch.

For our use case, our cluster is running version 7.4. We can confirm the version via the AWS CLI or Amazon ES console, as in the following screenshot.

To upgrade your domain, choose Upgrade domain on the Actions menu. You can then choose what version to upgrade to, or verify your cluster can be upgraded. The upgrade process takes some time depending on the size of your cluster.

If you prefer to use the AWS CLI, you can perform the same steps. To get a list of all valid upgrade targets for a current version using the AWS CLI, use the describe-elasticsearch-domain command.

The following describe-elasticsearch-domain example provides configuration details for a given domain:

aws es describe-elasticsearch-domain \
    --domain-name demo

If the cluster version is less than 7.9, use the upgrade-elasticsearch-domain command to upgrade your domain:

aws es upgrade-elasticsearch-domain \
--domain-name demo
--target-version 7.9

You can track the progress of the Amazon ES domain upgrade using API calls to Amazon ES. For more information, see Why is my Amazon Elasticsearch Service domain upgrade taking so long?

Modify instances

At the time of writing, you can’t mix x86 and Graviton2-based Amazon ES instances with the primary and data nodes. As such, both data nodes and primary nodes are modified at the same time. To modify your nodes, complete the following steps:

  1. On the Amazon ES console, go to the domain you want to upgrade.
  2. Choose Edit domain.

  1. In the Data nodes section, for Instance type, change your data nodes to Graviton 2 instance types. In our case, we upgrade from r5.large.elasticsearch to r6g.large.elasticsearch.

  1. In the Dedicated master nodes section, for Instance type, change your dedicated primary nodes to Graviton 2 instance types. In our case, we upgrade from r5.large.elasticsearch to r6g.large.elasticsearch.

  1. Choose Submit.

The cluster goes into a processing state. During this time, you can monitor the Cluster health tab to see your number of nodes increase. In our case, our cluster has two dedicated primary nodes and three data nodes (five total).

During deployment, Amazon ES performs a blue/green deployment. This ensures any errors encountered during modification can be rolled back. You can continue to use the cluster during this time, however there may be a brief service interruption when the cluster switches to the new dedicated primary nodes. During blue/green deployment, you’re charged for both instance types, and then only the new instance type going forward.

After the modification finishes successfully, you can verify both the primary and data nodes are using Graviton2 instances.

Validate and confirm the application works correctly

You can now validate Amazon ES is performing as expected with your application. You can check the Cluster health tab for metrics related to cluster performance and observe if you’re not seeing the expected performance.

Perform rollback

In the rare scenario in which issues are discovered with the Graviton2-based Amazon ES cluster, such as application compatibility or data issues, you can perform the same steps to change the cluster back to the original node type.


This post shared a step-by-step guide to migrate your Amazon ES cluster to Graviton2-based nodes, as well as some key considerations when modifying your cluster. We also talked about how to upgrade your cluster to the latest version of Amazon ES to take advantage of Graviton 2, as well as other features such as UltraWarm and cold storage. As always, make sure you fully test compatibility with your application and these newer versions of Amazon ES, and per best practices, always perform upgrades in a lower environment before making these changes in a production environment.

Additional resources

For more information, see the following:

About the Authors

Zachariah Elliott works as a Solutions Architect focusing on EdTech at AWS. He is passionate about helping customers build Well-Architected solutions on AWS. He is also part of the IoT Subject Matter Expert community at AWS and loves helping customers develop unique IoT-based solutions.


Pranusha Manchala is a Solutions Architect at AWS who works with education companies. She has worked with many EdTech customers and provided them with architectural guidance for building highly scalable and cost-optimized applications on AWS. She found her interests in machine learning and started to dive deep into this technology. She enjoys cooking, baking, and outdoor activities in her free time.

Preprocess logs for anomaly detection in Amazon ES

Post Syndicated from Kapil Pendse original https://aws.amazon.com/blogs/big-data/preprocess-logs-for-anomaly-detection-in-amazon-es/

Amazon Elasticsearch Service (Amazon ES) supports real-time anomaly detection, which uses machine learning (ML) to proactively detect anomalies in real-time streaming data. When used to analyze application logs, it can detect anomalies such as unusually high error rates or sudden changes in the number of requests. For example, a sudden increase in the number of food delivery orders from a particular area could be due to weather changes or due to a technical glitch experienced by users from that area. The detection of such an anomaly can facilitate quick investigation and remediation of the situation.

The anomaly detection feature of Amazon ES uses the Random Cut Forest algorithm. This is an unsupervised algorithm that constructs decision trees from numeric input data points in order to detect outliers in the data. These outliers are regarded as anomalies. To detect anomalies in logs, we have to convert the text-based log files into numeric values so that they can be interpreted by this algorithm. In ML terminology, such conversion is commonly referred to as data preprocessing. There are several methods of data preprocessing. In this post, I explain some of these methods that are appropriate for logs.

To implement the methods described in this post, you need a log aggregation pipeline that ingests log files into an Amazon ES domain. For information about ingesting Apache web logs, see Send Apache Web Logs to Amazon Elasticsearch Service with Kinesis Firehose. For a similar method for ingesting and analyzing Amazon Simple Storage Service (Amazon S3) server access logs, see Analyzing Amazon S3 server access logs using Amazon ES.

Now, let’s discuss some data preprocessing methods that we can use when dealing with complex structures within log files.

Log lines to JSON documents

Although they’re text files, usually log files have some structure to the log messages, with one log entry per line. As shown in the following image, a single line in a log file can be parsed and stored in an Amazon ES index as a document with multiple fields. This image is an example of how an entry in an Amazon S3 access log can be converted into a JSON document.

Although you can ingest JSON documents such as the preceding image as is into Amazon ES, some of the text fields require further preprocessing before you can use them for anomaly detection.

Text fields with nominal values

Let’s assume your application receives mostly GET requests and a much smaller number of POST requests. According to an OWASP security recommendation, it’s also advisable to disable TRACE and TRACK request methods because these can be misused for cross-site tracing. If you want to detect when unusual HTTP requests appear in your server logs, or when there is a sudden spike in the number of HTTP requests with methods that are normally a minority, you could do so by using the request_uri or operation fields in the preceding JSON document. These fields contain the HTTP request method, but you have to extract that and convert that into a numeric format that can be used for anomaly detection.

These are fields that have only a handful of different values, and those values don’t have any particular sequential order. If we simply convert HTTP methods to an ordered list of numbers, like GET = 1, POST = 2, and so on, we might confuse the anomaly detection algorithm into thinking that POST is somehow greater than GET, or that GET + GET equals POST. A better way to preprocess such fields is one-hot encoding. The idea is to convert the single text field into multiple binary fields, one for every possible value of the original text field. In our example, the result of this one-hot encoding is a set of nine binary fields. If the value of the field in the original log is HEAD, only the HEAD field in the preprocessed data has value 1, and all other fields are zero. The following table shows some examples.

Original Log Message Preprocessed into multiple one-hot encoded fields

HTTP Request Method

GET 1 0 0 0 0 0 0 0 0
POST 0 0 1 0 0 0 0 0 0
OPTIONS 0 0 0 0 0 0 1 0 0

These generated fields data can be then processed by the Amazon ES anomaly detection feature to detect anomalies when there is a change in the pattern of HTTP requests received by your application, for example an unusually high number of DELETE requests.

Text fields with a large number of nominal values

Many log files contain HTTP response codes, error codes, or some other type of numeric codes. These codes don’t have any particular order, but the number of possible values is quite large. In such cases, one-hot encoding alone isn’t suitable because it can cause an explosion in the number of fields in the preprocessed data.

Take for example the HTTP response codes. The values are unordered, meaning that there is no particular reason for 200 being OK and 400 being Bad Request. 200 + 200 != 400 as far as HTTP response codes go. However, the number of possible values is quite large—more than 60. If we use the one-hot encoding technique, we end up creating more than 60 fields out of this 1 field, and it quickly becomes unmanageable.

However, based on our knowledge of HTTP status codes, we know that these codes are by definition binned into five ranges. Codes in the range 100–199 are informational responses, codes 200–299 indicate successful completion of the request, 300–399 are redirections, 400–499 are client errors, and 500–599 are server errors. We can take advantage of this knowledge and reduce the original values to five values, one for each range (1xx, 2xx, 3xx, 4xx and 5xx). Now this set of five possible values is easier to deal with. The values are purely nominal. Therefore, we can additionally one-hot encode these values as described in the previous section. The result after this binning and one-hot encoding process is something like the following table.

Original Log Message Preprocessed into multiple fields after binning and one-hot encoding
HTTP Response Status Code 1xx 2xx 3xx 4xx 5xx
100 (Continue) 1 0 0 0 0
101 (Switching Protocols) 1 0 0 0 0
200 (OK) 0 1 0 0 0
202 (Accepted) 0 1 0 0 0
301 (Moved Permanently) 0 0 1 0 0
304 (Not Modified) 0 0 1 0 0
400 (Bad Request) 0 0 0 1 0
401 (Unauthorized) 0 0 0 1 0
404 (Not Found) 0 0 0 1 0
500 (Internal Server Error) 0 0 0 0 1
502 (Bad Gateway) 0 0 0 0 1
503 (Service Unavailable) 0 0 0 0 1

This preprocessed data is now suitable for use in anomaly detection. Spikes in 4xx errors or drops in 2xx responses might be especially important to detect.

The following Python code snippet shows how you can bin and one-hot encode HTTP response status codes:

def http_status_bin_one_hot_encoding(http_status):
    # returns one hot encoding based on http response status bin
    # bins are: 1xx, 2xx, 3xx, 4xx, 5xx
    if 100 <= http_status <= 199: # informational responses
        return (1, 0, 0, 0, 0)
    elif 200 <= http_status < 299: # successful responses
        return (0, 1, 0, 0, 0)
    elif 300 <= http_status < 399: # redirects
        return (0, 0, 1, 0, 0)
    elif 400 <= http_status < 499: # client errors
        return (0, 0, 0, 1, 0)
    elif 500 <= http_status < 599: # server errors
        return (0, 0, 0, 0, 1)

http_1xx, http_2xx, http_3xx, http_4xx, http_5xx = http_status_bin_one_hot_encoding(status)

log_entry = {
    'timestamp': timestamp,
    'bucket': "somebucket",
    'key': "somekey",
    'operation': "REST.GET.VERSIONING",
    'request_uri': "GET /awsexamplebucket1?versioning HTTP/1.1",
    'status_code': status,
    'http_1xx': http_1xx,
    'http_2xx': http_2xx,
    'http_3xx': http_3xx,
    'http_4xx': http_4xx,
    'http_5xx': http_5xx,
    'error_code': "-",
    'bytes_sent': 113,
    'object_size': 0

Text fields with ordinal values

Some text fields in log files contain values that have a relative sequence. For example, a log level field might contain values like TRACE, DEBUG, INFO, WARN, ERROR, and FATAL. This is a sequence of increasing severity of the log message. As shown in the following table, these string values can be converted to numeric values in a way that retains this relative sequence.

Log Level (Original Log Message)

Preprocessed Log Level













IP addresses

Log files often have IP addresses that can contain a large number of values, and it doesn’t make sense to bin these values together using the method described in the previous section. However, these IP addresses might be of interest from a geolocation perspective. It might be important to detect an anomaly if an application starts getting accessed from an unusual geographic location. If geographic information like country or city code isn’t directly available in the logs, you can get this information by geolocating the IP addresses using third-party services. Effectively, this is a process of binning the large number of IP addresses into a considerably smaller number of country or city codes. Although these country and city codes are still nominal values, they can be used with the cardinality aggregation of Amazon ES.

After we apply these preprocessing techniques to our example Amazon S3 server access logs, we get the resulting JSON log data:

    "bucket_owner": "", //string
    "bucket": "awsexamplebucket1", //string
    "timestamp": "06/Feb/2019:00:00:38 +0000",
    "remote_ip": "", //string
    "country_code": 100, //numeric field generated during pre-processing
    "requester": "", //string
    "request_id": "3E57427F3EXAMPLE",
    "operation": "REST.GET.VERSIONING",
    "key": "-",
    "request_uri": "GET /awsexamplebucket1?versioning HTTP/1.1",
    "http_method_get": 1, //nine one-hot encoded fields generated during pre-processing
    "http_method_post": 0,
    "http_method_put": 0,
    "http_method_delete": 0,
    "http_method_head": 0,
    "http_method_connect": 0,
    "http_method_options": 0,
    "http_method_trace": 0,
    "http_method_patch": 0,
    "http_status": 200,
    "http_1xx": 0, //five one-hot encoded fields generated during pre-processing
    "http_2xx": 1,
    "http_3xx": 0,
    "http_4xx": 0,
    "http_5xx": 0,
    "error_code": "-",
    "bytes_sent": 113,
    "object_size": "-",
    "total_time": 7,
    "turn_around_time": "-",
    "referer": "-",
    "user_agent": "S3Console/0.4",
    "version_id": "-",
    "host_id": "", //string
    "signature_version": "SigV2",
    "cipher_suite": "ECDHE-RSA-AES128-GCM-SHA256",
    "authentication_type": "AuthHeader",
    "host_header": "awsexamplebucket1.s3.us-west-1.amazonaws.com",
    "tls_version": "TLSV1.1"

This data can now be ingested and indexed into an Amazon ES domain. After you set up the log preprocessing pipeline, the next thing to configure is an anomaly detector. Amazon ES anomaly detection allows you to specify up to five features (fields in your data) in a single anomaly detector. This means the anomaly detector can learn patterns in data based on the values of up to five fields.


You must specify an appropriate aggregation function for each feature. This is because the anomaly detector aggregates the values of all documents ingested in each detector interval to produce a single aggregate value, and then that value is used as the input to the algorithm that automatically learns the patterns in data. The following diagram depicts this process.

After you configure the right features and corresponding aggregation functions, the anomaly detector starts to initialize. After processing a sufficient amount of data, the detector enters the running state.

To help you get started with anomaly detection on your own logs, the following table shows the preprocessing techniques and aggregation functions that might make sense for some common log fields.

Log Field Name Preprocessing Aggregation
HTTP response status code One-hot encoding sum
Client IP address IP geolocation to a country or city code cardinality
Log Message Level (INFO, WARN, ERR, FATAL etc.) One-hot encoding sum
Error or Exception names Map to numeric codes, additional binning and one-hot encoding if there are large number of possible values cardinality if using single numeric code field; sum if using one-hot encodings
Object Size / Bytes Sent / Content-Length None, use numeric value itself min, max, average
To monitor general traffic levels, you can use any numeric field like response code or bytes sent to count the number of log entries per detector interval None, use numeric value itself count (value_count) – simply counts the number of documents that have a value for this field


IT teams can use the anomaly detection feature of Amazon ES to implement proactive monitoring and alerting for applications and infrastructure logs. Anyone with basic scripting or programming skills should be able to implement the log preprocessing techniques discussed in this post—you don’t need to have in-depth knowledge of ML or data science. The anomaly detection feature is available in Amazon ES domains running Elasticsearch version 7.4 or later. To get started, see Anomaly detection in Amazon Elasticsearch Service.

About the Author

Kapil Pendse is a Senior Solutions Architect with Amazon Web Services (Singapore) and has over 15 years of experience building technology solutions across multiple domains such as cloud computing, embedded systems, and machine learning. In his free time, Kapil likes to bike along Singapore’s coastal parks and enjoys the occasional company of otters.

Field Notes: Data-Driven Risk Analysis with Amazon Neptune and Amazon Elasticsearch Service

Post Syndicated from Adriaan de Jonge original https://aws.amazon.com/blogs/architecture/field-notes-data-driven-risk-analysis-with-amazon-neptune-and-amazon-elasticsearch-service/

This blog post is co-authored with Charles Crouspeyre and Angad Srivastava. Charles is Director at Accenture Applied Intelligence and ASEAN AI SME (Subject Matter Expert) and Angad is Data and Analytics Consultant at AWS and NLP (Natural Language Processing) expert. Together, they are the lead architects of the solution presented in this blog.

In this blog, you learn how Amazon Neptune as a graph database, combined with Amazon Elasticsearch Service (Amazon ES) for full text indexing helps you shorten risk analysis processes from weeks to minutes. We give a walk-through of the steps involved in creating this knowledge management solution, which includes natural language processing components.

The business problem

Our Energy customer needs to do a risk assessment before acquiring raw materials that will be processed in their equipment. The process includes assessing the inventory of raw materials, the capacity of storage units, analyzing the performance of the processing units, and quality assurance of the end product. The cycle time for a comprehensive risk analysis across different teams working in silos is more than 2 weeks, while the window of opportunity for purchasing is a couple of days. So, the customer either puts their equipment and personnel at risk or misses good buying opportunities.

The solution described in this blog helps our customer improve and speed up their decision making. This is done through an automated analysis and understanding of the documents and information they have gathered over the years. They use Natural Language Processing (NLP) to analyze and better understand the documents which is discussed later on in this blog.

Our customer has accumulated years of documents that were mostly in silos across the organization: emails, SharePoint, local computer, private notes, and more.

The data is so heterogenous and widespread that it became hard for our customer to retrieve the right information in a timely manner. Our objective was to create a platform centralizing all this information, and to facilitate present and future information retrieval. Making informed decisions on time helps our customer to purchase raw materials at a better price, increasing their margins significantly.

Overview of business solution

To understand the tasks involved, let’s look at the high-level platform workflow:

Figure 1: This illustration visualizes a 4-step process consisting of Hydrate, Analyze, Search and Feedback.

Figure 1: This illustration visualizes a 4-step process consisting of Hydrate, Analyze, Search and Feedback.

We can summarize our workflow as a 4-step process:

  1. Hydrate: where we extract the information from multiple sources and do a first level of processing such as document scanning and natural language processing (NLP).
  2. Analyze: where the information extracted from the hydration step is ingested and merged with existing information.
  3. Search: where information is retrieved from the system based on user queries, by leveraging our knowledge graph and the concept map representation that we have created.
  4. Feedback: where users can rate the results for the system as good or bad. The feedback is collected and used to update the Knowledge graph, to re-train our models or to improve our query matching layer.

High-level technical architecture

The following architecture consists of a traditional data layer, combined with a knowledge layer. The compute part of the solution is serverless. The database storage part requires long-running solutions.

Figure 2: A diagram visualizing the steps involved in data processing across two layers, the data layer and the knowledge layer and their implementations with AWS services.

Figure 2: A diagram visualizing the steps involved in data processing across two layers, the data layer and the knowledge layer and their implementations with AWS services.

The data layer of our application is similar to many common data analytics setups, and includes:

  • An ingestion and normalization component, implemented with AWS Lambda, fronted by Amazon API Gateway and AWS Transfer Family
  • An ETL component, implemented with AWS Glue and AWS Lambda
  • A data enhancement component, implemented with Lambda
  • An analytics component, implemented with Amazon Redshift
  • A knowledge query component, implemented with Lambda
  • A user interface, a custom implementation based on React

Where our solution really adds value, is the knowledge layer, which is what we will focus on in this blog. We created this specifically for our knowledge representation and management. This layer consists of the following:

  • The knowledge extraction block, where the raw text is extracted, analyzed and classified into facts and structured data. This is implemented using Amazon SageMaker and Amazon Comprehend.
  • The knowledge repository, where the raw data is saved and kept is Amazon Simple Storage Service (Amazon S3).
  • The relationship, and knowledge extraction, and indexing, where the facts extracted earlier are analyzed and added to our knowledge graph.  This is implemented with a combination of Neptune, Amazon S3, Amazon DocumentDB (with MongoDB compatibility) and Amazon ES. Neptune is used as a property graph, queried with the Gremlin graph traversal language.
  • The knowledge aggregator, where we leverage both our knowledge graph and business representation to extract facts to associate with the user query, and rank information based on their relevance. This is implemented leveraging Amazon ES.

The last component, the knowledge aggregator, is fundamental for our infrastructure. In general, when we talk about information retrieval system – a system designed to supply the right information in the hands of users at the right time – there are two common approaches:

  1. Keyword-based search: take the user query and search for the presence of certain keywords from the query in the available documents.
  2. Concept-based search: build a business-related taxonomy to extend the keyword-based search into a business-related concept-based search.

The downside of a keyword-based search is that it does not capture the complexity and specificity of the business domain in which the query occurs.  Due to this limitation, we chose to go with a concept-based search approach as it allows us to inject a layer of business understanding to our ingestion and information retrieval.

Knowledge layer deep-dive

Because the value added from our solution is in the knowledge layer, let’s dive deeper into the details of this layer.

Figure 3: An architecture diagram of the knowledge layer of the solution, classified in 3 categories: ingestion, knowledge representation and retrieval

Figure 3: An architecture diagram of the knowledge layer of the solution, classified in 3 categories: ingestion, knowledge representation and retrieval

The architecture in Figure 3 describes the technical solution architecture broken down into 3 key steps. The 3 steps are:

  1. Ingestion
  2. Knowledge representation
  3. Retrieval

Another way to approach the problem definition is by looking at the process flow for how raw data/information flows through the system to generate the knowledge layer. Figure 4 gives an example of how the information is broadly treated as it progresses the logical phases of the process flow.


Figure 6: Context based Knowledge Graph Generation

Figure 4: An illustration of how information is extracted from an unstructured document, modeled as a graph and visualized in a business-friendly and concise format.

In this example, we can recognize a raw material of type “Champion” and detect a relationship between this entity and another entity of type “basic nitrogen”. This relationship is classified as the type “is characterized by”.

The facts in the parsed content are then classified into different categories of relevancy based on the contextual information contained in the text i.e., an important paragraph that mentions a potential issue will get classified as a key highlight with a high degree of relevancy.

This paragraph text is further analyzed to recognize and extract entities mentioned such as “Champion” and “basic nitrogen”; and to determine the semantic relationship between these entities based on the context of the paragraph i.e., “characterized by” and, “incompatibility due to low levels of basic nitrogen”.

There is a correlation between the different steps of the technical architecture versus the phases in the information analysis process. So we will present them together.

This table shows the correlation between the different steps of the technical architecture versus the phases in the information analysis process.

  • During the Ingestion step in the technical solution architecture, the aim is to process the incoming raw data in any format as defined in the Extract Information phase of the information analysis process flow.
  • Once the data ingestion has occurred, the next step is to capture the knowledge representation. The contextualize information phase of the information analysis process flow helps ensure that comprehensive and accurate knowledge representation occurs within the system.
  • The last step for the solution is to then facilitate retrieval of information by providing appropriate interfaces for interacting with the knowledge representation within the system. This is facilitated by the assemble information phase of the Information Analysis process.

To further understand the proposed solution, let us review the steps and the associated process flow phases.

Technical Architecture Step 1: Ingestion

Information comes in through the ingestion pipeline from various sources, such as websites, reports, news, blogs and internal data. Raw data enters the system either through automated API-based integrations with external websites or internal systems like Microsoft SharePoint, or can be ingested manually through AWS Transfer Family. Once a new piece of data has been ingested into the solution, it initiates the process for extracting information from the raw data.

Information Analysis Phase 1: Extract information

Once the information lands in our system, the knowledge representation process starts with our Lambda functions acting as the orchestrator between other components. Amazon SageMaker was initially used to create custom models for document categorization and classification of ingested unstructured files.

For example, an unstructured file that is ingested into our system gets recognized as a new email (one of the acceptable data sources) and is classified as “compatibility highlights” based on the email contents. But with improvements in the capabilities of Amazon Comprehend managed service, the need for custom model development, maintenance, and machine learning operations (MLOps) could be reduced. The solution now uses Amazon Comprehend with custom training for the initial step of document categorization and information extraction. Additionally, Amazon Comprehend was also used to create custom named-entity recognition models, that were trained to recognize custom raw materials and properties.

In this example, an unstructured pdf document is ingested into our system as illustrated in Figure 5.

Example of an unstructured pdf document being ingested into our system

Figure 5: Phase 1 – Information Extraction

Amazon Comprehend analyzes the unstructured document, classifies its contents and extracts a specific piece of information regarding a type of raw material called “Champion”. This has an inherent property called “low basic nitrogen” associated with it.

Technical Architecture Step 2: Knowledge representation

Knowledge representation is the process of extracting semantic relationships between the various information/data elements within a piece of raw data. It then incorporates it into the existing layers of knowledge already identified and stored. Based on the categorization of the document, the raw text is pre-processed and parsed into logical units. The parsed data is then analyzed in our NLP layer for content identification and fact classification.

The facts and key relationships deduced from the results of both Amazon Comprehend are returned back to the Lambda functions, which in-turn store the detected facts to the knowledge graph.

Information Analysis Phase 2: Contextualize information

Once the information is extracted from the document; our first step is to contextualize the information using our business representation in the form of a taxonomy. The system detects different parts and entities that the paragraph is composed of, and structures the information into our knowledge graph as illustrated in Figure 6.

Figure 6: Context based Knowledge Graph Generation

Figure 6: Context based Knowledge Graph Generation

This data extraction process is repeated iteratively, so that the knowledge graph grows over time through the detection of new facts and relationships. When we ingest new data into our knowledge graph, we search our knowledge graph for similar entities. If a similar entity exists, we analyze the type of relationships and properties both entities have. When we observe sufficient similarities between the entities, we associate relationships from one entity to the other.

For example, a new entity “Crude A” which has the properties – level of basic nitrogen and level of sulfur is ingested. Next, we have “Champion”, as described above, which has similar levels of basic nitrogen and a property “risk” associated to it. Based on the existing knowledge graph, we can now infer that there is a high probability that “Crude A” has a similar risk associated to it as shown in Figure 7.

Figure 7: Crude Knowledge Graph Representation

Figure 7: Crude Knowledge Graph Representation

The probability calculations can take multiple factors into consideration to make the process more accurate. This makes the structure of the knowledge graph quite dynamic and evolve automatically.

The complete raw data is also stored in Amazon ES as a secondary implementation to perform free form queries. This process helps ensure that all the relevant information for any extracted factoid associated with an entity in the knowledge graph is completely represented with the system. Some of this information may not exist in the knowledge graph because the document data extraction model can’t capture all the relevant information. One reason for such a problem to occur can be poor quality of the source document making automated reading of documents and data extraction difficult. Another reason can be the sophistication of the Amazon Comprehend models.

Technical Architecture Step 3: Retrieval

To retrieve information, the user query is analyzed by the Lambda function on the right side of Figure 3. Based on the analysis, key terms are identified from the user query for which a search needs to be performed. For example, if the query provided is “What is the likelihood of damage due to processing Champion in location A”, semantic analysis of the query will indicate that we are looking for relationships between entities Champion, any type of risks, any known incidents at location A and possible mitigations to reduce identified risks.

To address the query, the information then needs to compiled together from the existing knowledge graph as well as Amazon ES to provide a complete answer.

Information Analysis Phase 3: Assemble information

Figure 8 illustrates the output of Information assembly process.

Figure 8: "Champion" crude assembled information for property Nitrogen

Figure 8: “Champion” crude assembled information for property Nitrogen

Based on the facts available within the knowledge graph, we have identified that for “Champion” there is a possibility of damage occurring “due to increased pressure drop and loss of heat transfer” but this can be mitigated by “blending champion to meet basic nitrogen levels”.

In addition, say there was information available about “Crude B” that has been processed at “location A”. This also originated from “Brunei” and had similar “Nitrogen” and properties such as “Kerogen3”, “napthenic” and had a processing incident causing damage. We can then conclude by looking at the information stored within the knowledge graph and Amazon ES, that there are other possibilities for damage to occur due to processing of “Champion” at “Location A” as well.

Once all the relevant pieces of information have been collected, a sorted list of information is sent back to the user interface to be displayed.

Fact reconciliation

In reality, it is possible that new information contradicts existing information, which causes conflicts during ingestion. There are various ways to handle such contradictions, for example:

Figure 5: Visualizations of four illustrative ways to deal with contradictory new facts.

Figure 9: Visualizations of four illustrative ways to deal with contradictory new facts.

  1. Assume the latest data is the most accurate, by looking at the timestamp of each data point. This makes it possible to update the list of facts in our knowledge graph
  2. Analyze how new facts alter the properties or relationships of existing facts and update them or create a relationship between nodes
  3. Calculate a reliability score for the available sources, to rank the fact based on who has provided them
  4. Ask for end user feedback through the user interface

In our solution, we have mechanism 1, 2, and 4. Mechanisms 1 and 2 are implemented within the contextualize information phase of the information analysis process.

Mechanism 4 is implemented in the search results use interface where the user has a ‘thumbs up’ and ‘thumbs down’ button to classify the different search results as relevant or not. This information is then fed back into the Amazon Comprehend model, the knowledge graph as well as captured within Amazon ES to optimize subsequent search results.

Over time, mechanism 4 can be expanded to capture more detailed feedback including corrections to the search result instead of a simple yes/no feedback mechanism.  Such enhancements to mechanism 4 and the implementation of Mechanism 3 can be a possible future enhancement for the solution proposed.


Our customer needed help to shorten their risk analysis process to make high-impact purchase decisions for raw materials. Our knowledge management solution helped them extract knowledge from their vast set of documents and make it available in knowledge graph format, for risk analysts to analyze. Knowledge graphs are a great way to handle this “domain specificity”. It helps extract information during the ingestion phase. It also helps contextualize queries during the retrieval phase.

The possibilities are endless. One thing is certain: we’d encourage you to use graph databases with Neptune supported by Amazon ES for your use cases with highly connected data!

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.


Charles Crouspeyre

Charles Crouspeyre is leading the AI Engineering practice for Accenture in ASEAN, where he is helping companies from all industries think and deploy their AI ambitions. When not working, he likes to spend time with his young-aged daughter reading/drawing/cooking/singing/playing hide & seek with her as she “requests”.

Angad Srivastava

Angad Srivastava

Angad Srivastava is a Data and Analytics Consultant at AWS in Singapore, where he consults with clients in ASEAN to develop robust AI solutions. When not at his desk, he can be found planning his next budget-friendly backpacking trip to check off yet another country from his bucket list.

Monitor your Amazon ES domains with Amazon Elasticsearch Service Monitor

Post Syndicated from Jon Handler original https://aws.amazon.com/blogs/big-data/monitor-your-amazon-es-domains-with-amazon-elasticsearch-service-monitor/

Amazon Elasticsearch Service (Amazon ES) is a fully managed service that you can use to deploy, secure, and run Elasticsearch cost-effectively at scale. The service provides support for open-source Elasticsearch APIs, managed Kibana, and integration with Logstash and other AWS services.

Amazon ES provides a wealth of information about your domain, surfaced through Amazon CloudWatch metrics (for more information, see Instance metrics). Your domain’s dashboard on the AWS Management Console collects key metrics and provides a view of what’s going on with that domain. This view is limited to that single domain, and for a subset of the available metrics. What if you’re running many domains? How can you see all their metrics in one place? You can set CloudWatch alarms at the single domain level, but what about anomaly detection and centralized alerting?

In this post, we detail Amazon Elasticsearch Service Monitor, an open-source monitoring solution for all the domains in your account, across all Regions, backed by a set of AWS CloudFormation templates delivered through the AWS Cloud Development Kit (AWS CDK). The templates deploy an Amazon ES domain in a VPC, an Nginx proxy for Kibana access, and an AWS Lambda function. The function is invoked by CloudWatch Events to pull metrics from all your Amazon ES domains and send them to the previously created monitoring domain for your review.

Your Amazon ES monitoring domain is an ideal way to monitor your Amazon ES infrastructure. We provide dashboards at the account and individual domain level. We also provide basic alerts that you can use as a template to build your own alerting solution.


To bootstrap the solution, you need a few tools in your development environment:

Create and deploy the AWS CDK monitoring tool

Complete the following steps to set up the AWS CDK monitoring tool in your environment. Depending on your operating system, the commands may differ. This walkthrough uses Linux and bash.

Clone the code from the GitHub repo:

# clone the repo
$ git clone https://github.com/aws-samples/amazon-elasticsearch-service-monitor.git
# move to directory
$ cd amazon-elasticsearch-service-monitor

We provide a bash bootstrap script to prepare your environment for running the AWS CDK and deploying the architecture. The bootstrap.sh script is in the amazon-elasticsearch-service-monitor directory. The script creates a Python virtual environment and downloads some further dependencies. It creates an Amazon Elastic Compute Cloud (Amazon EC2) key pair to facilitate accessing Kibana, then adds that key pair to your local SSH setup. Finally, it prompts for an email address where the stack sends alerts. You can edit email_default in the script or enter it at the command line when you run the script. See the following code:

$ bash bootstrap.sh
Collecting astroid==2.4.2
  Using cached astroid-2.4.2-py3-none-any.whl (213 kB)
Collecting attrs==20.3.0
  Using cached attrs-20.3.0-py2.py3-none-any.whl (49 kB)

After the script is complete, enter the Python virtual environment:

$ source .env/bin/activate
(.env) $

Bootstrap the AWS CDK

The AWS CDK creates resources in your AWS account to enable it to track your deployments. You bootstrap the AWS CDK with the bootstrap command:

# bootstrap the cdk
(.env) $ cdk bootstrap aws://yourAccountID/yourRegion

Deploy the architecture

The monitoring_cdk directory collects all the components that enable the AWS CDK to deploy the following architecture.

You can review amazon-elasticsearch-service-monitor/monitoring_cdk/monitoring_cdk_stack.py for further details.

The architecture has the following components:

  • An Amazon Virtual Private Cloud (Amazon VPC) spanning two Amazon EC2 Availability Zones.
  • An Amazon ES cluster with two t3.medium data nodes, one in each Availability Zone, with 100 GB of EBS storage.
  • An Amazon DynamoDB table for tracking the timestamp for the last pull from CloudWatch.
  • A Lambda function to fetch CloudWatch metrics across all Regions and all domains. By default, it fetches the data every 5 minutes, which you can change if needed.
  • An EC2 instance that acts as an SSH tunnel to access Kibana, because our setup is secured and in a VPC.
  • A default Kibana dashboard to visualize metrics across all domains.
  • Default email alerts to the newly launched Amazon ES cluster.
  • An index template and Index State Management (ISM) policy to delete indexes older than 366 days. (You can change this to a different retention period if needed.)
  • A monitoring stack with the option to enable UltraWarm (UW), which is disabled by default. You can change the settings in the monitoring_cdk_stack.py file to enable UW.

The monitoring_cdk_stack.py file contains several constants at the top that let you control the domain configuration, its sizing, and the Regions to monitor. It also specifies the username and password for the admin user of your domain. You should edit and replace those constants with your own values.

For example, the following code indicates which Regions to monitor:

REGIONS_TO_MONITOR='["us-east-1", "us-east-2", "us-west-1", "us-west-2", "af-south-1", "ap-east-1", "ap-south-1", "ap-northeast-1", "ap-northeast-2", "ap-southeast-1", "ap-southeast-2", "ca-central-1", "eu-central-1", "eu-west-1", "eu-west-2", "eu-west-3", "eu-north-1", "eu-south-1", "me-south-1",   "sa-east-1"]'

Run the following command:

(.env)$ cdk deploy

The AWS CDK prompts you to apply security changes; enter y for yes.

After the app is deployed, you get the Kibana URL, user, and password to access Kibana. After you log in, use the following sections to navigate around dashboards and alerts.

After the stack is deployed, you receive an email to confirm the subscription; make sure to confirm the email to start getting the alerts.

Pre-built monitoring dashboards

The monitoring tool comes with pre-built dashboards. To access them, complete the following steps:

  1. Navigate to the IP obtained after deployment.
  2. Log in to Kibana.
    Be sure to use the endpoint you received, provided as an output from the cdk deploy command
  3. In the navigation pane, choose Dashboard.

The Dashboards page displays the default dashboards.

The Domain Metrics At A glance dashboard gives a 360-degree view of all Amazon ES domains across Regions.

The Domain Overview dashboard gives more detailed metrics for a particular domain, to help you deep dive into issues in a specific domain.

Pre-built alerts

The monitoring framework comes with pre-built alerts, as summarized in the following table. These alerts notify you on key resources like CPU, disk space, and JVM. We also provide alerts for cluster status, snapshot failures, and more. You can use the following alerts as a template to create your own alerts and monitoring for search and indexing latencies and volumes, for example.

Alert Type Frequency
Cluster Health – Red 5 Min
Cluster Index Writes Blocked 5 Min
Automated Snapshot Failure 5 Min
JVM Memory Pressure > 80% 5 Min
CPU Utilization > 80% 15 Min
No Kibana Healthy Nodes 15 Min
Invalid Host Header Requests 15 Min
Cluster Health – Yellow 30 Min

Clean up

To clean up the stacks, destroy the monitoring-cdk stack; all other stacks are torn down due to dependencies:

# Enter into python virtual environment
$ source .env/bin/activate
(.env)$ cdk destroy

CloudWatch logs need to be removed separately.


Running this solution incurs charges of less than $10 per day for one domain, with an additional $2 per day for each additional domain.


In this post, we discussed Amazon Elasticsearch Service Monitor, an open-source monitoring solution for all the domains in your account, across all Regions. Amazon ES monitoring domains are an ideal way to monitor your Amazon ES infrastructure. Try it out and leave your thoughts in the comments.

About the Authors

Jon Handler (@_searchgeek) is a Principal Solutions Architect at Amazon Web Services based in Palo Alto, CA. Jon works closely with the CloudSearch and Elasticsearch teams, providing help and guidance to a broad range of customers who have search workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included four years of coding a large-scale, eCommerce search engine.




Prashant Agrawal is a Specialist Solutions Architect at Amazon Web Services based in Seattle, WA.. Prashant works closely with Amazon Elasticsearch team, helping customers migrate their workloads to the AWS Cloud. Before joining AWS, Prashant helped various customers use Elasticsearch for their search and analytics use cases.

Getting started with Trace Analytics in Amazon Elasticsearch Service

Post Syndicated from Jeff Wright original https://aws.amazon.com/blogs/big-data/getting-started-with-trace-analytics-in-amazon-elasticsearch-service/

Trace Analytics is now available for Amazon Elasticsearch Service (Amazon ES) domains running versions 7.9 or later. Developers and IT Ops teams can use this feature to troubleshoot performance and availability issues in their distributed applications. It provides end-to-end insights that are not possible with traditional methods of collecting logs and metrics from each component and service individually.

This feature provides a mechanism to ingest OpenTelemetry-standard trace data to be visualized and explored in Kibana. Trace Analytics introduces two new components that fit into the OpenTelemetry and Amazon ES ecosystems:

  • Data Prepper: A server-side application that collects telemetry data and transforms it for Amazon ES.
  • Trace Analytics Kibana plugin: A plugin that provides at-a-glance visibility into your application performance and the ability to drill down on individual traces. The plugin relies on trace data collected and transformed by Data Prepper.

Here is a component overview:

Here is a component overview:

Applications are instrumented with OpenTelemetry instrumentation, which emit trace data to OpenTelemetry Collectors. Collectors can be run as agents on Amazon EC2, as sidecars for Amazon ECS, or as sidecars or DaemonSets for Amazon EKS. They are configured to export traces to Data Prepper, which transforms the data and writes it to Amazon ES. The Trace Analytics Kibana plugin can then be used to visualize and detect problems in your distributed applications.

OpenTelemetry is a Cloud Native Computing Foundation (CNCF) project that aims to define an open standard for the collection of telemetry data. Using an OpenTelemetry Collector in your service environment allows you to ingest trace data from a other projects like Jaeger, Zipkin, and more. As of version 0.7.1, Data Prepper is currently an alpha release. It is a monolithic, vertically scaling component. Work on the next version is underway. It will support more features, including horizontal scaling.

In this blog post, we cover:

  • Launching Data Prepper to send trace data to your Amazon ES domain.
  • Configuring an OpenTelemetry Collector to send trace data to Data Prepper.
  • Exploring the Kibana Trace Analytics plugin using a sample application.


To get started, you need:

  • An Amazon ES domain running version 7.9 or later.
    • An IAM role for EC2 that has been added to the domain’s access policy. For information, see Create an IAM role in the Amazon EC2 User Guide for Linux Instances.
  • This CloudFormation template, which you use in the walkthrough. Be sure to download it now.
  • An SSH key pair to be deployed to a new EC2 instance.

Deploy to EC2 with CloudFormation

Use the CloudFormation template to deploy Data Prepper to EC2.

  1. Open the AWS CloudFormation console, and choose Create stack.
  2. In Specify template, choose Upload a template file, and then upload the CloudFormation template.
  3. All fields on the Specify stack details page are required. Although you can use the defaults for most fields, enter your values for the following:
    • AmazonEsEndpoint
    • AmazonEsRegion
    • AmazonEsSubnetId (if your Amazon ES domain is in a VPC)
    • IAMRole
    • KeyName

The InstanceType parameter allows you to specify the size of the EC2 instance that will be created. For recommendations on instance sizing by workload, see Right Sizing: Provisioning Instances to Match Workloads, and the Scaling and Tuning guide of the Data Prepper repository.

It should take about three minutes to provision the stack. Data Prepper starts during the CloudFormation deployment. To view output logs, use SSH to connect to the EC2 host and then inspect the /var/log/data-prepper.out file.

Configure OpenTelemetry Collector

Now that Data Prepper is running on an EC2 instance, you can send trace data to it by running an OpenTelemetry Collector in your service environment. For information about installation, see Getting Started in the OpenTelemetry documentation. Make sure that the Collector is configured with an exporter that points to the address of the Data Prepper host. The following otel-collector-config.yaml example receives data from various sources and exports it to Data Prepper.


    endpoint: <data-prepper-address>:21890
    insecure: true

      receivers: [jaeger, otlp, zipkin]
      exporters: [otlp/data-prepper]

Be sure to allow traffic to port 21890 on the EC2 instance. You can do this by adding an inbound rule to the instance’s security group.

Explore the Trace Analytics Kibana plugin by using a sample application

If you don’t have an OpenTelemetry Collector running and would like to send sample data to your Data Prepper instance to try out the trace analytics dashboard, you can quickly set up an instance of the Jaeger Hot R.O.D. application on the EC2 instance with Docker Compose. Our setup script creates three containers on the EC2 instance:

  • Jaeger Hot R.O.D.: The example application to generate trace data.
  • Jaeger Agent: A network daemon that batches trace spans and sends them to the Collector.
  • OpenTelemetry Collector: A vendor-agnostic executable capable of receiving, processing, and exporting telemetry data.

Although your application, the OpenTelemetry Collectors, and Data Prepper instances typically wouldn’t reside on the same host in a real production environment, for simplicity and cost, we use one EC2 instance.

To start the sample application

  1. Use SSH to connect to the EC2 instance using the private key specified in the CloudFormation stack.
    1. When connecting, add a tunnel to port 8080 (the Hot R.O.D. container accepts connections from localhost only). You can do this by adding -L 8080:localhost:8080 to your SSH command.
  2. Download the setup script by running:
    wget https://raw.githubusercontent.com/opendistro-for-elasticsearch/data-prepper/master/examples/aws/jaeger-hotrod-on-ec2/setup-jaeger-hotrod.sh

  3. Run the script with sh setup-jaeger-hotrod.sh.
  4. Visit http://localhost:8080/ to access the Hot R.O.D. dashboard and start sending trace data!

Figure 2: Hot R.O.D. Rides on Demand

  1. After generating sample data with the Hot R.O.D. application, navigate to your Kibana dashboard and from the left navigation pane, choose Trace Analytics. The Dashboard view groups traces together by HTTP method and path so that you can see the average latency, error rate, and trends associated with an operation.

Figure 3: Dashboard page

  1. For a more focused view, choose Traces to drill down into a specific trace.

Figure 4: Traces page

  1. Choose Services to view all services in the application and an interactive map that shows how the various services connect to each other.

Figure 5: Services pageConclusion

Trace Analytics adds to the existing log analytics capabilities of Amazon ES, enabling developers to isolate sources of performance problems and diagnose root causes in their distributed applications. We encourage you to start sending your trace data to Amazon ES so you can benefit from Trace Analytics today.

About the Authors

Jeff Wright is a Software Development Engineer at Amazon Web Services where he works on the Search Services team. His interests are designing and building robust, scalable distributed applications. Jeff is a contributor to Open Distro for Elasticsearch.



Kowshik Nagarajaan is a Software Development Engineer at Amazon Web Services where he works on the Search Services team. His interests are building and automating distributed analytics applications. Kowshik is a contributor to Open Distro for Elasticsearch.



Anush Krishnamurthy is an Engineering Manager working on the Search Services team at Amazon Web Services.

Masking field values with Amazon Elasticsearch Service

Post Syndicated from Prashant Agrawal original https://aws.amazon.com/blogs/security/masking-field-values-with-amazon-elasticsearch-service/

Amazon Elasticsearch Service (Amazon ES) is a fully managed service that you can use to deploy, secure, and run Elasticsearch cost-effectively at scale. The service provides support for open-source Elasticsearch APIs, managed Kibana, and integration with Logstash and other AWS services. Amazon ES provides a deep security model that spans many layers of interaction and supports fine-grained access control at the cluster, index, document, and field level, on a per-user basis. The service’s security plugin integrates with federated identity providers for Kibana login.

A common use case for Amazon ES is log analytics. Customers configure their applications to store log data to the Elasticsearch cluster, where the data can be queried for insights into the functionality and use of the applications over time. In many cases, users reviewing those insights should not have access to all the details from the log data. The log data for a web application, for example, might include the source IP addresses of incoming requests. Privacy rules in many countries require that those details be masked, wholly or in part. This post explains how to set up field masking within your Amazon ES domain.

Field masking is an alternative to field-level security that lets you anonymize the data in a field rather than remove it altogether. When creating a role, add a list of fields to mask. Field masking affects whether you can see the contents of a field when you search. You can use field masking to either perform a random hash or pattern-based substitution of sensitive information from users, who shouldn’t have access to that information.

When you use field masking, Amazon ES creates a hash of the actual field values before returning the search results. You can apply field masking on a per-role basis, supporting different levels of visibility depending on the identity of the user making the query. Currently, field masking is only available for string-based fields. A search result with a masked field (clientIP) looks like this:

  "_index": "web_logs",
  "_type": "_doc",
  "_id": "1",
  "_score": 1,
  "_source": {
    "agent": "Mozilla/5.0 (X11; Linux x86_64; rv:6.0a1) Gecko/20110421 Firefox/6.0a1",
    "bytes": 0,
    "clientIP": "7e4df8d4df7086ee9c05efe1e21cce8ff017a711ee9addf1155608ca45d38219",
    "host": "www.example.com",
    "extension": "txt",
    "geo": {
      "src": "EG",
      "dest": "CN",
      "coordinates": {
        "lat": 35.98531194,
        "lon": -85.80931806
    "machine": {
      "ram": 17179869184,
      "os": "win 7"

To follow along in this post, make sure you have an Amazon ES domain with Elasticsearch version 6.7 or higher, sample data loaded (this example uses the web logs data supplied by Kibana), and access to Kibana through a role with administrator privileges for the domain.

Configure field masking

Field masking is managed by defining specific access controls within the Kibana visualization system. You’ll need to create a new Kibana role, define the fine-grained access-control privileges for that role, specify which fields to mask, and apply that role to specific users.

You can use either the Kibana console or direct-to-API calls to set up field masking. In our first example, we’ll use the Kibana console.

To configure field masking in the Kibana console

  1. Log in to Kibana, choose the Security pane, and then choose Roles, as shown in Figure 1.

    Figure 1: Choose security roles

    Figure 1: Choose security roles

  2. Choose the plus sign (+) to create a new role, as shown in Figure 2.

    Figure 2: Create role

    Figure 2: Create role

  3. Choose the Index Permissions tab, and then choose Add index permissions, as shown in Figure 3.

    Figure 3: Set index permissions

    Figure 3: Set index permissions

  4. Add index patterns and appropriate permissions for data access. See the Amazon ES documentation for details on configuring fine-grained access control.
  5. Once you’ve set Index Patterns, Permissions: Action Groups, Document Level Security Query, and Include or exclude fields, you can use the Anonymize fields entry to mask the clientIP, as shown in Figure 4.

    Figure 4: Anonymize field

    Figure 4: Anonymize field

  6. Choose Save Role Definition.
  7. Next, you need to create one or more users and apply the role to the new users. Go back to the Security page and choose Internal User Database, as shown in Figure 5.

    Figure 5: Select Internal User Database

    Figure 5: Select Internal User Database

  8. Choose the plus sign (+) to create a new user, as shown in Figure 6.

    Figure 6: Create user

    Figure 6: Create user

  9. Add a username and password, and under Open Distro Security Roles, select the role es-mask-role, as shown in Figure 7.

    Figure 7: Select the username, password, and roles

    Figure 7: Select the username, password, and roles

  10. Choose Submit.

If you prefer, you can perform the same task by using the Amazon ES REST API using Kibana dev tools.

Use the following API to create a role as described in below snippet and shown in Figure 8.

PUT _opendistro/_security/api/roles/es-mask-role
  "cluster_permissions": [],
  "index_permissions": [
      "index_patterns": [
      "dls": "",
      "fls": [],
      "masked_fields": [
      "allowed_actions": [

Sample response:

  "status": "CREATED",
  "message": "'es-mask-role' created."
Figure 8: API to create Role

Figure 8: API to create Role

Use the following API to create a user with the role as described in below snippet and shown in Figure 9.

PUT _opendistro/_security/api/internalusers/es-mask-user
  "password": "xxxxxxxxxxx",
  "opendistro_security_roles": [

Sample response:

  "status": "CREATED",
  "message": "'es-mask-user' created."
Figure 9: API to create User

Figure 9: API to create User

Verify field masking

You can verify field masking by running a simple search query using Kibana dev tools (GET web_logs/_search) and retrieving the data first by using the kibana_user (with no field masking), and then by using the es-mask-user (with field masking) you just created.

Query responses run by the kibana_user (all access) have the original values in all fields, as shown in Figure 10.

Figure 10: Retrieval of the full clientIP data with kibana_user

Figure 10: Retrieval of the full clientIP data with kibana_user

Figure 11, following, shows an example of what you would see if you logged in as the es-mask-user. In this case, the clientIP field is hidden due to the es-mask-role you created.

Figure 11: Retrieval of the masked clientIP data with es-mask-user

Figure 11: Retrieval of the masked clientIP data with es-mask-user

Use pattern-based field masking

Rather than creating a hash, you can use one or more regular expressions and replacement strings to mask a field. The syntax is <field>::/<regular-expression>/::<replacement-string>.

You can use either the Kibana console or direct-to-API calls to set up pattern-based field masking. In the following example, clientIP is masked in such a way that the last three parts of the IP address are masked by xxx using the pattern is clientIP::/[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}$/::xxx.xxx.xxx>. You see only the first part of the IP address, as shown in Figure 12.

Figure 12: Anonymize the field with a pattern

Figure 12: Anonymize the field with a pattern

Run the search query to verify that the last three parts of clientIP are masked by custom characters and only the first part is shown to the requester, as shown in Figure 13.

Figure 13: Retrieval of the masked clientIP (according to the defined pattern) with es-mask-user

Figure 13: Retrieval of the masked clientIP (according to the defined pattern) with es-mask-user


Field level security should be the primary approach for ensuring data access security – however if there are specific business requirements that cannot be met with this approach, then field masking may offer a viable alternative. By using field masking, you can selectively allow or prevent your users from seeing private information such as personally identifying information (PII) or personal healthcare information (PHI). For more information about fine-grained access control, see the Amazon Elasticsearch Service Developer Guide.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the Amazon Elasticsearch Service forum or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.


Prashant Agrawal

Prashant is a Search Specialist Solutions Architect with Amazon Elasticsearch Service. He works closely with team members to help customers migrate their workloads to the cloud. Before joining AWS, he helped various customers use Elasticsearch for their search and analytics use cases.

How to visualize multi-account Amazon Inspector findings with Amazon Elasticsearch Service

Post Syndicated from Moumita Saha original https://aws.amazon.com/blogs/security/how-to-visualize-multi-account-amazon-inspector-findings-with-amazon-elasticsearch-service/

Amazon Inspector helps to improve the security and compliance of your applications that are deployed on Amazon Web Services (AWS). It automatically assesses Amazon Elastic Compute Cloud (Amazon EC2) instances and applications on those instances. From that assessment, it generates findings related to exposure, potential vulnerabilities, and deviations from best practices.

You can use the findings from Amazon Inspector as part of a vulnerability management program for your Amazon EC2 fleet across multiple AWS Regions in multiple accounts. The ability to rank and efficiently respond to potential security issues reduces the time that potential vulnerabilities remain unresolved. This can be accelerated within a single pane of glass for all the accounts in your AWS environment.

Following AWS best practices, in a secure multi-account AWS environment, you can provision (using AWS Control Tower) a group of accounts—known as core accounts, for governing other accounts within the environment. One of the core accounts may be used as a central security account, which you can designate for governing the security and compliance posture across all accounts in your environment. Another core account is a centralized logging account, which you can provision and designate for central storage of log data.

In this blog post, I show you how to:

  1. Use Amazon Inspector, a fully managed security assessment service, to generate security findings.
  2. Gather findings from multiple Regions across multiple accounts using Amazon Simple Notification Service (Amazon SNS) and Amazon Simple Queue Service (Amazon SQS).
  3. Use AWS Lambda to send the findings to a central security account for deeper analysis and reporting.

In this solution, we send the findings to two services inside the central security account:

Solution overview

Overall architecture

The flow of events to implement the solution is shown in Figure 1 and described in the following process flow.

Figure 1: Solution overview architecture

Figure 1: Solution overview architecture

Process flow

The flow of this architecture is divided into two types of processes—a one-time process and a scheduled process. The AWS resources that are part of the one-time process are triggered the first time an Amazon Inspector assessment template is created in each Region of each application account. The AWS resources of the scheduled process are triggered at a designated interval of Amazon Inspector scan in each Region of each application account.

One-time process

  1. An event-based Amazon CloudWatch rule in each Region of every application account triggers a regional AWS Lambda function when an Amazon Inspector assessment template is created for the first time in that Region.

    Note: In order to restrict this event to trigger the Lambda function only the first time an assessment template is created, you must use a specific user-defined tag to trigger the Attach Inspector template to SNS Lambda function for only one Amazon Inspector template per Region. For more information on tags, see the Tagging AWS resources documentation.

  2. The Lambda function attaches the Amazon Inspector assessment template (created in application accounts) to the cross-account Amazon SNS topic (created in the security account). The function, the template, and the topic are all in the same AWS Region.

    Note: This step is needed because Amazon Inspector templates can only be attached to SNS topics in the same account via the AWS Management Console or AWS Command Line Interface (AWS CLI).

Scheduled process

  1. A scheduled Amazon CloudWatch Event in every Region of the application accounts starts the Amazon Inspector scan at a scheduled time interval, which you can configure.
  2. An Amazon Inspector agent conducts the scan on the EC2 instances of the Region where the assessment template is created and sends any findings to Amazon Inspector.
  3. Once the findings are generated, Amazon Inspector notifies the Amazon SNS topic of the security account in the same Region.
  4. The Amazon SNS topics from each Region of the central security account receive notifications of Amazon Inspector findings from all application accounts. The SNS topics then send the notifications to a central Amazon SQS queue in the primary Region of the security account.
  5. The Amazon SQS queue triggers the Send findings Lambda function (as shown in Figure 1) of the security account.

    Note: Each Amazon SQS message represents one Amazon Inspector finding.

  6. The Send findings Lambda function assumes a cross-account role to fetch the following information from all application accounts:
    1. Finding details from the Amazon Inspector API.
    2. Additional Amazon EC2 attributes—VPC, subnet, security group, and IP address—from EC2 instances with potential vulnerabilities.
  7. The Lambda function then sends all the gathered data to a central S3 bucket and a domain in Amazon ES—both in the central security account.

These Amazon Inspector findings, along with additional attributes on the scanned instances, can be used for further analysis and visualization via Kibana—a data visualization dashboard for Amazon ES. Storing a copy of these findings in an S3 bucket gives you the opportunity to forward the findings data to outside monitoring tools that don’t support direct data ingestion from AWS Lambda.


The following resources must be set up before you can implement this solution:

  1. A multi-account structure. To learn how to set up a multi-account structure, see Setting up AWS Control Tower and AWS Landing zone.
  2. Amazon Inspector agents must be installed on all EC2 instances. See Installing Amazon Inspector agents to learn how to set up Amazon Inspector agents on EC2 instances. Additionally, keep note of all the Regions where you install the Amazon Inspector agent.
  3. An Amazon ES domain with Kibana authentication. See Getting started with Amazon Elasticsearch Service and Use Amazon Cognito for Kibana access control.
  4. An S3 bucket for centralized storage of Amazon Inspector findings.
  5. An S3 bucket for storage of the Lambda source code for the solution.

Set up Amazon Inspector with Amazon ES and S3

Follow these steps to set up centralized Amazon Inspector findings with Amazon ES and Amazon S3:

  1. Upload the solution ZIP file to the S3 bucket used for Lambda code storage.
  2. Collect the input parameters for AWS CloudFormation deployment.
  3. Deploy the base template into the central security account.
  4. Deploy the second template in the primary Region of all application accounts to create global resources.
  5. Deploy the third template in all Regions of all application accounts.

Step 1: Upload the solution ZIP file to the S3 bucket used for Lambda code storage

  1. From GitHub, download the file Inspector-to-S3ES-crossAcnt.zip.
  2. Upload the ZIP file to the S3 bucket you created in the central security account for Lambda code storage. This code is used to create the Lambda function in the first CloudFormation stack set of the solution.

Step 2: Collect input parameters for AWS CloudFormation deployment

In this solution, you deploy three AWS CloudFormation stack sets in succession. Each stack set should be created in the primary Region of the central security account. Underlying stacks are deployed across the central security account and in all the application accounts where the Amazon Inspector scan is performed. You can learn more in Working with AWS CloudFormation StackSets.

Before you proceed to the stack set deployment, you must collect the input parameters for the first stack set: Central-SecurityAcnt-BaseTemplate.yaml.

To collect input parameters for AWS CloudFormation deployment

  1. Fetch the account ID (CentralSecurityAccountID) of the AWS account where the stack set will be created and deployed. You can use the steps in Finding your AWS account ID to help you find the account ID.
  2. Values for the ES domain parameters can be fetched from the Amazon ES console.
    1. Open the Amazon ES Management Console and select the Region where the Amazon ES domain exists.
    2. Select the domain name to view the domain details.
    3. The value for ElasticsearchDomainName is displayed on the top left corner of the domain details.
    4. On the Overview tab in the domain details window, select and copy the URL value of the Endpoint to use as the ElasticsearchEndpoint parameter of the template. Make sure to exclude the https:// at the beginning of the URL.

      Figure 2: Details of the Amazon ES domain for fetching parameter values

      Figure 2: Details of the Amazon ES domain for fetching parameter values

  3. Get the values for the S3 bucket parameters from the Amazon S3 console.
    1. Open the Amazon S3 Management Console.
    2. Copy the name of the S3 bucket that you created for centralized storage of Amazon Inspector findings. Save this bucket name for the LoggingS3Bucket parameter value of the Central-SecurityAcnt-BaseTemplate.yaml template.
    3. Select the S3 bucket used for source code storage. Select the bucket name and copy the name of this bucket for the LambdaSourceCodeS3Bucket parameter of the template.

      Figure 3: The S3 bucket where Lambda code is uploaded

      Figure 3: The S3 bucket where Lambda code is uploaded

  4. On the bucket details page, select the source code ZIP file name that you previously uploaded to the bucket. In the detail page of the ZIP file, choose the Overview tab, and then copy the value in the Key field to use as the value for the LambdaCodeS3Key parameter of the template.

    Figure 4: Details of the Lambda code ZIP file uploaded in Amazon S3 showing the key prefix value

    Figure 4: Details of the Lambda code ZIP file uploaded in Amazon S3 showing the key prefix value

Note: All of the other input parameter values of the template are entered automatically, but you can change them during stack set creation if necessary.

Step 3: Deploy the base template into the central security account

Now that you’ve collected the input parameters, you’re ready to deploy the base template that will create the necessary resources for this solution implementation in the central security account.

Prerequisites for CloudFormation stack set deployment

There are two permission modes that you can choose from for deploying a stack set in AWS CloudFormation. If you’re using AWS Organizations and have all features enabled, you can use the service-managed permissions; otherwise, self-managed permissions mode is recommended. To deploy this solution, you’ll use self-managed permissions mode. To run stack sets in self-managed permissions mode, your administrator account and the target accounts must have two IAM roles—AWSCloudFormationStackSetAdministrationRole and AWSCloudFormationStackSetExecutionRole—as prerequisites. In this solution, the administrator account is the central security account and the target accounts are application accounts. You can use the following CloudFormation templates to create the necessary IAM roles:

To deploy the base template

  1. Download the base template (Central-SecurityAcnt-BaseTemplate.yaml) from GitHub.
  2. Open the AWS CloudFormation Management Console and select the Region where all the stack sets will be created for deployment. This should be the primary Region of your environment.
  3. Select Create StackSet.
    1. In the Create StackSet window, select Template is ready and then select Upload a template file.
    2. Under Upload a template file, select Choose file and select the Central-SecurityAcnt-BaseTemplate.yaml template that you downloaded earlier.
    3. Choose Next.
  4. Add stack set details.
    1. Enter a name for the stack set in StackSet name.
    2. Under Parameters, most of the values are pre-populated except the values you collected in the previous procedure for CentralSecurityAccountID, ElasticsearchDomainName, ElasticsearchEndpoint, LoggingS3Bucket, LambdaSourceCodeS3Bucket, and LambdaCodeS3Key.
    3. After all the values are populated, choose Next.
  5. Configure StackSet options.
    1. (Optional) Add tags as described in the prerequisites to apply to the resources in the stack set that these rules will be deployed to. Tagging is a recommended best practice, because it enables you to add metadata information to resources during their creation.
    2. Under Permissions, choose the Self service permissions mode to be used for deploying the stack set, and then select the AWSCloudFormationStackSetAdministrationRole from the dropdown list.

      Figure 5: Permission mode to be selected for stack set deployment

      Figure 5: Permission mode to be selected for stack set deployment

    3. Choose Next.
  6. Add the account and Region details where the template will be deployed.
    1. Under Deployment locations, select Deploy stacks in accounts. Under Account numbers, enter the account ID of the security account that you collected earlier.

      Figure 6: Values to be provided during the deployment of the first stack set

      Figure 6: Values to be provided during the deployment of the first stack set

    2. Under Specify regions, select all the Regions where the stacks will be created. This should be the list of Regions where you installed the Amazon Inspector agent. Keep note of this list of Regions to use in the deployment of the third template in an upcoming step.
      • Though an Amazon Inspector scan is performed in all the application accounts, the regional Amazon SNS topics that send scan finding notifications are created in the central security account. Therefore, this template is created in all the Regions where Amazon Inspector will notify SNS. The template has the logic needed to handle the creation of specific AWS resources only in the primary Region, even though the template executes in many Regions.
      • The order in which Regions are selected under Specify regions defines the order in which the stack is deployed in the Regions. So you must make sure that the primary Region of your deployment is the first one specified under Specify regions, followed by the other Regions of stack set deployment. This is required because global resources are created using one Region—ideally the primary Region—and so stack deployment in that Region should be done before deployment to other Regions in order to avoid any build dependencies.

        Figure 7: Showing the order of specifying the Regions of stack set deployment

        Figure 7: Showing the order of specifying the Regions of stack set deployment

  7. Review the template settings and select the check box to acknowledge the Capabilities section. This is required if your deployment template creates IAM resources. You can learn more at Controlling access with AWS Identity and Access Management.

    Figure 8: Acknowledge IAM resources creation by AWS CloudFormation

    Figure 8: Acknowledge IAM resources creation by AWS CloudFormation

  8. Choose Submit to deploy the stack set.

Step 4: Deploy the second template in the primary Region of all application accounts to create global resources

This template creates the global resources required for sending Amazon Inspector findings to Amazon ES and Amazon S3.

To deploy the second template

  1. Download the template (ApplicationAcnts-RolesTemplate.yaml) from GitHub and use it to create the second CloudFormation stack set in the primary Region of the central security account.
  2. To deploy the template, follow the steps used to deploy the base template (described in the previous section) through Configure StackSet options.
  3. In Set deployment options, do the following:
    1. Under Account numbers, enter the account IDs of your application accounts as comma-separated values. You can use the steps in Finding your AWS account ID to help you gather the account IDs.
    2. Under Specify regions, select only your primary Region.

      Figure 9: Select account numbers and specify Regions

      Figure 9: Select account numbers and specify Regions

  4. The remaining steps are the same as for the base template deployment.

Step 5: Deploy the third template in all Regions of all application accounts

This template creates the resources in each Region of all application accounts needed for scheduled scanning of EC2 instances using Amazon Inspector. Notifications are sent to the SNS topics of each Region of the central security account.

To deploy the third template

  1. Download the template InspectorRun-SetupTemplate.yaml from GitHub and create the final AWS CloudFormation stack set. Similar to the previous stack sets, this one should also be created in the central security account.
  2. For deployment, follow the same steps you used to deploy the base template through Configure StackSet options.
  3. In Set deployment options:
    1. Under Account numbers, enter the same account IDs of your application accounts (comma-separated values) as you did for the second template deployment.
    2. Under Specify regions, select all the Regions where you installed the Amazon Inspector agent.

      Note: This list of Regions should be the same as the Regions where you deployed the base template.

  4. The remaining steps are the same as for the second template deployment.

Test the solution and delivery of the findings

After successful deployment of the architecture, to test the solution you can wait until the next scheduled Amazon Inspector scan or you can use the following steps to run the Amazon Inspector scan manually.

To run the Amazon Inspector scan manually for testing the solution

  1. In any one of the application accounts, go to any Region where the Amazon Inspector scan was performed.
  2. Open the Amazon Inspector console.
  3. In the left navigation menu, select Assessment templates to see the available assessments.
  4. Choose the assessment template that was created by the third template.
  5. Choose Run to start the assessment immediately.
  6. When the run is complete, Last run status changes from Collecting data to Analysis Complete.

    Figure 10: Amazon Inspector assessment run

    Figure 10: Amazon Inspector assessment run

  7. You can see the recent scan findings in the Amazon Inspector console by selecting Assessment runs from the left navigation menu.

    Figure 11: The assessment run indicates total findings from the last Amazon Inspector run in this Region

    Figure 11: The assessment run indicates total findings from the last Amazon Inspector run in this Region

  8. In the left navigation menu, select Findings to see details of each finding, or use the steps in the following section to verify the delivery of findings to the central security account.

Test the delivery of the Amazon Inspector findings

This solution delivers the Amazon Inspector findings to two AWS services—Amazon ES and Amazon S3—in the primary Region of the central security account. You can either use Kibana to view the findings sent to Amazon ES or you can use the findings sent to Amazon S3 and forward them to the security monitoring software of your preference for further analysis.

To check whether the findings are delivered to Amazon ES

  1. Open the Amazon ES Management Console and select the Region where the Amazon ES domain is located.
  2. Select the domain name to view the domain details.
  3. On the domain details page, select the Kibana URL.

    Figure 12: Amazon ES domain details page

    Figure 12: Amazon ES domain details page

  4. Log in to Kibana using your preferred authentication method as set up in the prerequisites.
    1. In the left panel, select Discover.
    2. In the Discover window, select a Region to view the total number of findings in that Region.

      Figure 13: The total findings in Kibana for the chosen Region of an application account

      Figure 13: The total findings in Kibana for the chosen Region of an application account

To check whether the findings are delivered to Amazon S3

  1. Open the Amazon S3 Management Console.
  2. Select the S3 bucket that you created for storing Amazon Inspector findings.
  3. Select the bucket name to view the bucket details. The total number of findings for the chosen Region is at the top right corner of the Overview tab.

    Figure 14: The total security findings as stored in an S3 bucket for us-east-1 Region

    Figure 14: The total security findings as stored in an S3 bucket for us-east-1 Region

Visualization in Kibana

The data sent to the Amazon ES index can be used to create visualizations in Kibana that make it easier to identify potential security gaps and plan the remediation accordingly.

You can use Kibana to create a dashboard that gives an overview of the potential vulnerabilities identified in different instances of different AWS accounts. Figure 15 shows an example of such a dashboard. The dashboard can help you rank the need for remediation based on criteria such as:

  • The category of vulnerability
  • The most impacted AWS accounts
  • EC2 instances that need immediate attention
Figure 15: A sample Kibana dashboard showing findings from Amazon Inspector

Figure 15: A sample Kibana dashboard showing findings from Amazon Inspector

You can build additional panels to visualize details of the vulnerability findings identified by Amazon Inspector, such as the CVE ID of the security vulnerability, its description, and recommendations on how to remove the vulnerabilities.

Figure 16: A sample Kibana dashboard panel listing the top identified vulnerabilities and their details

Figure 16: A sample Kibana dashboard panel listing the top identified vulnerabilities and their details


By using this solution to combine Amazon Inspector, Amazon SNS topics, Amazon SQS queues, Lambda functions, an Amazon ES domain, and S3 buckets, you can centrally analyze and monitor the vulnerability posture of EC2 instances across your AWS environment, including multiple Regions across multiple AWS accounts. This solution is built following least privilege access through AWS IAM roles and policies to help secure the cross-account architecture.

In this blog post, you learned how to send the findings directly to Amazon ES for visualization in Kibana. These visualizations can be used to build dashboards that security analysts can use for centralized monitoring. Better monitoring capability helps analysts to identify potentially vulnerable assets and perform remediation activities to improve security of your applications in AWS and their underlying assets. This solution also demonstrates how to store the findings from Amazon Inspector in an S3 bucket, which makes it easier for you to use those findings to create visualizations in your preferred security monitoring software.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.


Moumita Saha

Moumita is a Security Consultant with AWS Professional Services working to help enterprise customers secure their workloads in the cloud. She assists customers in secure cloud migration, designing automated solutions to protect against cyber threats in the cloud. She is passionate about cyber security, data privacy, and new, emerging cloud-security technologies.

Using pipes to explore, discover and find data in Amazon ES with Piped Processing Language

Post Syndicated from Viraj Phanse original https://aws.amazon.com/blogs/big-data/using-pipes-to-explore-discover-and-find-data-in-amazon-es-with-piped-processing-language/

System developers, DevOps engineers, support engineers, site reliability engineers (SREs), and IT managers make sure that the underlying infrastructure powering the applications and systems within an organization is available, reliable, secure, and scalable. To achieve these goals, you need to perform a fast and deep analysis on the underlying logs, monitoring, and observability data. Amazon Elasticsearch Service (Amazon ES) is a popular choice to store and analyze such data. However, extracting insights from Elasticsearch isn’t easy. Although Query DSL (the language used to query data stored in Elasticsearch) is powerful, it has a steep learning curve, and wasn’t designed as a human interface to easily create one-time queries and explore user data.

In this post, we discuss the newly supported Piped Processing Language (PPL) feature, powered by Open Distro for Elasticsearch, which enables you to form complex queries and quickly explore and discover data with the help of pipes.

What is Piped Processing Language?

Piped Processing Language is powered by Open Distro for Elasticsearch, an Apache 2.0-licensed distribution of Elasticsearch. PPL enables you to explore, discover, and find data stored in Elasticsearch, using a set of commands delimited by pipes ( | ).

Pipes allow you to combine two or more commands as a chain, such that the output of one command acts as an input for the next command, very similar to Unix pipes. With PPL, you can now search for keywords and feed the results from the command on the left of the pipe to the command on the right of the pipe, effectively creating a command pipeline.

Use case

As an illustration, consider a use case where you want to find out the number of hosts that are responding with HTTP 404 (Page not found) and HTTP 503 (Server Unavailability) errors, aggregate the error responses per host, and sort in the order of impact.

Using Query DSL

When you use Query DSL, the query looks similar to the following code:

GET kibana_sample_data_logs/_search

The following screenshot shows the query results.


Using PPL

You can replace the entire DSL query with a single PPL command:

source = kibana_sample_data_logs | where response='404' or response='503' | stats count(request) as request_count by host, response | sort -request_count

The following screenshot shows the query results.

Commands and functions supported by PPL

PPL supports a comprehensive set of commands, including search, where, field, rename, dedup, sort, stats, eval, head, top, and rare. These commands are read-only requests to process data and return results. The following table summarizes the purpose of each command.

Command What does it do? Example Result
search source Retrieves documents from the index. The keyword search can be ignored. source=accounts; Retrieves all documents from the accounts index.
field Keeps or removes fields from the search result. source=accounts | fields account_number, firstname, lastname; Gets account_number, firstname, and lastname fields from the search result.
dedup Removes duplicate documents defined by a field from the search result. source=accounts | dedup gender | fields account_number, gender; Removes duplicate documents with the same gender.
stats Aggregates the search results using sum, count, min, max, and avg. source=accounts | stats avg(age); Calculates the average age of all accounts.
eval Evaluates an expression and appends its result to the search result. search source=accounts | eval doubleAge = age * 2 | fields age, doubleAge; Creates a new doubleAge field for each document that is age * 2.
head Returns the first N number of results in a specified search order. search source=accounts | fields firstname, age | head; Fetches the first 10 results.
top Finds the most common values of all fields in the field list. search source=accounts | top gender; Finds the most common value of gender.
rare Finds the least common values of all fields in a field list. search source=accounts | rare gender; Finds the least common value of gender.
where Filters the search result. search source=accounts | where account_number=1 or gender="F" | fields account_number, gender; Gets all the documents from the account index.
rename Renames one or more fields in a search result. search source=accounts | rename account_number as an | fields acc; Renames the account field as acc.
sort Sorts results in a specified field. search source=accounts | sort age | fields account_number, age; Sorts all documents by age field in ascending order.

PPL also supports functions including date-time, mathematical, string, aggregate, and trigonometric, and operators and expressions.


Piped Processing Language, powered by Open Distro for Elasticsearch, has a comprehensive set of commands and functions that enable you to quickly begin extracting insights from your data in Elasticsearch. It’s supported on all Amazon ES domains running Elasticsearch 7.9 or greater. PPL also expands the capabilities of the Query Workbench in Kibana in addition to SQL. For more information, see Piped Processing Language.

About the Author

Viraj Phanse is a product management leader at Amazon Web Services for Search Services/Analytics. An avid foodie, he loves trying cuisines from around the globe. In his free time, he loves to play his keyboard and travel.

Get started with fine-grained access control in Amazon Elasticsearch Service

Post Syndicated from Jon Handler original https://aws.amazon.com/blogs/security/get-started-with-fine-grained-access-control-in-amazon-elasticsearch-service/

Amazon Elasticsearch Service (Amazon ES) provides fine-grained access control, powered by the Open Distro for Elasticsearch security plugin. The security plugin adds Kibana authentication and access control at the cluster, index, document, and field levels that can help you secure your data. You now have many different ways to configure your Amazon ES domain to provide access control. In this post, I offer basic configuration information to get you started.

Figure 1: A high-level view of data flow and security

Figure 1: A high-level view of data flow and security

Figure 1 details the authentication and access control provided in Amazon ES. The left half of the diagram details the different methods of authenticating. Looking horizontally, requests originate either from Kibana or directly access the REST API. When using Kibana, you can use a login screen powered by the Open Distro security plugin, your SAML identity provider, or Amazon Cognito. Each of these methods results in an authenticated identity: SAML providers via the response, Amazon Cognito via an AWS Identity and Access Management (IAM) identity, and Open Distro via an internal user identity. When you use the REST API, you can use AWS Signature V4 request signing (SigV4 signing), or user name and password authentication. You can also send unauthenticated traffic, but your domain should be configured to reject all such traffic.

The right side of the diagram details the access control points. You can consider the handling of access control in two phases to better understand it—authentication at the edge by IAM and authentication in the Amazon ES domain by the Open Distro security plugin.

First, requests from Kibana or direct API calls have to reach your domain endpoint. If you follow best practices and the domain is in an Amazon Virtual Private Cloud (VPC), you can use Amazon Elastic Compute Cloud (Amazon EC2) security groups to allow or deny traffic based on the originating IP address or security group of the Amazon EC2 instances. Best practice includes least privilege based on subnet ACLs and security group ingress and egress restrictions. In this post, we assume that your requests are legitimate, meet your access control criteria, and can reach your domain.

When a request reaches the domain endpoint—the edge of your domain—, it can be anonymous or it can carry identity and authentication information as described previously. Each Amazon ES domain carries a resource-based IAM policy. With this policy, you can allow or deny traffic based on an IAM identity attached to the request. When your policy specifies an IAM principal, Amazon ES evaluates the request against the allowed Actions in the policy and allows or denies the request. If you don’t have an IAM identity attached to the request (SAML assertion, or user name and password) you should leave the domain policy open and pass traffic through to fine-grained access control in Amazon ES without any checks. You should employ IAM security best practices and add additional IAM restrictions for direct-to-API access control once your domain is set up.

The Open Distro for Elasticsearch security plugin has its own internal user database for user name and password authentication and handles access control for all users. When traffic reaches the Elasticsearch cluster, the plugin validates any user name and password authentication information against this internal database to identify the user and grant a set of permissions. If a request comes with identity information from either SAML or an IAM role, you map that backend role onto the roles or users that you have created in Open Distro security.

Amazon ES documentation and the Open Distro for Elasticsearch documentation give more information on all of these points. For this post, I walk through a basic console setup for a new domain.

Console set up

The Amazon ES console provides a guided wizard that lets you configure—and reconfigure—your Amazon ES domain. Step 1 offers you the opportunity to select some predefined configurations that carry through the wizard. In step 2, you choose the instances to deploy in your domain. In Step 3, you configure the security. This post focuses on step 3. See also these tutorials that explain using an IAM master user and using an HTTP-authenticated master user.

Note: At the time of writing, you cannot enable fine-grained access control on existing domains; you must create a new domain and enable the feature at domain creation time. You can use fine-grained access control with Elasticsearch versions 6.8 and later.

Set your endpoint

Amazon ES gives you a DNS name that resolves to an IP address that you use to send traffic to the Elasticsearch cluster in the domain. The IP address can be in the IP space of the public internet, or it can resolve to an IP address in your VPC. While—with fine-grained access control—you have the means of securing your cluster even when the endpoint is a public IP address, we recommend using VPC access as the more secure option. Shown in Figure 2.

Figure 2: Select VPC access

Figure 2: Select VPC access

With the endpoint in your VPC, you use security groups to control which ports accept traffic and limit access to the endpoints of your Amazon ES domain to IP addresses in your VPC. Make sure to use least privilege when setting up security group access.

Enable fine-grained access control

You should enable fine-grained access control. Shown in Figure 3.

Figure 3: Enabled fine-grained access control

Figure 3: Enabled fine-grained access control

Set up the master user

The master user is the administrator identity for your Amazon ES domain. This user can set up additional users in the Amazon ES security plugin, assign roles to them, and assign permissions for those roles. You can choose user name and password authentication for the master user, or use an IAM identity. User name and password authentication, shown in Figure 4, is simpler to set up and—with a strong password—may provide sufficient security depending on your use case. We recommend you follow your organization’s policy for password length and complexity. If you lose this password, you can return to the domain’s dashboard in the AWS Management Console and reset it. You’ll use these credentials to log in to Kibana. Following best practices on choosing your master user, you should move to an IAM master user once setup is complete.

Note: Password strength is a function of length, complexity of characters (e.g., upper and lower case letters, numbers, and special characters), and unpredictability to decrease the likelihood the password could be guessed or cracked over a period of time.


Figure 4: Setting up the master username and password

Figure 4: Setting up the master username and password

Do not enable Amazon Cognito authentication

When you use Kibana, Amazon ES includes a login experience. You currently have three choices for the source of the login screen:

  1. The Open Distro security plugin
  2. Amazon Cognito
  3. Your SAML-compliant system

You can apply fine-grained access control regardless of how you log in. However, setting up fine-grained access control for the master user and additional users is most straightforward if you use the login experience provided by the Open Distro security plugin. After your first login, and when you have set up additional users, you should migrate to either Cognito or SAML for login, taking advantage of the additional security they offer. To use the Open Distro login experience, disable Amazon Cognito authentication, as shown in Figure 5.

Figure 5: Amazon Cognito authentication is not enabled

Figure 5: Amazon Cognito authentication is not enabled

If you plan to integrate with your SAML identity provider, check the Prepare SAML authentication box. You will complete the set up when the domain is active.

Figure 6: Choose Prepare SAML authentication if you plan to use it

Figure 6: Choose Prepare SAML authentication if you plan to use it

Use an open access policy

When you create your domain, you attach an IAM policy to it that controls whether your traffic must be signed with AWS SigV4 request signing for authentication. Policies that specify an IAM principal require that you use AWS SigV4 signing to authenticate those requests. The domain sends your traffic to IAM, which authenticates signed requests to resolve the user or role that sent the traffic. The domain and IAM apply the policy access controls and either accept the traffic or reject it based on the commands. This is done down to the index level for single-index API calls.

When you use fine-grained access control, your traffic is also authenticated by the Amazon ES security plugin, which makes the IAM authentication redundant. Create an open access policy, as shown in Figure 7, which doesn’t specify a principal and so doesn’t require request signing. This may be acceptable, since you can choose to require an authenticated identity on all traffic. The security plugin authenticates the traffic as above, providing access control based on the internal database.

Figure 7: Selected open access policy

Figure 7: Selected open access policy

Encrypted data

Amazon ES provides an option to encrypt data in transit and at rest for any domain. When you enable fine-grained access control, you must use encryption with the corresponding checkboxes automatically checked and not changeable. These include Transport Layer Security (TLS) for requests to the domain and for traffic between nodes in the domain, and encryption of data at rest through AWS Key Management Service (KMS). Shown in Figure 8.

Figure 8: Enabled encryption

Figure 8: Enabled encryption

Accessing Kibana

When you complete the domain creation wizard, it takes about 10 minutes for your domain to activate. Return to the console and the Overview tab of your Amazon ES dashboard. When the Domain Status is Active, select the Kibana URL. Since you created your domain in your VPC, you must be able to access the Kibana endpoint via proxy, VPN, SSH tunnel, or similar. Use the master user name and password that you configured earlier to log in to Kibana, as shown in Figure 9. As detailed above, you should only ever log in as the master user to set up additional users—administrators, users with read-only access, and others.

Figure 9: Kibana login page

Figure 9: Kibana login page


Congratulations, you now know the basic steps to set up the minimum configuration to access your Amazon ES domain with a master user. You can examine the settings for fine-grained access control in the Kibana console Security tab. Here, you can add additional users, assign permissions, map IAM users to security roles, and set up your Kibana tenancy. We’ll cover those topics in future posts.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the Amazon Elasticsearch Service forum or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.


Jon Handler

Jon is a Principal Solutions Architect at AWS. He works closely with the CloudSearch and Elasticsearch teams, providing help and guidance to a broad range of customers who have search workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included four years of coding a large-scale, eCommerce search engine. Jon holds a Bachelor of the Arts from the University of Pennsylvania, and a Master of Science and a Ph. D. in Computer Science and Artificial Intelligence from Northwestern University.


Sajeev Attiyil Bhaskaran

Sajeev is a Senior Cloud Engineer focused on big data and analytics. He works with AWS customers to provide architectural and engineering assistance and guidance. He dives deep into big data technologies and streaming solutions. He also does onsite and online sessions for customers to design best solutions for their use cases. In his free time, he enjoys spending time with his wife and daughter.

A deep dive into high-cardinality anomaly detection in Elasticsearch

Post Syndicated from Kaituo Li original https://aws.amazon.com/blogs/big-data/a-deep-dive-into-high-cardinality-anomaly-detection-in-elasticsearch/

In May 2020, we announced the general availability of real-time anomaly detection for Elasticsearch. With that release we leveraged the Random Cut Forest (RCF) algorithm to identify anomalous behaviors in the multi-dimensional data streams generated by Elasticsearch queries. We focused on aggregation first, to enable our users to quickly and accurately detect anomalies in their data streams. However, consider the example data in the following table.

timestamp avg. latency region
12:00 pm 3.1 Seattle
12:00 pm 4.1 New York
12:00 pm 5.9 Berlin
12:01 pm 2.6 Seattle
12:01 pm 5.3 New York
12:01 pm 5.8 Berlin

The data consists of one data field, avg. latency, and one attribute or categorical field, time. If we want to perform anomaly detection on this data, we could take the following strategy:

  1. Separate the data by the region attribute to create a separate data stream entity for each region.
  2. Construct an anomaly detector for each entity. If the cardinality of the region attribute, that is the number of possible choices of region value, is small, we can create separate anomaly detectors by filtering on each possible value of region.

But what if the cardinality of this attribute is large? Or what if the set of possible values changes over time, such as a source IP address or product ID? The existing anomaly detection tool doesn’t scale well in this situation.

We define the high-cardinality anomaly detection (HCAD) problem as performing anomaly detection on a data stream where individual entities in the stream are defined by a choice of attribute. In this use case, our goal is to perform anomaly detection on each data stream defined by a particular choice of region. That is, the Seattle region produces its own latency data stream, as well as the New York and Berlin regions.

In this post, we dive into the motivation, design, and development of the HCAD capability. We begin with an in-depth description of the HCAD problem and its properties. We then share the details of our solution and the challenges and questions we encountered during our research and development. Finally, we describe the system and architecture of our solution, especially the components tackling scalability concerns.

High-cardinality anomaly detection

In this section, we elaborate on the definition of the HCAD problem. As described earlier, we can think of HCAD as a way to produce multiple data streams by defining each data stream by a particular choice of attribute. For each data stream, we want to perform streaming anomaly detection as usual—we want to detect anomalies relative to that individual data stream’s own history.

We define an individual entity by fixing a particular value of one or more attributes and aggregating values over all remaining attributes. In the language of table manipulation in SQL, each entity is defined using GROUPBY on an attribute. Specifically, a group of entities is defined by selecting one or more attributes, where each entity is given by the data fields for each particular value of those attributes. For example, applying GROUPBY to the region attribute in the preceding example data produces three data stream entities: one for Seattle, one for New York, and one for Berlin. Within each of these data streams, we want to find anomalies with respect to that data stream’s history.

This idea of defining a data stream entity by attribute values extends to multiple attribute fields. When multiple attribute fields exist but we group by only one of those attributes, we aggregate each entity over the remaining attribute fields. For example, suppose you have network traffic data consisting of two attribute fields and one data field. The attribute fields are a source_ip address and a dest_ip address. The data field is the number of bytes_transferred in that particular network transaction between the given two IP addresses. The following table gives an example of such a dataset.

timestamp source_ip dest_ip bytes_transferred
12:00 pm 138
12:00 pm 21
12:00 pm 5
12:01 pm 289
12:01 pm 10
12:02 pm 244
12:02 pm 16
12:02 pm 8

One way to define an entity is to group by both source IP address and destination IP address combinations. Under this method of defining entities, we end up with the following data streams.

entity: (source_ip, dest_ip) 12:00 pm 12:01 pm 12:02 pm 12:03 pm
(, 138 289 244
(, 25 10 16
(, 5 0 8

On the other hand, if we define an entity only by its source IP address, we aggregate bytes transferred over the possible destination IP addresses.

entity: (source_ip,) 12:00 pm 12:01 pm 12:02 pm 12:03 pm
(,) 159 299 260
(,) 5 0 8

HCAD is distinct from another anomaly detection technique called population analysis. The goal of population analysis is to discover entire entities with values and patterns distinct from other entities. For example, the bytes transferred data stream associated with the entity (, is much larger in value than either of the other entities. Assuming many entities exist with values in the range of 1 to 30, this entity is considered a population anomaly. An entity can be a population data stream contains no anomalies relative to its own history.

Depending on the way we define entities from attributes, the number of data stream entities changes. This is an important consideration with regards to scale and density of the data streams: grouping by too many attributes may leave you with entities that have too few observations for a meaningful data stream. This is not uncommon in real-world datasets. Even in a dataset with only one attribute, real-world data tends to adhere to a power-law scaling of data density. Simply put, the majority of data stream activity occurs in a minority of entities. There is likely a long tail of sparse entities. Given this observation, if the stream aggregation window is too small, there are many missing data points in these sparse entities.

Data stream models for HCAD

We described the HCAD problem, but how do we build a machine learning solution? Furthermore, how is this solution different from the currently available non-HCAD single-stream solution? In this section, we explain our process for model selection and why we arrived at using Random Cut Forests for the high-cardinality regime. We then address scalability problems by exploring RCF’s hyperparameter space. Finally, we address certain issues that arise when dealing with sparse data streams.

Model selection

Designing an HCAD solution has several scientific challenges. Whatever algorithmic solution we arrive at must satisfy several systems constraints:

  • The algorithm must work in a streaming context: aggregated feature queries are streaming in Elasticsearch and the anomaly detection models only receive each new feature aggregate one at a time
  • The HCAD solution must respect the business needs of the customer hardware and should have restricted CPU and memory impact
  • The solution should be scalable with respect to data throughput, number of entities, and number of nodes in the cluster
  • The algorithm must be unsupervised, because the goal is to classify anomalous data in a streaming context without any labeled training set

Our team identified three classes of anomaly detection model based on the relationship between number of entities and number of models:

  • 1:1 model – Each entity is given its own AD model. No data or anomaly information is shared between the models, but because the number of models scales with the number of entities, we must keep the model small to satisfy customer scaling needs.
  • N:1 model – A single AD model is responsible for detecting each entity’s anomalies. Deep learning-based AD models typically fall under this category.
  • N:K model – A subset of entities is assigned to one of several individual models. Typically, some clustering algorithm is used to determine an appropriate partition of entities by identifying common features in the data streams.

Each general class of solution has its own tradeoffs with respect to the ability to distribute across cluster nodes, scale with respect to the number of entities, and detect anomalies on benchmark datasets. After some analysis of these tradeoffs and experimentation, we decided on the 1:1 approach. Within this class of HCAD solution, there are many candidate data stream anomaly detection algorithms. We explored many of these algorithms and tested different lightweight models before deciding on using Random Cut Forests. RCF works particularly well across a wide variety of data stream behaviors. This fit well with our goal of providing support for as wide of a range of customer use cases as possible.

Scaling Random Cut Forests

To keep memory costs down when using RCFs as our AD model, we started by exploring the algorithm’s hyperparameter space. The model has three main hyperparameters:

  • T – Number of trees
  • S – Sample size per free
  • D – Shingle dimension

The RCF model size is O(TDS). Sample size per tree is related to expected anomaly rate, and based on our experiments with a wide variety of datasets, it was best to leave this hyperparameter at its default value of 256 from the single-stream solution. The dimensionality is a function of the customer input but also of the model’s shingle size. We discuss the role of shingle size in the next section. Primarily to satisfy the scaling and model size constraints of the HCAD system, we focused on studying the effect of the number of trees on algorithm performance.

Experiments show that 10 trees per forest gives acceptable results on benchmark datasets; a default number of 100 trees is used in the single-stream solution. In the original plugin, we chose this large number of trees to ensure that the model can keep an accurate sketch of a long enough period of data samples. In doing so, we can recognize long time-scale changes to the data stream. However, we found in our benchmark high-cardinality data streams that this large of a model is unnecessary and that 10 trees is often sufficient for summarizing each high-cardinality data stream’s statistics.

Our experiments measured the precision and recall on labeled data streams. Labels were of the form of anomaly windows: regions in time where an anomaly is known to occur at some point inside the window. A true positive is the positive identification of such a window by the anomaly detection method. Any positively predicted point outside a window is considered a false positive. For an example labeled dataset, see Using Random Cut Forests for real-time anomaly detection in Amazon Elasticsearch Service.

Handling sparse data streams

As mentioned earlier, real-world high-cardinality datasets typically exhibit a power-law like distribution in entity activity. That is, a minority of the entities produce the majority of the data. The earlier source and destination IP address use case is an example: for many websites, the majority of traffic comes from a small collection of sources, whereas individual visitors make up a long tail of sparse activity. Under this assumption, the choice of shingle size is important in defining our entity data streams.

Shingling is a standard preprocessing technique for transforming a one-dimensional data stream xt​ into a d-dimensional data stream st​ by converting subsequences of length d into d-dimensional vectors: st = (xt−d+1​, …, xt−1​, xt​). The following diagram illustrates the shingling process using a shingle size of four.

These vectors, instead of the raw stream values, are then fed into the RCF model. In anomaly detection, using shingling has several benefits. First, shingling filters out small-scale noise in the data. Second, shingles allow the model to detect breaks in certain local patterns or frequency changes. That is, a shingled RCF model learns some of the local temporal behavior of your data stream.

From discussions with our customers and analysis of real-world anomalies, we realized that many customers are looking for distributional anomalies: values that are outside the normal range of values of a data stream. This is in contrast to contextual anomalies, where a data point is considered anomalous in the context of just the data stream’s local history. The following figure depicts this distinction. On the left is a plot of a data stream, and on the right is a histogram of the values attained by this stream in the time window shown. The red data point is a distributional anomaly because its value falls within a low-density regime of the value distribution. The orange data point, on the other hand, is a contextual anomaly: its value is commonly occurring within this span of time but the presence of a spike at this particular point in time is unexpected.

The use of a shingle dimension greater than one allows the RCF model to detect these contextual anomalies in addition to the distributional anomalies.

One challenge with using shingles, however, is how to handle missing data. When data is unavailable at a particular time t, the shingles at times t, t+1, …, t+d−1 cannot be constructed. This results in a delay in the model’s ability to report anomalies. Our the impact of the occasional missing datum by using interpolation. However, when a data stream is sparse, it’s unlikely that any shingle can be constructed, thus turning interpolation into a prediction problem. Whether or not shingling is appropriate for your data is a function of the aggregation window used in the Elasticsearch query and the entity data density.

Scaling anomaly detection in Elasticsearch

In this section, we deep dive into the engineering challenges encountered in building the HCAD tool, particularly regarding the scalability with respect to the number of entities. We first describe the challenges we faced. Then we explain how our HCAD solution balances scalability and resource usage. Finally, we collected these ideas into a description of the overall HCAD framework.

The challenge

As described earlier, our goal was to support filtering the data by attribute or categorical fields and create a separate model for each attribute or categorical value. After examining several real-world use cases, we needed the HCAD plugin to handle millions of categorical values. Processing this many unique values was a challenging scalability issue that affected several key resources:

  • Storage – At the extreme, with 100 1-minute interval detectors and millions of entities for each detector running on our evaluation workload, we have seen the checkpoint index reach up to 170 GB in 1 day.
  • Memory – Compared to the single-stream detector, we could decrease the model size by approximately 20 times by decreasing shingle size and the number of RCF trees. But the number of entities is unbounded.
  • CPU – A single-stream detector mostly runs serial processing. During an HC detector run, multiple entities compete for CPU cycles for model update and inference. The CPU time grows linearly relative to the number of entities processed in each AD job run.

Designing for scalability and resource control

Based on these scalability issues, we chose to extend the current AD architecture because it already had these attributes:

  • Easy to scale out
  • Powerful enough to handle unpredictable scaling requirements
  • Able to control resource usage

However, meeting these challenges for HCAD required three key changes to our existing AD architecture.

First, we placed embarrassingly parallel computations on multiple nodes instead of a coordinating node. The coordinating node acts as the start of the task workflow. It only fetches features and assigns each node in the cluster a portion of the features that is roughly the same in size for all nodes. Other nodes process the features, train and run local models, and write results. Therefore, increasing the number of nodes by a factor of K asymptotically increases the number of categorical values we can handle by the same factor.

Second, in a single-stream detector, the amount of memory used is proportional to the number of features and is fixed when the detector is defined. However, with the introduction of HCAD, the number of entities is not fixed and the number of active entities is likely to change. Therefore, the size of the required memory may continuously change in the lifetime of a detector. Caching can accommodate such requirements without the need to pre-allocate memory for a detector in a fixed amount. If enough memory exists, we create models for all entities and monitor anomalies. Otherwise, we cache the hottest entities’ models up to the amount that the cache memory can contain. For example, if our memory can host only 100 models and there are millions of entities, the maximum active entities in the cache are the hottest 100 entities. We maintain a time-decayed count of each entity. The cache uses this information to measure an entity’s hotness.

Finally, we implemented various strategies for combating the extra overhead of running HC detectors:

  • Rate limiting – We limit concurrent computations and throttle bursty usage. For example, when replacing models in the cache, the cache sends get and search requests to fetch and potentially train models. If there is bursty traffic to replace models, the number of requests might exceed Elasticsearch’s get and search thread pool’s maximum queue size and cause Elasticsearch to reject all get and search requests. We install rate limiting to restrict models’ replacing speed.
  • Active cleanup – This keeps resource usage under a safe level before it’s too late. For example, we keep checkpoints within 3 days. When any of the checkpoint shards is larger than 50 GB (recommended maximum shard size), we start deleting checkpoints more aggressively.
  • Minimizing space usage – For example, in single-stream anomaly detection, we record a model’s running results during each interval. An entity’s model may take time to get ready when there is not enough historical data for training. We don’t need to record such entities’ results because we won’t record anything useful other than that anomaly grade and confidence are both equal to zero. This optimization can reduce the result index size by 4–8 times in one of our experiments.


The following figure summarizes the HCAD architecture.

The end-to-end story of HCAD is as follows:

  1. A user wants to get alerts when an anomaly for a particular entity in the whole corpus arises (for example, high CPU usage on a host).
  2. The user creates an HCAD detector to describe the source data (index name), feature (for example, average CPU usage within an interval), and sampling frequency (for example, 1 minute).
  3. Based on the detector configuration, the AD plugin issues a query to fetch feature data for each host regularly (every 1 minute). Users don’t need to know what hosts to query for in the first place.
  4. A coordinating node infers the entities from the query result.
  5. The coordinating node distributes entities’ features to all nodes in the cluster.
  6. On each node, models are trained for the incoming entities, and anomaly grades are inferred, indicating how different the current CPU usage is from the trends that have recently been observed for the same hosts’ CPU usage.
  7. If cache memory is enough for all incoming entities, the cache admits entities’ models based on the entities’ hotness.

The Kibana workflow

In this section, we show how to use the HCAD in Kibana. Let’s imagine that we need to monitor the high or low CPU usage of our hosts. To do that, we create a detector, define its features, and choose a category field.

Creating a detector

To create and configure a detector, complete the following steps:

  1. On the navigation bar, choose Anomaly detection.
  2. Choose Create detector.

  1. Enter a name and description for the detector.

  1. Choose index or enter index pattern for the data source.
  2. For Timestamp field, choose a field so the detector can create a time series profile of the data.
  3. If you want the detector to ignore specific data (such as invalid CPU usage number), you can configure a data filter.

  1. Specify time frames for detection interval and window delay.

Window delay time should be a conservative estimate. Otherwise, the detector may query for documents within an interval that has not been indexed yet. For this post, we want to have an average CPU usage per minute, and we expect the index processing time to be 1 minute at most.

Defining features

In addition to the preceding settings, we need to add features. The detector aggregates a set of values within a time interval (shingle) to compute the single value according to the feature definition.

  1. Choose Configure model.

  1. For Feature name, enter a name.
  2. Specify your aggregation functions and fields.

We provide five built-in single metric aggregations: Min, Max, Sum, Average, and Count. You can add a customized aggregation by choosing Custom expression for Find anomalies based on. For this post, we add a feature that returns the average of CPU usage values.

As mentioned earlier, you can customize the aggregation method as long as it returns a single value. For example, when a DevOps engineer wants to monitor the count of distinct IPs accessing their company’s Amazon Simple Storage Service (Amazon S3) buckets, they can define a cardinality aggregation that counts unique source IPs.

Choosing a category field

The host-cloudwatch index in our example has CPU usage per host per minute. We can define a single-stream detector to model all of the hosts’ average CPU usage together. But if each host’s CPU values have different distributions, we can split the hosts’ time series and model them separately. Giving each categorical value a separate baseline is the main change that HCAD introduces.

Previewing and starting the detector

You might want to try out different choices of detector configurations and feature definitions before finalizing them. You can use Sample anomalies for iterative experiments.

Start the detector by choosing  Save and start detector. After confirming, the anomaly detector starts collecting data in real time and performing detection.

The detector starts in an initializing state.

We can use the profile API to check initialization progress (see the following code). A detector is initialized if its hottest entities’ models are fully initialized and ready to emit anomaly grade. Because the hottest entity may change, the initialization progress may go backward.

GET _opendistro/_anomaly_detection/detectors/stf6vnUB0XEggmgl3TCj/_profile/init_progress

    "init_progress": {
        "percentage": "92%",
        "estimated_minutes_left": 10,
        "needed_shingles": 10

After the detector runs for a while, we can check its result on the detector’s Anomaly results tab. The following heatmap gives an overview of anomalies per entity across a timeline, by showing the hostname along the Y-axis and the timeline along the X-axis. A colored block means there is an anomaly, and a gray block means there is no anomaly.

Choosing one of the blocks shows you a more detailed view of the anomaly grade and confidence and the feature values causing the anomalies. We can observe the detector reports anomalies between 4:30 and 4:50 because the CPU usage is approaching 100%.

The time series of the host-cloudwatch index confirms host i-WrSNK7zgys has a CPU usage spike between 4:30–4:50.

We can set up alerts for the detection results. For instructions, see Anomaly Detection.


General-purpose anomaly detection is challenging. Earlier in 2020, we launched a tool that can find anomalies in your feature queries. In this work, we extended the anomaly detection capabilities to the high-cardinality case. We can now find anomalies within your data when the data contains attribute or categorical fields. Our solution can discover anomalies across many entities defined by these attribute values and also scale with respect to increasing or decreasing number of entities in your data. We look forward to hearing your questions, comments, and feedback.

About the Authors

Kaituo Li is an engineer in Amazon Elasticsearch Service. He has worked on distributed systems, applied machine learning, monitoring, and database storage in Amazon. Before Amazon, Kaituo was a PhD student in Computer Science at the University of Massachusetts Amherst. He likes reading, watching TV, and sports.



Chris Swierczewski is an applied scientist at AWS. He enjoys reading, weightlifting, painting, and board games.

Automating Index State Management for Amazon ES

Post Syndicated from Satya Vajrapu original https://aws.amazon.com/blogs/big-data/automating-index-state-management-for-amazon-es/

When it comes to time-series data, it’s more common to access new data over existing data, such as the last 4 hours or 1 day. Often, application teams are tasked with maintaining multiple indexes for diverse data workloads, which brings new requirements to set up a custom solution to manage the indexes’ lifecycle. This becomes tedious as the indexes grow and result in housekeeping overheads.

Amazon Elasticsearch Service (Amazon ES) now enables you to automate recurring index management activities. This avoids using any additional tools to manage the index lifecycle inside Elasticsearch. With Index State Management (ISM), you can create a policy that automates these operations based on index age, size, and other conditions, all from within your Amazon ES domain.

In this post, I discuss how you can implement a sample policy to automate routine index management tasks and apply them to indexes and index patterns.


Before you get started, make sure you complete the following prerequisites:

  1. Have Elasticsearch 6.8 or later (required to use ISM and Ultrawarm).
  2. Set up a new Amazon ES domain with UltraWarm enabled.
  3. Make sure your user role has sufficient permissions to access the Kibana console of the Amazon ES domain. If required, validate and configure the access to your domains.

Use case

Ultrawarm for Amazon ES is a new low-cost storage tier that provides fast, interactive analytics on up to three petabytes of log data at one-tenth of the cost of the current Amazon ES storage tier. Although hot storage is used for indexing and providing fastest access, Ultrawarm complements the hot storage tier by providing less expensive storage for older and less-frequently accessed data, all while maintaining the same interactive analytics experience. Rather than attached storage, UltraWarm nodes use Amazon Simple Storage Service (Amazon S3) and a sophisticated caching solution to improve performance.

To demonstrate the functionality, I present a sample use case of handling time-series data. In this use case, we migrate a set of existing indexes that are initially in hot state and migrate them to Ultrawarm storage after a day. Upon migration, the data is stored in a service-managed S3 bucket as read only. We then delete the index after 2 days, assuming that index is no longer needed.

After we create the Amazon ES domain, complete the following steps:

  1. Log in using the Kibana UI endpoint.
  2. Wait for the domain status to turn active and choose the Kibana endpoint.
  3. On Kibana’s splash page, add all the sample data listed by choosing Try our sample data and choosing Add data.
  4. After adding the data, choose Index Management (the IM icon on the left navigation pane), which lands into the Index Policies page.
  5. Choose Create policy.
  6. For Name policy, enter ism-policy-sample.
  7. Replace the default policy with the following code:
        "policy": {
            "description": "Lifecycle Management Policy",
            "default_state": "hot",
            "states": [
                    "name": "hot",
                    "actions": [],
                    "transitions": [
                            "state_name": "warm",
                            "conditions": {
                                "min_index_age": "1d"
                    "name": "warm",
                    "actions": [
                            "retry": {
                                "count": 5,
                                "backoff": "exponential",
                                "delay": "1h"
                            "warm_migration": {}
                    "transitions": [
                            "state_name": "delete",
                            "conditions": {
                                "min_index_age": "2d"
                    "name": "delete",
                    "actions": [
                            "notification": {
                                "destination": {
                                    "chime": {
                                        "url": "<CHIME_WEBHOOK_URL>"
                                "message_template": {
                                    "source": "The index {{ctx.index}} is being deleted because of actioned policy {{ctx.policy_id}}",
                                    "lang": "mustache"
                            "delete": {}
                    "transitions": []

You can also use the ISM operations to programmatically work with policies and managed indexes. For example, to attach an ISM policy to an index at the time of creation, you invoke an API action. See the following code:

PUT index_1
  "settings": {
    "opendistro.index_state_management.policy_id": "ingest_policy",
    "opendistro.index_state_management.rollover_alias": "some_alias"
  "aliases": {
    "some_alias": {
      "is_write_index": true

In this case, the ingest_policy is applied to index_1 with the rollover action defined in some_alias. For the list of complete ISM programmatic operations to work with policies and managed policies, see ISM API.

  1. Choose Create. You can now see your index policy on the Index Policies page.
  2. On the Indices page, search for kibana_sample, which should list all the sample data indexes you added earlier.
  3. Select all the indexes and choose Apply policy.
  4. From the Policy ID drop-down menu, choose the policy created in the previous step.
  5. Choose Apply.

The policy is now assigned and starts managing the indexes. On the Managed Indices page, you can observe the status as Initializing.

When initialization is complete, the status changes to Running.

You can also set a refresh frequency to refresh the managed indexes’ status information.

Demystifying the policy

In this section, I explain about the index policy tenets and how they’re structured.

Policies are JSON documents that define the following:

  • The states an index can be in
  • Any actions you want the plugin to take when an index enters the state
  • Conditions that must be met for an index to move or transition into a new state

The policy document begins with basic metadata like description, the default_state the index should enter, and finally a series of state definitions.

A state is the status that the managed index is currently in. A managed index can only be in one state at a time. Each state has associated actions that are run sequentially upon entering a state and transitions that are checked after all the actions are complete.

The first state is hot. In this use case, no actions are defined in this hot state; the managed indexes land in this state initially and then transition to warm. Transitions define the conditions that need to be met for a state to change (in this case, change to warm after the index crosses 24 hours). See the following code:

                "name": "hot",
                "actions": [],
                "transitions": [
                        "state_name": "warm",
                        "conditions": {
                            "min_index_age": "1d"

We can quickly verify the states on the console. The current state is hot and attempts the transition to warm after 1 day. The transition typically completes within an hour and is reflected under the Status column.

The warm state has actions defined to move the index to Ultrawarm storage. When the actions run successfully, the state has another transition to delete after the index ages 2 days. See the following code:

                "name": "warm",
                "actions": [
                        "retry": {
                            "count": 5,
                            "backoff": "exponential",
                            "delay": "1h"
                        "warm_migration": {}
                "transitions": [
                        "state_name": "delete",
                        "conditions": {
                            "min_index_age": "2d"

You can again verify the status of the managed indexes.

You can also verify this on the Amazon ES console under the Ultrawarm Storage usage column.

The third state of the policy document marks the indexes to delete based on the actions. This policy state assumes your index is non-critical and no longer receiving write requests; having zero replicas carries some risk of data loss.

The final delete state has two actions defined. The first action is self-explanatory; it sends a notification as defined in the message_template to the destination. See the following code:

                "name": "delete",
                "actions": [
                        "notification": {
                            "destination": {
                                "chime": {
                                    "url": "<CHIME_WEBHOOK_URL>"
                            "message_template": {
                                "source": " The index {{ctx.index}} is being deleted because of actioned policy {{ctx.policy_id}}",
                                "lang": "mustache"
                        "delete": {}
                "transitions": []

I have configured the notification endpoint to be on Amazon Chime <CHIME_WEBHOOK_URL>. For more information about using webhooks, see Webhooks for Amazon Chime.

You can also configure the notification to send to destinations like Slack or a webhook URL.

At this state, I have received the notification on the Chime webhook (see the following screenshot).

The following screenshot shows the index status on the console.

After the notification is successfully sent, the policy runs the next action in the state that is deleting the indexes. After this final state, the indexes no longer appear on the Managed Indices page.

Additional information on ISM policies

If you have an existing Amazon ES cluster with no Ultrawarm support (because of any missing prerequisite), you can use policy operations read_only and reduces_replicas to replace the warm state. The following code is the policy template for these two states:

                "name": "reduce_replicas",
                "actions": [{
                  "replica_count": {
                    "number_of_replicas": 0
                "transitions": [{
                  "state_name": "read_only",
                  "conditions": {
                    "min_index_age": "2d"
                "name": "read_only",
                "actions": [
                        "read_only": {}
                "transitions": [
                        "state_name": "delete",
                        "conditions": {
                            "min_index_age": "3d"


In this post, you learned how to use the index state management feature on Ultrawarm for Amazon ES. The walkthrough illustrated how to manage indexes using this plugin with a sample lifecycle policy.

For more information about the ISM plugin, see Index State Management. If you need enhancements or have other feature requests, please file an issue. To get involved with the project, see Contributing Guidelines.

A big takeaway for me as I evaluated the ISM plugin in Amazon ES was that the ISM plugin is fully compatible and works on Open Distro for Elasticsearch. For more information, see Index State Management in Open Distro for Elasticsearch. It can be useful for using Open Distro for Elasticsearch as an on-premises or internal solution while using a managed service for your production workloads.

About the Author

Satya Vajrapu is a DevOps Consultant with Amazon Web Services. He works with AWS customers to help design and develop various practices and tools in the DevOps toolchain.

Normalize data with Amazon Elasticsearch Service ingest pipelines

Post Syndicated from Vijay Injam original https://aws.amazon.com/blogs/big-data/normalize-data-with-amazon-elasticsearch-service-ingest-pipelines/

Amazon Elasticsearch Service (Amazon ES) is a fully managed service that makes it easy for you to deploy, secure, and run Elasticsearch cost-effectively at scale. Search and log analytics are the two most popular use cases for Amazon ES. In log analytics at scale, a common pattern is to create indexes from multiple sources. In these use cases, how can you ensure that all the incoming data follows a specific, predefined format if it’s operationally not feasible to apply checks in each data source? You can use Elasticsearch ingest pipelines to normalize all the incoming data and create indexes with the predefined format.

What’s an ingest pipeline?

An ingest pipeline lets you use some of your Amazon ES domain processing power to apply to a set of processors during indexing. Ingest pipeline applies processors in order, the output of one processor moving to the next processor in the pipe. You define a pipeline with the Elasticsearch _ingest API. The following screenshot illustrates this architecture.

To find the available ingest processors in your Amazon ES domain, enter the following code:

GET _ingest/pipeline/

Solution overview

In this post, we discuss three log analytics use cases where data normalization is a common technique.

We create three pipelines and normalize the data for each use case. The following diagram illustrates this architecture.

Use case 1

In this first use case, Amazon ES domain has three sources: logstash, Fluentd, and AWS Lambda. Your logstash source sends the data to an index with the name index-YYYY.MM.DD.HH (hours in the end). When you have an error in the Fluentd source, it creates the index named index-YYYY.MM.DD (missing the hours). Your domain creates indexes for both the formats, which is not what you intended.

One way to correct the index name is to calculate the hours of the ingested data and assign the value to the index. If you can’t identify any pattern, or identify further issues to the indexing name, you need to segregate the data to a different index (for example, format_error) for further analysis.

Use case 2

If your application uses time-series data and analyzes data from fixed time windows, your data sources can sometimes send data from a prior time window. In this use case, you need to check for the incoming data and discard data that doesn’t fit in the current time window.

Use case 3

In some use cases, the value for a key can contain large strings with common prefixes. End-users typically use wild card characters (*) with the prefix to search on these fields. If your application or Kibana dashboards contain several wild card queries, it can increase CPU utilization and overall search lateness. You can address this by identifying the prefixes from the values and creating a new field with the data type as a keyword. You can use Term queries for the keywords and improve search performance.

Pipeline 1: pipeline_normalize_index

The default pipeline for incoming data is pipeline_normalize_index. This pipeline performs the following actions:

  • Checks if the incoming data belongs to the current date.
  • Checks if the data has any errors in the index name.
  • Segregates the data:
    • If it doesn’t find any errors, it pushes the data to pipeline_normalize_data.
    • If it finds errors, it pushes the pipeline to pipeline_fix_index.

Checking the index date

In this step, you can create an index pipeline using a script processor, which lets you create a script and execute within the pipeline.

Use the Set processor to add _ingest.timestamp to doc_received_date and compare the index date to the document received date. The script processor lets you create a script using painless scripts. You can create a script to check if the index date matches the doc_received_date. The script processor let you access the ingest document using the ctx variable. See the following code:

            "source": """
                    DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyy.MM.dd");
                    String dateandhour = ctx._index.substring(ctx._index.indexOf('-') + 1);
                    LocalDate indexdate = LocalDate.parse(dateandhour.substring(0, 10), formatter);
                    ZonedDateTime zonedDateTime = ZonedDateTime.parse(ctx.doc_received_date, DateTimeFormatter.ISO_DATE_TIME);
                    LocalDate doc_received_date = zonedDateTime.toLocalDate();
                    if (doc_received_date.isEqual(indexdate)) {
                        ctx.index_purge = "N";
                    } else {
                        ctx.index_purge = "Y";
                    if (dateandhour.length() > 10) {
                        ctx.indexformat_error = "N";
                    } else {
                        ctx.indexformat_error = "Y";

Checking for index name errors

You can use the same script processor from the previous step to check if the index name matches the format index-YYYY.MM.DD.HH or index-YYYY.MM.DD. See the following code:

if (dateandhour.length() > 10) {
                        ctx.indexformat_error = "N";
                    } else {
                        ctx.indexformat_error = "Y";

Segregating the data

If the index date doesn’t match the _ingest.timestamp, you can drop the request using the drop processor. If the index name doesn’t match the format index-YYYY.MM.DD, you can segregate the data to pipeline pipeline_verify_index_date and proceed to the pipeline pipeline_normalize_data. If conditions aren’t met, you can proceed to the pipeline pipeline_indexformat_errors or assign a default index indexing_errors. If no are issues found, you proceed to the pipeline pipeline_normalize_data. See the following code:

            "if":"ctx.indexformat_error == 'Y'",

The following code is an example pipeline:

PUT _ingest/pipeline/pipeline_normalize_index
            "source": """
                    DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyy.MM.dd");
                    String dateandhour = ctx._index.substring(ctx._index.indexOf('-') + 1);
                    LocalDate indexdate = LocalDate.parse(dateandhour.substring(0, 10), formatter);
                    ZonedDateTime zonedDateTime = ZonedDateTime.parse(ctx.doc_received_date, DateTimeFormatter.ISO_DATE_TIME);
                    LocalDate doc_received_date = zonedDateTime.toLocalDate();
                    if (doc_received_date.isEqual(indexdate)) {
                        ctx.index_purge = "N";
                    } else {
                        ctx.index_purge = "Y";
                    if (dateandhour.length() > 10) {
                        ctx.indexformat_error = "N";
                    } else {
                        ctx.indexformat_error = "Y";
                  "value":"at Script processor - Purge older Index or Index date error"
            "if":"ctx.index_purge == 'Y'"
            "if":"ctx.indexformat_error == 'Y'",

Pipeline 2: pipeline_normalize_data

The pipeline pipeline_normalize_data fixes index data. It extracts the prefix from the defined field and creates a new field. You can use the new field for Term queries.

In this step, you can use a grok processor to extract prefixes from the existing fields and create a new field that you can use for term queries. The output of this pipeline creates the index. See the following code of an example pipeline:

PUT _ingest/pipeline/pipeline_normalize_data
                    "value":"application_type error"

Pipeline 3: pipeline_fix_index

This pipeline fixes the index name. The indexing errors identified in pipeline_normalize_Index are the incoming data points for this pipeline. pipeline_fix_index extracts the hours from the _ingest.timestamp and appends it to the index name.

The index name errors identified from Pipeline 1 are the data source for this pipeline. You can use the script processor to write a painless script. The script extracts hours (HH) from the _ingest.timestamp and appends it to the _index. See the following code of the example pipeline:

PUT _ingest/pipeline/pipeline_fix_index_name
               ZonedDateTime zonedDateTime = ZonedDateTime.parse(ctx.doc_received_date, DateTimeFormatter.ISO_DATE_TIME);
               LocalDate doc_received_date = zonedDateTime.toLocalDate();
               String receiveddatehour = zonedDateTime.getHour().toString();
               if (zonedDateTime.getHour() < 10) {
                    receiveddatehour = "0" + zonedDateTime.getHour();
               ctx._index = ctx._index + "." + receiveddatehour;
                        "value":"at Script processor - Purge older Index or Index date error"

Adding the default pipeline to the index template

After creating all the pipelines, add the default pipeline to the index template. See the following code:

"default_pipeline" : "pipeline_normalize_index"


You can normalize data, fix indexing errors, and segregate operation data and anomalies by using ingest pipelines. Although you can use one pipeline with several processors (depending on the use case), indexing pipelines provides an efficient way to utilize compute resources and operational resources by eliminating unwanted indexes.


About the Author

Vijay Injam is a Data Architect with Amazon Web Services.






Kevin Fallis is an AWS specialist search solutions architect. His passion at AWS is to help customers leverage the correct mix of AWS services to achieve success for their business goals. His after-work activities include family, DIY projects, carpentry, playing drums, and all things music.

Field Notes: Monitoring the Java Virtual Machine Garbage Collection on AWS Lambda

Post Syndicated from Steffen Grunwald original https://aws.amazon.com/blogs/architecture/field-notes-monitoring-the-java-virtual-machine-garbage-collection-on-aws-lambda/

When you want to optimize your Java application on AWS Lambda for performance and cost the general steps are: Build, measure, then optimize! To accomplish this, you need a solid monitoring mechanism. Amazon CloudWatch and AWS X-Ray are well suited for this task since they already provide lots of data about your AWS Lambda function. This includes overall memory consumption, initialization time, and duration of your invocations. To examine the Java Virtual Machine (JVM) memory you require garbage collection logs from your functions. Instances of an AWS Lambda function have a short lifecycle compared to a long-running Java application server. It can be challenging to process the logs from tens or hundreds of these instances.

In this post, you learn how to emit and collect data to monitor the JVM garbage collector activity. Having this data, you can visualize out-of-memory situations of your applications in a Kibana dashboard like in the following screenshot. You gain actionable insights into your application’s memory consumption on AWS Lambda for troubleshooting and optimization.

The lifecycle of a JVM application on AWS Lambda

Let’s first revisit the lifecycle of the AWS Lambda Java runtime and its JVM:

  1. A Lambda function is invoked.
  2. AWS Lambda launches an execution context. This is a temporary runtime environment based on the configuration settings you provide, like permissions, memory size, and environment variables.
  3. AWS Lambda creates a new log stream in Amazon CloudWatch Logs for each instance of the execution context.
  4. The execution context initializes the JVM and your handler’s code.

You typically see the initialization of a fresh execution context when a Lambda function is invoked for the first time, after it has been updated, or it scales up in response to more incoming events.

AWS Lambda maintains the execution context for some time in anticipation of another Lambda function invocation. In effect, the service freezes the execution context after a Lambda function completes. It thaws the execution context when the Lambda function is invoked again if AWS Lambda chooses to reuse it.

During invocations, the JVM also maintains garbage collection as usual. Outside of invocations, the JVM and its maintenance processes like garbage collection are also frozen.

Garbage collection and indicators for your application’s health

The purpose of JVM garbage collection is to clean up objects in the JVM heap, which is the space for an application’s objects. It finds objects which are unreachable and deletes them. This frees heap space for other objects.

You can make the JVM log garbage collection activities to get insights into the health of your application. One example for this is the free heap after each garbage collection. If this metric keeps shrinking, it is an indicator for a memory leak – eventually turning into an OutOfMemoryError. If there is not enough of free heap, the JVM might be too busy with garbage collection instead of running your application code. Otherwise, a heap that is too big does indicate that there’s potential to decrease the memory configuration of your AWS Lambda function. This keeps garbage collection pauses low and provides a consistent response time.

The garbage collection logging can be configured via an environment variable as part of the AWS Lambda function configuration. The environment variable JAVA_TOOL_OPTIONS is considered by both the Java 8 and 11 JVMs. You use it to pass options that you would usually add to the command line when launching the JVM. The options to configure garbage collection logging and the output is specific to the Java version.

Java 11 uses the Unified Logging System (JEP 158 and JEP 271) which has been introduced in Java 9. Logging can be configured with the environment variable:


The Serial Garbage Collector will output the logs:

[<TIMESTAMP>][gc] GC(4) Pause Full (Allocation Failure) 9M->9M(11M) 3.941ms (D)
[<TIMESTAMP>][gc,heap] GC(3) DefNew: 3063K->234K(3072K) (A)
[<TIMESTAMP>][gc,heap] GC(3) Tenured: 6313K->9127K(9152K) (B)
[<TIMESTAMP>][gc,metaspace] GC(3) Metaspace: 762K->762K(52428K) (C)
[<TIMESTAMP>][gc] GC(3) Pause Young (Allocation Failure) 9M->9M(21M) 23.559ms (D)

Prior to Java 9, including Java 8, you configure the garbage collection logging as follows:

JAVA_TOOL_OPTIONS=-XX:+PrintGCDetails -XX:+PrintGCDateStamps

The Serial garbage collector output in Java 8 is structured differently:

<TIMESTAMP>: [GC (Allocation Failure)
    <TIMESTAMP>: [DefNew: 131042K->131042K(131072K), 0.0000216 secs] (A)
    <TIMESTAMP>: [Tenured: 235683K->291057K(291076K), 0.2213687 secs] (B)
    366725K->365266K(422148K), (D)
    [Metaspace: 3943K->3943K(1056768K)], (C)
    0.2215370 secs]
    [Times: user=0.04 sys=0.02, real=0.22 secs]
<TIMESTAMP>: [Full GC (Allocation Failure)
    <TIMESTAMP>: [Tenured: 297661K->36658K(297664K), 0.0434012 secs] (B)
    431575K->36658K(431616K), (D)
    [Metaspace: 3943K->3943K(1056768K)], 0.0434680 secs] (C)
    [Times: user=0.02 sys=0.00, real=0.05 secs]

Independent of the Java version, the garbage collection activities are logged to standard out (stdout) or standard error (stderr). Logs appear in the AWS Lambda function’s log stream of Amazon CloudWatch Logs. The log contains the size of memory used for:

  • A: the young generation
  • B: the old generation
  • C: the metaspace
  • D: the entire heap

The notation is before-gc -> after-gc (committed heap). Read the JVM Garbage Collection Tuning Guide for more details.

Visualizing the logs in Amazon Elasticsearch Service

It is hard to fully understand the garbage collection log by just reading it in Amazon CloudWatch Logs. You must visualize it to gain more insight. This section describes the solution to achieve this.

Solution Overview

Java Solution Overview

Amazon CloudWatch Logs have a feature to stream CloudWatch Logs data to Amazon Elasticsearch Service via an AWS Lambda function. The AWS Lambda function for log transformation is subscribed to the log group of your application’s AWS Lambda function. The subscription filters for a pattern that matches the one of the garbage collection log entries. The log transformation function processes the log messages and puts it to a search cluster. To make the data easy to digest for the search cluster, you add code to transform and convert the messages to JSON. Having the data in a search cluster, you can visualize it with Kibana dashboards.

Get Started

To start, launch the solution architecture described above as a prepackaged application from the AWS Serverless Application Repository. It contains all resources ready to visualize the garbage collection logs for your Java 11 AWS Lambda functions in a Kibana dashboard. The search cluster consists of a single t2.small.elasticsearch instance with 10GB of EBS storage. It is protected with Amazon Cognito User Pools so you only need to add your user(s). The T2 instance types do not support encryption of data at rest.

Read the source code for the application in the aws-samples repository.

1. Spin up the application from the AWS Serverless Application Repository:

launch stack button

2. As soon as the application is deployed completely, the outputs of the AWS CloudFormation stack provide the links for the next steps. You will find two URLs in the AWS CloudFormation console called createUserUrl and kibanaUrl.

search stack

3. Use the createUserUrl link from the outputs, or navigate to the Amazon Cognito user pool in the console to create a new user in the pool.

a. Enter an email address as username and email. Enter a temporary password of your choice with at least 8 characters.

b. Leave the phone number empty and uncheck the checkbox to mark the phone number as verified.

c. If necessary, you can check the checkboxes to send an invitation to the new user or to make the user verify the email address.

d. Choose Create user.

create user dialog of Amazon Cognito User Pools

4. Access the Kibana dashboard with the kibanaUrl link from the AWS CloudFormation stack outputs, or navigate to the Kibana link displayed in the Amazon Elasticsearch Service console.

a. In Kibana, choose the Dashboard icon in the left menu bar

b. Open the Lambda GC Activity dashboard.

You can test that new events appear by using the Kibana Developer Console:

POST gc-logs-2020.09.03/_doc
  "@timestamp": "2020-09-03T15:12:34.567+0000",
  "@gc_type": "Pause Young",
  "@gc_cause": "Allocation Failure",
  "@heap_before_gc": "2",
  "@heap_after_gc": "1",
  "@heap_size_gc": "9",
  "@gc_duration": "5.432",
  "@owner": "123456789012",
  "@log_group": "/aws/lambda/myfunction",
  "@log_stream": "2020/09/03/[$LATEST]123456"

5. When you go to the Lambda GC Activity dashboard you can see the new event. You must select the right timeframe with the Show dates link.

Lambda GC activity

The dashboard consists of six tiles:

  • In the Filters you optionally select the log group and filter for a specific AWS Lambda function execution context by the name of its log stream.
  • In the GC Activity Count by Execution Context you see a heatmap of all filtered execution contexts by garbage collection activity count.
  • The GC Activity Metrics display a graph for the metrics for all filtered execution contexts.
  • The GC Activity Count shows the amount of garbage collection activities that are currently displayed.
  • The GC Duration show the sum of the duration of all displayed garbage collection activities.
  • The GC Activity Raw Data at the bottom displays the raw items as ingested into the search cluster for a further drill down.

Configure your AWS Lambda function for garbage collection logging

1. The application that you want to monitor needs to log garbage collection activities. Currently the solution supports logs from Java 11. Add the following environment variable to your AWS Lambda function to activate the logging.


The environment variables must reflect this parameter like the following screenshot:

environment variables

2. Go to the streamLogs function in the AWS Lambda console that has been created by the stack, and subscribe it to the log group of the function you want to monitor.

streamlogs function

3. Select Add Trigger.

4. Select CloudWatch Logs as Trigger Configuration.

5. Input a Filter name of your choice.

6. Input "[gc" (including quotes) as the Filter pattern to match all garbage collection log entries.

7. Select the Log Group of the function you want to monitor. The following screenshot subscribes to the logs of the application’s function resize-lambda-ResizeFn-[...].

add trigger

8. Select Add.

9. Execute the AWS Lambda function you want to monitor.

10. Refresh the dashboard in Amazon Elasticsearch Service and see the datapoint added manually before appearing in the graph.

Troubleshooting examples

Let’s look at an example function and draw some useful insights from the Java garbage collection log. The following diagrams show the Sample Amazon S3 function code for Java from the AWS Lambda documentation running in a Java 11 function with 512 MB of memory.

  • An S3 event from a new uploaded image triggers this function.
  • The function loads the image from S3, resizes it, and puts the resized version to S3.
  • The file size of the example image is close to 2.8MB.
  • The application is called 100 times with a pause of 1 second.

Memory leak

For the demonstration of a memory leak, the function has been changed to keep all source images in memory as a class variable. Hence the memory of the function keeps growing when processing more images:

GC activity metrics

In the diagram, the heap size drops to zero at timestamp 12:34:00. The Amazon CloudWatch Logs of the function reveal an error before the next call to your code in the same AWS Lambda execution context with a fresh JVM:

Java heap space: java.lang.OutOfMemoryError
java.lang.OutOfMemoryError: Java heap space
 at java.desktop/java.awt.image.DataBufferByte.<init>(Unknown Source)

The JVM crashed and was restarted because of the error. You leverage primarily the Amazon CloudWatch Logs of your function to detect errors. The garbage collection log and its visualization provide additional information for root cause analysis:

Did the JVM run out of memory because a single image to resize was too large?

Or was the memory issue growing over time?

The latter could be an indication that you have a memory leak in your code.

The Heap size is too small

For the demonstration of a heap that was chosen too small, the memory leak from the preceding image has been resolved, but the function was configured to 128MB of memory. From the baseline of the heap to the maximum heap size, there are only approximately 5 MB used.

GC activity metrics

This will result in a high management overhead of your JVM. You should experiment with a higher memory configuration to find the optimal performance also taking cost into account. Check out AWS Lambda power tuning open source tool to do this in an automated fashion.

Finetuning the initial heap size

If you review the development of the heap size at the start of an execution context, this indicates that the heap size is continuously increased. Each heap size change is an expensive operation consuming time of your function. Over time, the heap size is changed as well. The garbage collector logs 502 activities, which take almost 17 seconds overall.

GC activity metrics

This on-demand scaling is useful on a local workstation where the physical memory is shared with other applications. On AWS Lambda, the configured memory is dedicated to your function, so you can use it to its full extent.

You can do so by setting the minimum and maximum heap size to a fixed value by appending the -Xms and -Xmx parameters to the environment variable we introduced before.

The heap is not the only part of the JVM that consumes memory, so you must experiment with this setting and closely monitor the performance.

Start with the heap size that you observe to be working from the garbage collection log. If you set the heap size too large, your function will not initialize at all or break unexpectedly. Remember that the ability to tweak JVM parameters might change with future service features.

Let’s set 400 MB of the 512 MB memory and examine the results:

JAVA_TOOL_OPTIONS=-Xlog:gc:stderr:time,tags -Xms400m -Xmx400m

GC activity metrics

The preceding dashboard shows that the overall garbage collection duration was reduced by about 95%. The garbage collector had 80% fewer activities.

The garbage collection log entries displayed in the dashboard reveal that exclusively minor garbage collection (Pause Young) activities were triggered instead of major garbage collections (Pause Full). This is expected as the images are immediately discarded after the download, resize, upload operation. The effect on the overall function durations of 100 invocations, is a 5% decrease on average in this specific case.

Lambda duration

Cost estimation and clean up

Cost for the processing and transformation of your function’s Amazon CloudWatch Logs incurs when your function is called. This cost depends on your application and how often garbage collection activities are triggered. Read an estimate of the monthly cost for the search cluster. If you do not need the garbage collection monitoring anymore, delete the subscription filter from the log group of your AWS Lambda function(s). Also, delete the stack of the solution above in the AWS CloudFormation console to clean up resources.


In this post, we examined further sources of data to gain insights about the health of your Java application. We also demonstrated a pipeline to ingest, transform, and visualize this information continuously in a Kibana dashboard. As a next step, launch the application from the AWS Serverless Application Repository and subscribe it to your applications’ logs. Feel free to submit enhancements to the application in the aws-samples repository or provide feedback in the comments.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Analyzing Amazon S3 server access logs using Amazon ES

Post Syndicated from Mahesh Goyal original https://aws.amazon.com/blogs/big-data/analyzing-amazon-s3-server-access-logs-using-amazon-es/

When you use Amazon Simple Storage Service (Amazon S3) to store corporate data and host websites, you need additional logging to monitor access to your data and the performance of your application. An effective logging solution enhances security and improves the detection of security incidents. With the advent of increased data storage needs, you can rely on Amazon S3 for a range of use cases and simultaneously looking for ways to analyze your logs to ensure compliance, perform the audit, and discover risks.

Amazon S3 lets you monitor the traffic using the server access logging feature. With server access logging, you can capture and monitor the traffic to your S3 bucket at any time, with detailed information about the source of the request. The logs are stored in the S3 bucket you own in the same Region. This addresses the security and compliance requirements of most organizations. The logs are critical for establishing baselines, analyzing access patterns, and identifying trends. For example, the logs could answer a financial organization’s question about how many requests are made to a bucket and who is making what type of access requests to the objects.

You can discover insights from server access logs through several different methods. One common option is by using Amazon Athena or Amazon Redshift Spectrum and query the log files stored in Amazon S3. However, this solution poses high latency with an exponential growth in volume. It requires further integration with Amazon QuickSight to add visualization capabilities.

You can address this by using Amazon Elasticsearch Service (Amazon ES). Amazon ES is a managed service that makes it easier to deploy, operate, and scale Elasticsearch clusters in the AWS Cloud. Elasticsearch is a popular open-source search and analytics engine for use cases such as log analytics, real-time application monitoring, and clickstream analysis. The service provides support for open-source Elasticsearch APIs, managed Kibana, and integration with other AWS services such as Amazon S3 and Amazon Kinesis for loading streaming data into Amazon ES.

This post walks you through automating ingestion of server access logs from Amazon S3 into Amazon ES using AWS Lambda and visualizing the data in Kibana.

Architecture overview

Server access logging is enabled on source buckets, and logs are delivered to access log bucket. The access log bucket is configured to send an event to the Lambda function when a log file is created. On an event trigger, the Lambda function reads the file, processes the access log, and sends it to Amazon ES. When the logs are available, you can use Kibana to create interactive visuals and analyze the logs over a time period.

When designing a log analytics solution for high-frequency incoming data, you should consider buffering layers to avoid instability in the system. Buffering helps you streamline processes for unpredictable incoming log data. For such use cases, you can take advantage of managed services like Amazon Kinesis Data Streams, Amazon Kinesis Data Firehose, and Amazon Managed Streaming for Apache Kafka (Amazon MSK).

Streaming services buffer data before delivering it to Amazon ES. This helps you avoid overwhelming your cluster with spiky ingestion events. Kinesis Data Firehose can reliably load data into Amazon ES. Kinesis Data Firehose lets you choose a buffer size of 1–100 MiBs and a buffer interval of 60–900 seconds when Amazon ES is selected as the destination. Kinesis Data Firehose also scales automatically to match the throughput of your data and requires no ongoing administration. For more information, see Ingest streaming data into Amazon Elasticsearch Service within the privacy of your VPC with Amazon Kinesis Data Firehose.

The following diagram illustrates the solution architecture.


Before creating resources in AWS CloudFormation, you must enable server access logging on the source bucket. Open the S3 bucket properties and look for Amazon S3 access and delivery bucket. See the following screenshot.

You also need an AWS Identity and Access Management (IAM) user with sufficient permissions to interact with the AWS Management Console and related AWS services. The user must have access to create IAM roles and policies via the CloudFormation template.

Setting up the resources with AWS CloudFormation

First, deploy the CloudFormation template to create the core components of the architecture. AWS CloudFormation automates the deployment of technology and infrastructure in a safe and repeatable manner across multiple Regions and multiple accounts with the least amount of effort and time.

  1. Sign in to the console and choose the Region of the bucket storing the access log. For this post, I use us-east-1.
  2. Launch the stack:
  3. Choose Next.
  4. For Stack name, enter a name.
  5. On the Parameters page, enter the following parameters:
    1. VPC Configuration – Select any VPC that has at least two private subnets. The template deploys the Amazon ES service domain and Lambda within the VPC.
    2. Private subnets – Select two private subnets of the VPC. The route tables associated with subnets must have a NAT gateway configuration and VPC endpoint for Amazon S3 to privately connect the bucket from Lambda.
    3. Access log S3 bucket – Enter the S3 bucket where access logs are delivered. The template configures event notification on the bucket to trigger the Lambda function.
    4. Amazon ES domain name – Specify the Amazon ES domain name to be deployed through the template.
  6. Choose Next.
  7. On the next page, choose Next.
  8. Acknowledge resource creation under Capabilities and transforms and choose Create.

The stack takes about 10–15 minutes to complete. The CloudFormation stack does the following:

  • Creates an Amazon ES domain with fine-grained access control enabled on it. Fine-grained access control is configured with a primary user in the internal user database.
  • Creates IAM role for the Lambda function with required permission to read from S3 bucket and write to Amazon ES.
  • Creates Lambda within the same VPC of Amazon ES elastic network interfaces (ENI). Amazon ES places an ENI in the VPC for each of your data nodes. The communication from Lambda to the Amazon ES domain is via this ENI.
  • Configures file create event notification on Access log S3 bucket to trigger the Lambda function. The function code segments are discussed in detail in this GitHub project.

You must make several considerations before you proceed with a production-grade deployment. For this post, I use one primary shard with no replicas. As a best practice, we recommend deploying your domain into three Availability Zones with at least two replicas. This configuration lets Amazon ES distribute replica shards to different Availability Zones than their corresponding primary shards and improves the availability of your domain. For more information about sizing your Amazon ES, see Get started with Amazon Elasticsearch Service: T-shirt-size your domain.

We recommend setting the shard count based on your estimated index size, using 50 GB as a maximum target shard size. You should also define an index template to set the primary and replica shard counts before index creation. For more information about best practices, see Best practices for configuring your Amazon Elasticsearch Service domain.

For high-frequency incoming data, you can rotate indexes either per day or per week depending on the size of data being generated. You can use Index State Management to define custom management policies to automate routine tasks and apply them to indexes and index patterns.

Creating the Kibana user

With Amazon ES, you can configure fine-grained users to control access to your data. Fine-grained access control adds multiple capabilities to give you tighter control over your data. This feature includes the ability to use roles to define granular permissions for indexes, documents, or fields and to extend Kibana with read-only views and secure multi-tenant support. For more information on granular access control, see Fine-Grained Access Control in Amazon Elasticsearch Service.

For this post, you create a fine-grained role for Kibana access and map it to a user.

  1. Navigate to Kibana and enter the primary user credentials:
    1. User nameadminuser01
    2. Password[email protected]

To access Kibana, you must have access to the VPC. For more information about accessing Kibana, see Controlling Access to Kibana.

  1. Choose Security, Roles.
  2. For Role name, enter kibana_only_role.
  3. For Cluster-wide permissions, choose cluster_composite_ops_ro.
  4. For Index patterns, enter access-log and kibana.
  5. For Permissions: Action Groups, choose read, delete, index, and manage.
  6. Choose Save Role Definition.
  7. Choose Security, Internal User Database, and Create a New User.
  8. For Open Distro Security Roles, choose Kibana_only_role (created earlier).
  9. Choose Submit.

The user kibanauser01 now has full access to Kibana and access-logs indexes. You can log in to Kibana with this user and create the visuals and dashboards.

Building dashboards

You can use Kibana to build interactive visuals and analyze the trends and combine the visuals for different use cases in a dashboard. For example, you may want to see the number of requests made to the buckets in the last two days.

  1. Log in to Kibana using kibanauser01.
  2. Create an index pattern and set the time range
  3. On the Visualize section of your Kibana dashboard, add a new visualization.
  4. Choose Vertical Bar.

You can select any time range and visual based on your requirements.

  1. Choose the index pattern and then configure your graph options.
  2. In the Metrics pane, expand Y-Axis.
  3. For Aggregation, choose Count.
  4. For Custom Label, enter Request Count.
  5. Expand the X-Axis
  6. For Aggregation, choose Terms.
  7. For Field, choose bucket.
  8. For Order By, choose metric: Request Count.
  9. Choose Apply changes.
  10. Choose Add sub-bucket and expand the Split Series
  11. For Sub Aggregation, choose Date Histogram.
  12. For Field, choose requestdatetime.
  13. For Interval, choose Daily.
  14. Apply the changes by choosing the play icon at the top of the page.

You should see the visual on the right side, similar to the following screenshot.

You can combine graphs of different use cases into a dashboard. I have built some example graphs for general use cases like the number of operations per bucket, user action breakdown for buckets, HTTPS status rate, top users, and tabular formatted error details. See the following screenshots.

Cleaning up

Delete all the resources deployed through the CloudFormation template to avoid any unintended costs.

  1. Disable the access log on source bucket.
  2. On to the CloudFormation console, identify the stacks appropriately, and delete


This post detailed a solution to visualize and monitor Amazon S3 access logs using Amazon ES to ensure compliance, perform security audits, and discover risks and patterns at scale with minimal latency. To learn about best practices of Amazon ES, see Amazon Elasticsearch Service Best Practices. To learn how to analyze and create a dashboard of data stored in Amazon ES, see the AWS Security Blog.

About the Authors

Mahesh Goyal is a Data Architect in Big Data at AWS. He works with customers in their journey to the cloud with a focus on big data and data warehouses. In his spare time, Mahesh likes to listen to music and explore new food places with his family.





Power data analytics, monitoring, and search use cases with the Open Distro for Elasticsearch SQL Engine on Amazon ES

Post Syndicated from Viraj Phanse original https://aws.amazon.com/blogs/big-data/power-data-analytics-monitoring-and-search-use-cases-with-the-open-distro-for-elasticsearch-sql-engine-on-amazon-es/

Amazon Elasticsearch Service (Amazon ES) is a popular choice for log analytics, search, real-time application monitoring, clickstream analysis, and more. One commonality among these use cases is the need to write and run queries to obtain search results at lightning speed. However, doing so requires expertise in the JSON-based Elasticsearch query domain-specific language (Query DSL). Although Query DSL is powerful, it has a steep learning curve, and wasn’t designed as a human interface to easily create one-time queries and explore user data.

To solve this problem, we provided the Open Distro for Elasticsearch SQL Engine on Amazon ES, which we have been expanding since the initial release. The Structured Query Language (SQL) engine is powered by Open Distro for Elasticsearch, an Apache 2.0 licensed distribution of Elasticsearch. For more information about the Open Distro project, see Open Distro for Elasticsearch. For more information about the SQL engine capabilities, see SQL.

As part of this continued investment, we’re happy to announce new capabilities, including a Kibana-based SQL Workbench and a new SQL CLI that makes it even easier for Amazon ES users to use the Open Distro for Elasticsearch SQL Engine to work with their data.

SQL is the de facto standard for data and analytics and one of the most popular languages among data engineers and data analysts. Introducing SQL in Amazon ES allows you to manifest search results in a tabular format with documents represented as rows, fields as columns, and indexes as table names, respectively, in the WHERE clause. This acts as a straightforward and declarative way to represent complex DSL queries in a readable format. The newly added tools can act as a powerful yet simplified way to extract and analyze data, and can support complex analytics use cases.

Features overview

The following is a brief overview of the features of Open Distro for Elasticsearch SQL Engine on Amazon ES:

  • Query tools
    • SQL Workbench – A comprehensive and integrated visual tool to run on-demand SQL queries, translate SQL into its REST equivalent, and view and save results as text, JSON, JDBC, or CSV. The following screenshot shows a query on the SQL Workbench page.

  • SQL CLI – An interactive, standalone command line tool to run on-demand SQL queries, translate SQL into its REST equivalent, and view and save results as text, JSON, JDBC, or CSV. For following screenshot shows a query on the CLI.

  • Connectors and drivers
    • ODBC driver – The Open Database Connectivity (ODBC) driver enables connecting with business intelligence (BI) applications such as Tableau and exporting data to CSV and JSON.
    • JDBC driver – The Java Database Connectivity (JDBC) driver also allows you to connect with BI applications such as Tableau and export data to CSV and JSON.
  • Query support
    • Basic queries – You can use the SELECT clause, along with FROM, WHERE, GROUP BY, HAVING, ORDER BY, and LIMIT to search and aggregate data.
    • Complex queries – You can perform complex queries such as subquery, join, and union on more than one Elasticsearch index.
    • Metadata queries – You can query basic metadata about Elasticsearch indexes using the SHOW and DESCRIBE commands.
  • Delete support
    • Delete – You can delete all the documents or documents that satisfy predicates in the WHERE clause from search results. However, it doesn’t delete documents from the actual Elasticsearch index.
  • JSON and full-text search support
    • JSON – Support for JSON by following PartiQL specification, a SQL-compatible query language, lets you query semi-structured and nested data for any data format.
    • Full-text search support – Full-text search on millions of documents is possible by letting you specify the full range of search options using SQL commands such as match and score.
  • Functions and operator support
    • Functions and operators – Support for string functions and operators, numeric functions and operators, and date-time functions is possible by enabling fielddata in the document mapping.
  • Settings
    • Settings – You can view, configure, and modify settings to control the behavior of SQL without needing to restart or bounce the Elasticsearch cluster.
  • Interfaces
    • Endpoints – The explain endpoint allows translating SQL into Query DSL, and the cursor helps obtain a paginated response for the SQL query result.
  • Monitoring
    • Monitoring – You can obtain node-level statistics by using the stats endpoint.
  • Request and response protocols


Open Distro for Elasticsearch SQL Engine on Amazon ES provides a comprehensive, flexible, and user-friendly set of features to obtain search results from Amazon ES in a declarative manner using SQL. For more information about querying with SQL, see SQL Support for Amazon Elasticsearch Service.


About the Author

Viraj Phanse (@vrphanse) is a product management leader at Amazon Web Services for Search Services/Analytics. Prior to AWS, he was in product management/strategy and go-to-market leadership roles at Oracle, Aerospike, INSZoom and Persistent Systems. He is a Fellow and Selection Committee member at Berkeley Angel Network, and a Big Data Advisory Board Member at San Francisco State University. He has completed his M.S. in Computer Science from UCLA and MBA from UC Berkeley’s Haas School of Business.



Creating customized Vega visualizations in Amazon Elasticsearch Service

Post Syndicated from Markus Bestehorn original https://aws.amazon.com/blogs/big-data/creating-customized-vega-visualizations-in-amazon-elasticsearch-service/

Computers can easily process vast amounts of data in their raw format, such as databases or binary files, but humans require visualizations to be able to derive facts from data. The plethora of tools and services such as Kibana (as part of Amazon ES) or Amazon Quicksight to design visualizations from a data source are a testimony to this need.

Such tools often provide out-of-the-box templates for designing simple graphs from appropriately pre-processed data, but applying these to production-grade, complex visualizations can be challenging for several reasons:

  • The raw data upon which a visualization is built may contain encoded attributes that aren’t understandable for the viewer. For example, the following layered bar chart could be raw data about people where the gender is encoded in numbers, but the visualization should still have human-readable attribute values.
  • Visualizations are often built on aggregated summaries of raw data. Storing and maintaining different aggregations for each visualization may be unfeasible. For instance, if you classify raw data items in multiple dimensions (for example, classifying cars by color or engine size) and build visualizations on these different dimensions, you need to store different aggregated materialized views of the same data.
  • Different data views used in a single visualization may require an ad hoc computation over the underlying data to generate an appropriate foundational data source.

This post shows how to implement Vega visualizations included in Kibana, which is part of Amazon Elasticsearch Service (Amazon ES), using a real-world clickstream data sample. Vega visualizations are an integrated scripting mechanism of Kibana to perform on-the-fly computations on raw data to generate D3.js visualizations. For this post, we use a fully automated setup using AWS CloudFormation to show how to build a customized histogram for a web analytics use case. This example implements an ad hoc map-reduce like aggregation of the underlying data for a histogram.

Use case

For this post, we use Online Shopping Store – Web Server Logs published by Harvard Dataverse. The 3.3 GB dataset contains 10,365,152 access logs of an online shopping store. For example, see the following data sample: - - [22/Jan/2019:03:56:19 +0330] "GET /product/14926 HTTP/1.1" 404 33617 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"

Each entry contains a source IP address (for the preceding data sample,, the HTTP request, and return code (“GET /product/14926 HTTP/1.1” 404), the size of the response in bytes (33617) and additional metadata, such a timestamp. For this use case, we assume the role of a web server administrator who wants to visualize the response message size in bytes for the traffic over a specific time period. We want to generate a histogram of the response message sizes over a given period that looks like the following screenshot.

In the preceding screenshot, the majority of the response sizes are less than 1 MB. Based on this visualization, the administrator of the online shop could identify the following:

  • Requests that result in high bandwidth use
  • Distributed denial of service (DDoS) attacks caused by repeatedly requesting pages, which causes a high amount of response traffic

The built-in Vertical / Horizontal Bar visualization options in Kibana can’t produce this histogram over this Elasticsearch index without storing the raw data. In addition to this, these visualizations should be able to instruct complex transformations and aggregations over the data, so you can generate the preceding histogram. Storing the data in a histogram-friendly way in Amazon ES and building a visualization with Vertical / Horizontal Bar components requires ETL (Extract Transform Load) of the data to a different index. Such an ETL creates unnecessary storage and compute costs. To avoid these costs and complex workflows, Vega provides a flexible approach that can execute such transformations on the fly on the server log data.

We have built some of the necessary transformations outside of our Vega code to keep the code presented in this post concise. This was done to improve the readability of this post; the complete transformation could have also been done in Vega.

Overview of solution

To avoid unnecessary costs and focus on Vega visualization creation task in this post, we use an AWS Lambda function to stream the access logs out of an Amazon Simple Storage Service (Amazon S3) bucket instead of serving them from a web server (this Lambda function contains some of the transformations that have been removed to improve the readability of this post). In a production deployment, you replace the function and S3 bucket with a real web server and the Amazon Kinesis Agent, but the rest of the architecture remains unchanged. AWS CloudFormation deploys the following architecture into your AWS account.

The Lambda function reads and transforms the access logs out of an S3 bucket to stream them into Amazon Kinesis Data Firehose, which forwards the data to Amazon ES. Kinesis Data Firehose is the easiest way to reliably load streaming data into data lakes, data stores, and analytics tools, and therefore it is suitable for this task. The available data will be automatically transferred to Amazon ES after the deployment.

Prerequisites and deploying your CloudFormation template

To follow the content of this post, including a step-by-step walkthrough to build a customized histogram with Vega visualizations on Amazon ES, you need an AWS account and access rights to deploy a CloudFormation template. The Elasticsearch cluster and other resources the template creates result in charges to your AWS account. During a test-run in eu-west-1, we measured costs of $0.50 per hour. Complete the following steps:

  1. Depending on which Region you want to use, sign in to the AWS Management Console and choose one of the following Regions to deploy the necessary resources in your account:
  1. Keep all the default parameters as is.
  2. Select the check boxes, I acknowledge that AWS CloudFormation might create IAM resources with custom names and I acknowledge that AWS CloudFormation might require the following capability: CAPABILITY_AUTO_EXPAND.
  3. Choose Create Stack to start the deployment.

The deployment takes approximately 20–25 minutes and is finished when the status switches to CREATE_COMPLETE.

Connecting to the Kibana

After you deploy the stack, complete the following steps to log in to Kibana and build the visualization inside Kibana:

  1. On the Outputs tab of your CloudFormation stack, choose the URL for KibanaLoginURL.

This opens the Kibana login prompt.

  1. Unless you changed these at the start of the CloudFormation stack deployment process, the username and password are as follows:

Amazon Cognito requires you to change the password, which Kibana prompts you for.

  1. Enter a new password and a name for yourself.

Remember this password; there is no recovery option in this setup if you need to re-authenticate.

Upon completing these steps, you should arrive at the Kibana home screen (see the following screenshot).

You’re ready to build a Vega visualization in Kibana. We guide through this process in the following sections.

Creating an index pattern and exploring the data

Kibana visualizations are based on data stored in your Elasticsearch cluster, and the data is stored in an Elasticsearch index called vega-visu-blog-index. To create an index pattern, complete the following steps:

  1. On the Kibana home screen, choose Discover.
  2. For Index Pattern, enter “vega*”.
  3. Choose Next step.

  1. For Time Filter field name, choose timestamp.
  2. Choose Create index pattern.

After a few seconds, a summary page with details about the created index pattern appears.

The index pattern is now created and you can use it for queries and visualizations. When you choose Discover, you see a page that shows your index pattern. By default, Kibana displays data of the last 15 minutes, but given that the data is historical, you have to properly select the time range. For this use case, the data spans January 1, 2019 to February 1, 2019. After you configure the time range, choose Update.

In addition to a graphical representation of the tuple count over time, Kibana shows the raw data in the following format:

size_in_bytes: { "size": 13000, "freq": 11 }, …

The following screenshot shows both the visualization and raw data.

As noted before, we executed a pre-aggregation step with the data, which counted the number of requests in the log file with a given size. The message sizes were truncated into steps of 100 bytes. Therefore, the preceding screenshot indicates that there were 11 requests with 13,000–13,099 bytes in size during the minute after January 26, 2019, on 13:59. The next section shows how to create the aggregated histogram from this data using a Vega visualization involving a series of data transformations implicitly and on-the-fly, which Amazon ES runs without changing the underlying data stored in the index.

Vega on-the-fly data transformation

To properly compute the desired histogram, you need to transform the data. Because the format and temporal granularity of the data stored in the Elasticsearch index doesn’t match the format required by the visualization interface. For this use case, we use the scripted metric aggregation pattern, which calls stored scripts written in Painless to implement the following four steps:

  1. Variable initialization
  2. Mapping documents to a key-value structure for the histogram
  3. Grouping values into arrays by keys
  4. Aggregation of array content into the histogram

To implement the first step for initializing the variables used by the other steps, choose Dev Tools on the left side of the screen. The development console allows you to define scripts that are later called by the Vega visualization. The following code for the first step is very simple because it initializes an empty array variable “state.test”:

POST _scripts/initialize_variables
  "script": {
    "lang" : "painless",
    "source" : "state.test = [:]"

Enter the preceding code into the left-hand input section of the Kibana Dev Tools page, select the code, and choose the green triangular button. Kibana acknowledges the loading of the script in the output (see the following screenshot).

To improve the readability of the rest of this section, we will show the result of each step based on the following initial input JSON:

  “timestamp”: “2019/01/26 12:58:00”, 
  “size_in_bytes”: [ 
    {“size”: 100, “freq”: 369},
    {“size”: 200, “freq”: 62},
  “timestamp”: “2019/01/26 12:57:00”, 
  “size_in_bytes”: [ 
    {“size”: 100, “freq”: 386},
    {“size”: 200, “freq”: 60},

The next step of the transformation maps the request size in each document of the index to their appropriate count of occurrences as key-value pairs, i.e., with the example data above, the first size_in_bytes field is transformed into “100”: 369. As in the previous step, implement this part of the transformation by entering the following code into the Kibana Dev Tools page and choosing the green button to load the script.

POST _scripts/map_documents
  "script": {
    "lang": "painless",
    "source": """
      if (doc.containsKey(params.bucketField))
        for (int i=0; i<doc[params.bucketField].length; i++)
          def key = doc[params.bucketField][i];
          def value = doc[params.countField][i];
          state.test[(key).toString()] = value;

Using our example input data results for this step, the following output is computed:

{“100”: 369, “200”: 62, …}
{“100”: 386, “200”: 60, …}

As shown above, the script’s outputs are JSON documents. Because the desired histogram aggregates data over all documents, you need to combine or merge the data by key in the next step. Enter the following code:

POST _scripts/combine_documents
  "script": {
    "lang": "painless",
    "source": "return state.test"

With the example intermediate output from the previous step, the following intermediate output is computed:

  “100”: [369, 386],
  “200”: [62, 60], …

Finally, you can aggregate the count value arrays returned by the combine_documents script into one scalar value by each messagesize as implemented by the aggregate_histogram_buckets script. See the following code:

POST _scripts/aggregate_histogram_buckets
  "script": {
    "lang": "painless",
    "source": """
      Map result = [:];
      states.forEach(value ->
        value.forEach((k,v) ->
          {result.merge(k, v, (value1, value2) -> value1+value2)}
      return result.entrySet().stream().sorted(
        Comparator.comparingInt((entry) -> 
        entry -> [
          "messagesize": Integer.parseInt(entry.key),
          "messagecount": entry.value

The final output for our computation example has the following format:

  “messagesize”: 100,
  “messagecount”: 755,
  “messagesize”: 200,
  “messagesize”: 122

This concludes the implementation of the on-the-fly data transformation used for the Vega visualization of this post. The documents are transformed into buckets with two fields—messagesize and—messagecount – which contain the corresponding data for the histogram.

Creating a Vega-Lite visualization

The Vega visualization generates D3.js representations of the data using the on-the-fly transformation discussed earlier. You can create this histogram visualization using Vega or Vega-Lite visualization grammars, which are both supported by Kibana. Vega-Lite is sufficient to implement our use case. To make use of the transformation, create a new visualization in Kibana and choose Vega.

A Vega visualization is created by a JSON document that describes the content and transformations required to generate the visual output. In our use case, we use the vega-visu-blog-index index and the four transformation steps in a scripted metric aggregation operation to generate the data property which provides the main content suitable to visualize as a histogram. The second part of the JSON document specifies the graph type, axis labels, and binning to format the visualization as required for our use case. The full Vega-Lite visualization JSON is as follows:

  $schema: https://vega.github.io/schema/vega-lite/v2.json
  data: {
    name: table
    url: {
      index: vega-visu-blog-index
      %timefield%: timestamp
      %context%: true
      body: {
        aggs: {
          temporal_hist_agg: {
            scripted_metric: {
              init_script: {
                id: "initialize_variables"
              map_script: {
                id: map_documents
                params: {
                  bucketField: size_in_bytes.size
                  countField: size_in_bytes.freq
              combine_script: {
                id: "combine_documents"
              reduce_script: {
                id: "aggregate_histogram_buckets"
        size: 0
    format: {property: "aggregations.temporal_hist_agg.value"}
  mark: bar
  title: {
    text: Temporal Message Size Histogram
    frame: bounds
  encoding: {
    x: {
      field: messagesize
      type: ordinal
      axis: {
        title: "Message Size Bucket"
        format: "~s"
      bin: {
        binned: true,
        step: 10000
    y: {
      field: messagecount
      type: quantitative
      axis: {
        title: "Count"

Replace all text from the code pane of the Vega visualization designer with the preceding code and choose Apply changes. Kibana computes the visualization calling the four stored scripts (mentioned in the previous section) for on-the-fly data transformations and displays the desired histogram. Afterwards, you can use the visualization just like the other Kibana visualizations to create Kibana dashboards.

To use the visualization in dashboards, save it by choosing Save. Changing the parameters of the visualization automatically results in a re-computation, including the data transformation. You can test this by changing the time period.

Exploring and debugging scripted metric aggregation

The debugger included into most modern browsers allows you to view and test the transformation result of the scripted metric aggregation. To view the transformation result, open the developer tools in your browser, choose Console, and enter the following code:


The Vega Debugger shows the data as a tree, which you can easily explore.

The developer tools are useful when writing transformation scripts to test the functionality of the scripts and manually explore their output.

Cleaning up

To delete all resources and stop incurring costs to your AWS account, complete the following steps:

  1. On the AWS CloudFormation console, from the list of deployed stacks, choose vega-visu-blog.
  2. Choose Delete.

The process can take up to 15 minutes, and removes all the resources you deployed for following this post.


Although Kibana itself provides powerful built-in visualization methods, these approaches require the underlying data to have the right format. This creates issues when the same data is used for different visualizations or simply not available in an ideal format for your visualization. Because storing different aggregations or views of the same data isn’t a cost-effective approach, this post showed how to generate customized visualizations using Amazon ES, Kibana, and Vega visualizations with on-the-fly data transformations.


About the authors

Markus Bestehorn is a Principal Prototyping Engagement Manager at AWS. He is responsible for building business-critical prototypes with AWS customers, and is a specialist for IoT and machine learning. His “career” started as a 7-year-old when he got his hands on a computer with two 5.25” floppy disks, no hard disk, and no mouse, on which he started writing BASIC, and later C, as well as C++ programs. He holds a PhD in computer science and all currently available AWS certifications. When he’s not on the computer, he runs or climbs mountains.



Anil Sener is a Data Prototyping Architect at AWS. He builds prototypes on Big Data Analytics, Streaming, and Machine Learning, which accelerates the production journey on AWS for top EMEA customers. He has two Master Degrees in MIS and Data Science. He likes to read about history and philosophy in his free time.


Simplifying and modernizing home search at Compass with Amazon ES

Post Syndicated from Sakti Mishra original https://aws.amazon.com/blogs/big-data/simplifying-and-modernizing-home-search-at-compass-with-amazon-es/

Amazon Elasticsearch Service (Amazon ES) is a fully managed service that makes it easy for you to deploy, secure, and operate Elasticsearch in AWS at scale. It’s a widely popular service and different customers integrate it in their applications for different search use cases.

Compass rearchitected their search solution with AWS services, including Amazon ES, to deliver high-quality property searches and saved searches for their customers.

In this post, we learn how Compass’s search solution evolved, what challenges and benefits they found with different architectures, and how Amazon ES gives them a long-term scalable solution. We also see how Amazon Managed Streaming for Apache Kafka (Amazon MSK) helped create event-driven, real-time streaming capabilities of property listing data. You can apply this solution to similar use cases.

Overview of Amazon ES

Amazon ES makes it easy to deploy, operate, and scale Elasticsearch for log analytics, application monitoring, interactive search, and more. It’s a fully managed service that delivers the easy-to-use APIs and real-time capabilities of Elasticsearch along with the availability, scalability, and security required by real-world applications. It offers built-in integrations with other AWS services, including Amazon Kinesis, AWS Lambda, and Amazon CloudWatch, and third-party tools like Logstash and Kibana, so you can go from raw data to actionable insights quickly.

Amazon ES also has the following benefits:

  • Fully managed – Launch production-ready clusters in minutes. No more patching, versioning, and backups.
  • Access to all data – Capture, retain, correlate, and analyze your data all in one place.
  • Scalable – Resize your cluster with a few clicks or a single API call.
  • Secure – Deploy into your VPC and restrict access using security groups and AWS Identity and Access Management (IAM) policies.
  • Highly available – Replicate across Availability Zones, with monitoring and automated self-healing.
  • Tightly integrated – Seamless data ingestion, security, auditing, and orchestration.

Overview of Amazon MSK

Amazon MSK is a fully managed service that makes it easy for you to build and run applications that use Apache Kafka to process streaming data. Apache Kafka is an open-source platform for building real-time streaming data pipelines and applications. With Amazon MSK, you can use native Apache Kafka APIs to populate data lakes, stream changes to and from databases, and power machine learning and analytics applications.

Overview of Compass

Urban Compass, Inc. (Compass) operates as a global real estate technology company. The Company provides an online platform that supports buying, renting, and selling real estate assets.

In their own words, “Compass is building the first-of-its-kind modern real estate platform, pairing the industry’s top talent with technology to make the search and sell experience intelligent and seamless. Compass operates in over 24 markets, with over 2 billion in sales this year to date, with 2,300 employees and over 15,000 agents with a vision to find a place for everybody in the world.”

How Compass uses search to help customers

Search is one of Compass’s primary features, which enables its website visitors and agents to look for properties within the platform. The Compass platform has the following search components:

  • Search services – Uses Amazon ES extensively to power searches across hyperlocal real estate data (spanning thousands of attributes). This search component acquires data through Amazon MSK after it’s processed in Apache Spark and stored in Amazon Aurora PostgreSQL.
  • Agent and consumer search – A frontend built on top of search services, which works as an interface between agents, consumers, and the Compass search services. It’s built in React and enables you to seamlessly search real estate data and access hyper-local filters.
  • Saved search – As a consumer, when you run a search and save it, the saved search indexes are updated with your search parameters. When a new listing comes into the system (indexed through the listings Elasticsearch index), the Compass search identifies the saved searches that match the new listing using the percolate feature of Elasticsearch and notifies you of the new property listing.

Evolution of the consumer search

The following sections get into the details of the Compass primary search feature and how its architecture evolved over time, starting with its initial Apache Lucene architecture.

Previous architecture with Apache Lucene

Compass started implementing its search functionalities with direct integration with Lucene, where it was configured through manually provisioned AWS virtual machines and manual installation.

In the following architecture diagram, the MLS (Master Listing Service) system’s data gets pushed through a common extract, transform, and load (ETL) framework and is populated into the listing database. From there, the data gets pushed to the Lucene cluster to be queried by the frontend search REST APIs.

This architecture had different pain points that involved a lot of heavy lifting with Lucene, such as:

  • Manual maintenance and configuration required specialized knowledge
  • Challenges around disk forecasting and growth
  • Scalability and high performance achieved through manual sharding
  • Configuring new data fields was labor-intensive

New architecture with Amazon ES

To overcome these challenges, Compass started using Amazon ES.

The following architecture diagram shows the evolution of Compass’s system. Compass ran Amazon ES and Lucene in parallel, shadowing the traffic to verify and quality assurance the results in production. They ran the services side by side for two months, until they were confident that Amazon ES replicated the same results.

In addition, Compass added full text search (description search) immediately after switching to Amazon ES. As a result, listing and features are available faster to their customers, which allowed Compass to expand nationally and implement a localized search.

Enhanced architecture with Amazon MSK

Compass further enhanced its architecture with Amazon MSK, which enabled parallel processing by different teams that push transformed events to the Kafka cluster. The following diagram shows the enhanced architecture.

With the new architecture, Compass search saw some immediate wins:

  • Reduced maintenance costs because Amazon ES is a managed service and you don’t need to take on the overhead of cluster administration or management
  • Additional performance benefits because the index build time reduced from 8 hours to 1 hour
  • Ability to create or tear down clusters with just one click, which eased maintenance

Because of the Amazon ES user interface and monitoring capabilities, the Compass team could assess the usage pattern easily and perform capacity planning and cost prediction.

Implementing saved searches with the Elasticsearch percolator

The Elasticsearch percolator inverts the query-document usage pattern—you index queries instead of documents and query with documents against the indexed queries. The search results for a percolate query are the queries that match the document.

Compass used this feature to implement their saved search feature by indexing each user’s search queries. When a property listing arrives, it uses Amazon ES to retrieve the queries that match and notifies the respective customers.

The following diagram illustrates this workflow.

Before percolate, Compass reran saved search queries to inspect for new matches. Percolate allows them to notify users in a timely manner about changes in their searches. On average, Compass stores 250,000 search queries.

When new listings arrive, they submit an average of 5–10 bulk requests per minute, where each bulk request contains 1,000 documents. The maximum latency on the Amazon ES side varies from 750–2,500 milliseconds, with an 18-node cluster of m5.12xlarge instance types.

The following diagram shows the saved search architecture. Search criteria is saved in search indexes. When a new listing arrives through the MLS and Amazon MSK listing stream, it executes the percolate processor, which pushes a message to the Amazon MSK saved search match topic stream. Then it gets pulled by the saved search service, from which notifications are pushed to the end-user.

The architecture has the following benefits:

  • When you look for properties against a search criteria, it gets saved in Amazon ES. As agents add properties into the system that match the saved search criteria, you get notified that a new property has been added that matches your criteria.
  • Because of the percolate feature, you get notified as soon as a property is added into the system, which reduced the lag from 1 hour to 1 minute.
  • Before percolate, the Compass team had a batch job that pulled new property records against searches, but now they can use push-based notification.

Compass has a great growth path as their platform scales to more listings, agents, and users. They plan to develop the following features:

  • Real-time streaming
  • AI for search ranking
  • Personalized search through AI


In this post, we explained how Compass uses Amazon ES to bring their customers relevant results for their real estate needs. Whether you’re searching in real time for your next listing or using Compass’s saved search to monitor the market, Amazon ES delivers the results you need.

With the effort they saved from managing their Lucene infrastructure, Compass has focused on their business and engineering, which has opened up new opportunities for them.


About the Author

Sakti Mishra is a Data Lab Solutions Architect at AWS. He helps customers architect data analytics solutions, which gives them an accelerated path towards modernization initiatives.

Outside of work, Sakti enjoys learning new technologies, watching movies, and travel.