Tag Archives: Amazon Elasticsearch Service

Field Notes: Monitoring the Java Virtual Machine Garbage Collection on AWS Lambda

Post Syndicated from Steffen Grunwald original https://aws.amazon.com/blogs/architecture/field-notes-monitoring-the-java-virtual-machine-garbage-collection-on-aws-lambda/

When you want to optimize your Java application on AWS Lambda for performance and cost the general steps are: Build, measure, then optimize! To accomplish this, you need a solid monitoring mechanism. Amazon CloudWatch and AWS X-Ray are well suited for this task since they already provide lots of data about your AWS Lambda function. This includes overall memory consumption, initialization time, and duration of your invocations. To examine the Java Virtual Machine (JVM) memory you require garbage collection logs from your functions. Instances of an AWS Lambda function have a short lifecycle compared to a long-running Java application server. It can be challenging to process the logs from tens or hundreds of these instances.

In this post, you learn how to emit and collect data to monitor the JVM garbage collector activity. Having this data, you can visualize out-of-memory situations of your applications in a Kibana dashboard like in the following screenshot. You gain actionable insights into your application’s memory consumption on AWS Lambda for troubleshooting and optimization.

The lifecycle of a JVM application on AWS Lambda

Let’s first revisit the lifecycle of the AWS Lambda Java runtime and its JVM:

  1. A Lambda function is invoked.
  2. AWS Lambda launches an execution context. This is a temporary runtime environment based on the configuration settings you provide, like permissions, memory size, and environment variables.
  3. AWS Lambda creates a new log stream in Amazon CloudWatch Logs for each instance of the execution context.
  4. The execution context initializes the JVM and your handler’s code.

You typically see the initialization of a fresh execution context when a Lambda function is invoked for the first time, after it has been updated, or it scales up in response to more incoming events.

AWS Lambda maintains the execution context for some time in anticipation of another Lambda function invocation. In effect, the service freezes the execution context after a Lambda function completes. It thaws the execution context when the Lambda function is invoked again if AWS Lambda chooses to reuse it.

During invocations, the JVM also maintains garbage collection as usual. Outside of invocations, the JVM and its maintenance processes like garbage collection are also frozen.

Garbage collection and indicators for your application’s health

The purpose of JVM garbage collection is to clean up objects in the JVM heap, which is the space for an application’s objects. It finds objects which are unreachable and deletes them. This frees heap space for other objects.

You can make the JVM log garbage collection activities to get insights into the health of your application. One example for this is the free heap after each garbage collection. If this metric keeps shrinking, it is an indicator for a memory leak – eventually turning into an OutOfMemoryError. If there is not enough of free heap, the JVM might be too busy with garbage collection instead of running your application code. Otherwise, a heap that is too big does indicate that there’s potential to decrease the memory configuration of your AWS Lambda function. This keeps garbage collection pauses low and provides a consistent response time.

The garbage collection logging can be configured via an environment variable as part of the AWS Lambda function configuration. The environment variable JAVA_TOOL_OPTIONS is considered by both the Java 8 and 11 JVMs. You use it to pass options that you would usually add to the command line when launching the JVM. The options to configure garbage collection logging and the output is specific to the Java version.

Java 11 uses the Unified Logging System (JEP 158 and JEP 271) which has been introduced in Java 9. Logging can be configured with the environment variable:

JAVA_TOOL_OPTIONS=-Xlog:gc+metaspace,gc+heap,gc:stdout:time,tags

The Serial Garbage Collector will output the logs:

[<TIMESTAMP>][gc] GC(4) Pause Full (Allocation Failure) 9M->9M(11M) 3.941ms (D)
[<TIMESTAMP>][gc,heap] GC(3) DefNew: 3063K->234K(3072K) (A)
[<TIMESTAMP>][gc,heap] GC(3) Tenured: 6313K->9127K(9152K) (B)
[<TIMESTAMP>][gc,metaspace] GC(3) Metaspace: 762K->762K(52428K) (C)
[<TIMESTAMP>][gc] GC(3) Pause Young (Allocation Failure) 9M->9M(21M) 23.559ms (D)

Prior to Java 9, including Java 8, you configure the garbage collection logging as follows:

JAVA_TOOL_OPTIONS=-XX:+PrintGCDetails -XX:+PrintGCDateStamps

The Serial garbage collector output in Java 8 is structured differently:

<TIMESTAMP>: [GC (Allocation Failure)
    <TIMESTAMP>: [DefNew: 131042K->131042K(131072K), 0.0000216 secs] (A)
    <TIMESTAMP>: [Tenured: 235683K->291057K(291076K), 0.2213687 secs] (B)
    366725K->365266K(422148K), (D)
    [Metaspace: 3943K->3943K(1056768K)], (C)
    0.2215370 secs]
    [Times: user=0.04 sys=0.02, real=0.22 secs]
<TIMESTAMP>: [Full GC (Allocation Failure)
    <TIMESTAMP>: [Tenured: 297661K->36658K(297664K), 0.0434012 secs] (B)
    431575K->36658K(431616K), (D)
    [Metaspace: 3943K->3943K(1056768K)], 0.0434680 secs] (C)
    [Times: user=0.02 sys=0.00, real=0.05 secs]

Independent of the Java version, the garbage collection activities are logged to standard out (stdout) or standard error (stderr). Logs appear in the AWS Lambda function’s log stream of Amazon CloudWatch Logs. The log contains the size of memory used for:

  • A: the young generation
  • B: the old generation
  • C: the metaspace
  • D: the entire heap

The notation is before-gc -> after-gc (committed heap). Read the JVM Garbage Collection Tuning Guide for more details.

Visualizing the logs in Amazon Elasticsearch Service

It is hard to fully understand the garbage collection log by just reading it in Amazon CloudWatch Logs. You must visualize it to gain more insight. This section describes the solution to achieve this.

Solution Overview

Java Solution Overview

Amazon CloudWatch Logs have a feature to stream CloudWatch Logs data to Amazon Elasticsearch Service via an AWS Lambda function. The AWS Lambda function for log transformation is subscribed to the log group of your application’s AWS Lambda function. The subscription filters for a pattern that matches the one of the garbage collection log entries. The log transformation function processes the log messages and puts it to a search cluster. To make the data easy to digest for the search cluster, you add code to transform and convert the messages to JSON. Having the data in a search cluster, you can visualize it with Kibana dashboards.

Get Started

To start, launch the solution architecture described above as a prepackaged application from the AWS Serverless Application Repository. It contains all resources ready to visualize the garbage collection logs for your Java 11 AWS Lambda functions in a Kibana dashboard. The search cluster consists of a single t2.small.elasticsearch instance with 10GB of EBS storage. It is protected with Amazon Cognito User Pools so you only need to add your user(s). The T2 instance types do not support encryption of data at rest.

Read the source code for the application in the aws-samples repository.

1. Spin up the application from the AWS Serverless Application Repository:

launch stack button

2. As soon as the application is deployed completely, the outputs of the AWS CloudFormation stack provide the links for the next steps. You will find two URLs in the AWS CloudFormation console called createUserUrl and kibanaUrl.

search stack

3. Use the createUserUrl link from the outputs, or navigate to the Amazon Cognito user pool in the console to create a new user in the pool.

a. Enter an email address as username and email. Enter a temporary password of your choice with at least 8 characters.

b. Leave the phone number empty and uncheck the checkbox to mark the phone number as verified.

c. If necessary, you can check the checkboxes to send an invitation to the new user or to make the user verify the email address.

d. Choose Create user.

create user dialog of Amazon Cognito User Pools

4. Access the Kibana dashboard with the kibanaUrl link from the AWS CloudFormation stack outputs, or navigate to the Kibana link displayed in the Amazon Elasticsearch Service console.

a. In Kibana, choose the Dashboard icon in the left menu bar

b. Open the Lambda GC Activity dashboard.

You can test that new events appear by using the Kibana Developer Console:

POST gc-logs-2020.09.03/_doc
{
  "@timestamp": "2020-09-03T15:12:34.567+0000",
  "@gc_type": "Pause Young",
  "@gc_cause": "Allocation Failure",
  "@heap_before_gc": "2",
  "@heap_after_gc": "1",
  "@heap_size_gc": "9",
  "@gc_duration": "5.432",
  "@owner": "123456789012",
  "@log_group": "/aws/lambda/myfunction",
  "@log_stream": "2020/09/03/[$LATEST]123456"
}

5. When you go to the Lambda GC Activity dashboard you can see the new event. You must select the right timeframe with the Show dates link.

Lambda GC activity

The dashboard consists of six tiles:

  • In the Filters you optionally select the log group and filter for a specific AWS Lambda function execution context by the name of its log stream.
  • In the GC Activity Count by Execution Context you see a heatmap of all filtered execution contexts by garbage collection activity count.
  • The GC Activity Metrics display a graph for the metrics for all filtered execution contexts.
  • The GC Activity Count shows the amount of garbage collection activities that are currently displayed.
  • The GC Duration show the sum of the duration of all displayed garbage collection activities.
  • The GC Activity Raw Data at the bottom displays the raw items as ingested into the search cluster for a further drill down.

Configure your AWS Lambda function for garbage collection logging

1. The application that you want to monitor needs to log garbage collection activities. Currently the solution supports logs from Java 11. Add the following environment variable to your AWS Lambda function to activate the logging.

JAVA_TOOL_OPTIONS=-Xlog:gc:stderr:time,tags

The environment variables must reflect this parameter like the following screenshot:

environment variables

2. Go to the streamLogs function in the AWS Lambda console that has been created by the stack, and subscribe it to the log group of the function you want to monitor.

streamlogs function

3. Select Add Trigger.

4. Select CloudWatch Logs as Trigger Configuration.

5. Input a Filter name of your choice.

6. Input "[gc" (including quotes) as the Filter pattern to match all garbage collection log entries.

7. Select the Log Group of the function you want to monitor. The following screenshot subscribes to the logs of the application’s function resize-lambda-ResizeFn-[...].

add trigger

8. Select Add.

9. Execute the AWS Lambda function you want to monitor.

10. Refresh the dashboard in Amazon Elasticsearch Service and see the datapoint added manually before appearing in the graph.

Troubleshooting examples

Let’s look at an example function and draw some useful insights from the Java garbage collection log. The following diagrams show the Sample Amazon S3 function code for Java from the AWS Lambda documentation running in a Java 11 function with 512 MB of memory.

  • An S3 event from a new uploaded image triggers this function.
  • The function loads the image from S3, resizes it, and puts the resized version to S3.
  • The file size of the example image is close to 2.8MB.
  • The application is called 100 times with a pause of 1 second.

Memory leak

For the demonstration of a memory leak, the function has been changed to keep all source images in memory as a class variable. Hence the memory of the function keeps growing when processing more images:

GC activity metrics

In the diagram, the heap size drops to zero at timestamp 12:34:00. The Amazon CloudWatch Logs of the function reveal an error before the next call to your code in the same AWS Lambda execution context with a fresh JVM:

Java heap space: java.lang.OutOfMemoryError
java.lang.OutOfMemoryError: Java heap space
 at java.desktop/java.awt.image.DataBufferByte.<init>(Unknown Source)
[...]

The JVM crashed and was restarted because of the error. You leverage primarily the Amazon CloudWatch Logs of your function to detect errors. The garbage collection log and its visualization provide additional information for root cause analysis:

Did the JVM run out of memory because a single image to resize was too large?

Or was the memory issue growing over time?

The latter could be an indication that you have a memory leak in your code.

The Heap size is too small

For the demonstration of a heap that was chosen too small, the memory leak from the preceding image has been resolved, but the function was configured to 128MB of memory. From the baseline of the heap to the maximum heap size, there are only approximately 5 MB used.

GC activity metrics

This will result in a high management overhead of your JVM. You should experiment with a higher memory configuration to find the optimal performance also taking cost into account. Check out AWS Lambda power tuning open source tool to do this in an automated fashion.

Finetuning the initial heap size

If you review the development of the heap size at the start of an execution context, this indicates that the heap size is continuously increased. Each heap size change is an expensive operation consuming time of your function. Over time, the heap size is changed as well. The garbage collector logs 502 activities, which take almost 17 seconds overall.

GC activity metrics

This on-demand scaling is useful on a local workstation where the physical memory is shared with other applications. On AWS Lambda, the configured memory is dedicated to your function, so you can use it to its full extent.

You can do so by setting the minimum and maximum heap size to a fixed value by appending the -Xms and -Xmx parameters to the environment variable we introduced before.

The heap is not the only part of the JVM that consumes memory, so you must experiment with this setting and closely monitor the performance.

Start with the heap size that you observe to be working from the garbage collection log. If you set the heap size too large, your function will not initialize at all or break unexpectedly. Remember that the ability to tweak JVM parameters might change with future service features.

Let’s set 400 MB of the 512 MB memory and examine the results:

JAVA_TOOL_OPTIONS=-Xlog:gc:stderr:time,tags -Xms400m -Xmx400m

GC activity metrics

The preceding dashboard shows that the overall garbage collection duration was reduced by about 95%. The garbage collector had 80% fewer activities.

The garbage collection log entries displayed in the dashboard reveal that exclusively minor garbage collection (Pause Young) activities were triggered instead of major garbage collections (Pause Full). This is expected as the images are immediately discarded after the download, resize, upload operation. The effect on the overall function durations of 100 invocations, is a 5% decrease on average in this specific case.

Lambda duration

Cost estimation and clean up

Cost for the processing and transformation of your function’s Amazon CloudWatch Logs incurs when your function is called. This cost depends on your application and how often garbage collection activities are triggered. Read an estimate of the monthly cost for the search cluster. If you do not need the garbage collection monitoring anymore, delete the subscription filter from the log group of your AWS Lambda function(s). Also, delete the stack of the solution above in the AWS CloudFormation console to clean up resources.

Conclusion

In this post, we examined further sources of data to gain insights about the health of your Java application. We also demonstrated a pipeline to ingest, transform, and visualize this information continuously in a Kibana dashboard. As a next step, launch the application from the AWS Serverless Application Repository and subscribe it to your applications’ logs. Feel free to submit enhancements to the application in the aws-samples repository or provide feedback in the comments.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Analyzing Amazon S3 server access logs using Amazon ES

Post Syndicated from Mahesh Goyal original https://aws.amazon.com/blogs/big-data/analyzing-amazon-s3-server-access-logs-using-amazon-es/

When you use Amazon Simple Storage Service (Amazon S3) to store corporate data and host websites, you need additional logging to monitor access to your data and the performance of your application. An effective logging solution enhances security and improves the detection of security incidents. With the advent of increased data storage needs, you can rely on Amazon S3 for a range of use cases and simultaneously looking for ways to analyze your logs to ensure compliance, perform the audit, and discover risks.

Amazon S3 lets you monitor the traffic using the server access logging feature. With server access logging, you can capture and monitor the traffic to your S3 bucket at any time, with detailed information about the source of the request. The logs are stored in the S3 bucket you own in the same Region. This addresses the security and compliance requirements of most organizations. The logs are critical for establishing baselines, analyzing access patterns, and identifying trends. For example, the logs could answer a financial organization’s question about how many requests are made to a bucket and who is making what type of access requests to the objects.

You can discover insights from server access logs through several different methods. One common option is by using Amazon Athena or Amazon Redshift Spectrum and query the log files stored in Amazon S3. However, this solution poses high latency with an exponential growth in volume. It requires further integration with Amazon QuickSight to add visualization capabilities.

You can address this by using Amazon Elasticsearch Service (Amazon ES). Amazon ES is a managed service that makes it easier to deploy, operate, and scale Elasticsearch clusters in the AWS Cloud. Elasticsearch is a popular open-source search and analytics engine for use cases such as log analytics, real-time application monitoring, and clickstream analysis. The service provides support for open-source Elasticsearch APIs, managed Kibana, and integration with other AWS services such as Amazon S3 and Amazon Kinesis for loading streaming data into Amazon ES.

This post walks you through automating ingestion of server access logs from Amazon S3 into Amazon ES using AWS Lambda and visualizing the data in Kibana.

Architecture overview

Server access logging is enabled on source buckets, and logs are delivered to access log bucket. The access log bucket is configured to send an event to the Lambda function when a log file is created. On an event trigger, the Lambda function reads the file, processes the access log, and sends it to Amazon ES. When the logs are available, you can use Kibana to create interactive visuals and analyze the logs over a time period.

When designing a log analytics solution for high-frequency incoming data, you should consider buffering layers to avoid instability in the system. Buffering helps you streamline processes for unpredictable incoming log data. For such use cases, you can take advantage of managed services like Amazon Kinesis Data Streams, Amazon Kinesis Data Firehose, and Amazon Managed Streaming for Apache Kafka (Amazon MSK).

Streaming services buffer data before delivering it to Amazon ES. This helps you avoid overwhelming your cluster with spiky ingestion events. Kinesis Data Firehose can reliably load data into Amazon ES. Kinesis Data Firehose lets you choose a buffer size of 1–100 MiBs and a buffer interval of 60–900 seconds when Amazon ES is selected as the destination. Kinesis Data Firehose also scales automatically to match the throughput of your data and requires no ongoing administration. For more information, see Ingest streaming data into Amazon Elasticsearch Service within the privacy of your VPC with Amazon Kinesis Data Firehose.

The following diagram illustrates the solution architecture.

Prerequisites

Before creating resources in AWS CloudFormation, you must enable server access logging on the source bucket. Open the S3 bucket properties and look for Amazon S3 access and delivery bucket. See the following screenshot.

You also need an AWS Identity and Access Management (IAM) user with sufficient permissions to interact with the AWS Management Console and related AWS services. The user must have access to create IAM roles and policies via the CloudFormation template.

Setting up the resources with AWS CloudFormation

First, deploy the CloudFormation template to create the core components of the architecture. AWS CloudFormation automates the deployment of technology and infrastructure in a safe and repeatable manner across multiple Regions and multiple accounts with the least amount of effort and time.

  1. Sign in to the console and choose the Region of the bucket storing the access log. For this post, I use us-east-1.
  2. Launch the stack:
  3. Choose Next.
  4. For Stack name, enter a name.
  5. On the Parameters page, enter the following parameters:
    1. VPC Configuration – Select any VPC that has at least two private subnets. The template deploys the Amazon ES service domain and Lambda within the VPC.
    2. Private subnets – Select two private subnets of the VPC. The route tables associated with subnets must have a NAT gateway configuration and VPC endpoint for Amazon S3 to privately connect the bucket from Lambda.
    3. Access log S3 bucket – Enter the S3 bucket where access logs are delivered. The template configures event notification on the bucket to trigger the Lambda function.
    4. Amazon ES domain name – Specify the Amazon ES domain name to be deployed through the template.
  6. Choose Next.
  7. On the next page, choose Next.
  8. Acknowledge resource creation under Capabilities and transforms and choose Create.

The stack takes about 10–15 minutes to complete. The CloudFormation stack does the following:

  • Creates an Amazon ES domain with fine-grained access control enabled on it. Fine-grained access control is configured with a primary user in the internal user database.
  • Creates IAM role for the Lambda function with required permission to read from S3 bucket and write to Amazon ES.
  • Creates Lambda within the same VPC of Amazon ES elastic network interfaces (ENI). Amazon ES places an ENI in the VPC for each of your data nodes. The communication from Lambda to the Amazon ES domain is via this ENI.
  • Configures file create event notification on Access log S3 bucket to trigger the Lambda function. The function code segments are discussed in detail in this GitHub project.

You must make several considerations before you proceed with a production-grade deployment. For this post, I use one primary shard with no replicas. As a best practice, we recommend deploying your domain into three Availability Zones with at least two replicas. This configuration lets Amazon ES distribute replica shards to different Availability Zones than their corresponding primary shards and improves the availability of your domain. For more information about sizing your Amazon ES, see Get started with Amazon Elasticsearch Service: T-shirt-size your domain.

We recommend setting the shard count based on your estimated index size, using 50 GB as a maximum target shard size. You should also define an index template to set the primary and replica shard counts before index creation. For more information about best practices, see Best practices for configuring your Amazon Elasticsearch Service domain.

For high-frequency incoming data, you can rotate indexes either per day or per week depending on the size of data being generated. You can use Index State Management to define custom management policies to automate routine tasks and apply them to indexes and index patterns.

Creating the Kibana user

With Amazon ES, you can configure fine-grained users to control access to your data. Fine-grained access control adds multiple capabilities to give you tighter control over your data. This feature includes the ability to use roles to define granular permissions for indexes, documents, or fields and to extend Kibana with read-only views and secure multi-tenant support. For more information on granular access control, see Fine-Grained Access Control in Amazon Elasticsearch Service.

For this post, you create a fine-grained role for Kibana access and map it to a user.

  1. Navigate to Kibana and enter the primary user credentials:
    1. User nameadminuser01
    2. Password[email protected]

To access Kibana, you must have access to the VPC. For more information about accessing Kibana, see Controlling Access to Kibana.

  1. Choose Security, Roles.
  2. For Role name, enter kibana_only_role.
  3. For Cluster-wide permissions, choose cluster_composite_ops_ro.
  4. For Index patterns, enter access-log and kibana.
  5. For Permissions: Action Groups, choose read, delete, index, and manage.
  6. Choose Save Role Definition.
  7. Choose Security, Internal User Database, and Create a New User.
  8. For Open Distro Security Roles, choose Kibana_only_role (created earlier).
  9. Choose Submit.

The user kibanauser01 now has full access to Kibana and access-logs indexes. You can log in to Kibana with this user and create the visuals and dashboards.

Building dashboards

You can use Kibana to build interactive visuals and analyze the trends and combine the visuals for different use cases in a dashboard. For example, you may want to see the number of requests made to the buckets in the last two days.

  1. Log in to Kibana using kibanauser01.
  2. Create an index pattern and set the time range
  3. On the Visualize section of your Kibana dashboard, add a new visualization.
  4. Choose Vertical Bar.

You can select any time range and visual based on your requirements.

  1. Choose the index pattern and then configure your graph options.
  2. In the Metrics pane, expand Y-Axis.
  3. For Aggregation, choose Count.
  4. For Custom Label, enter Request Count.
  5. Expand the X-Axis
  6. For Aggregation, choose Terms.
  7. For Field, choose bucket.
  8. For Order By, choose metric: Request Count.
  9. Choose Apply changes.
  10. Choose Add sub-bucket and expand the Split Series
  11. For Sub Aggregation, choose Date Histogram.
  12. For Field, choose requestdatetime.
  13. For Interval, choose Daily.
  14. Apply the changes by choosing the play icon at the top of the page.

You should see the visual on the right side, similar to the following screenshot.

You can combine graphs of different use cases into a dashboard. I have built some example graphs for general use cases like the number of operations per bucket, user action breakdown for buckets, HTTPS status rate, top users, and tabular formatted error details. See the following screenshots.

Cleaning up

Delete all the resources deployed through the CloudFormation template to avoid any unintended costs.

  1. Disable the access log on source bucket.
  2. On to the CloudFormation console, identify the stacks appropriately, and delete

Summary

This post detailed a solution to visualize and monitor Amazon S3 access logs using Amazon ES to ensure compliance, perform security audits, and discover risks and patterns at scale with minimal latency. To learn about best practices of Amazon ES, see Amazon Elasticsearch Service Best Practices. To learn how to analyze and create a dashboard of data stored in Amazon ES, see the AWS Security Blog.


About the Authors

Mahesh Goyal is a Data Architect in Big Data at AWS. He works with customers in their journey to the cloud with a focus on big data and data warehouses. In his spare time, Mahesh likes to listen to music and explore new food places with his family.

 

 

 

 

Power data analytics, monitoring, and search use cases with the Open Distro for Elasticsearch SQL Engine on Amazon ES

Post Syndicated from Viraj Phanse original https://aws.amazon.com/blogs/big-data/power-data-analytics-monitoring-and-search-use-cases-with-the-open-distro-for-elasticsearch-sql-engine-on-amazon-es/

Amazon Elasticsearch Service (Amazon ES) is a popular choice for log analytics, search, real-time application monitoring, clickstream analysis, and more. One commonality among these use cases is the need to write and run queries to obtain search results at lightning speed. However, doing so requires expertise in the JSON-based Elasticsearch query domain-specific language (Query DSL). Although Query DSL is powerful, it has a steep learning curve, and wasn’t designed as a human interface to easily create one-time queries and explore user data.

To solve this problem, we provided the Open Distro for Elasticsearch SQL Engine on Amazon ES, which we have been expanding since the initial release. The Structured Query Language (SQL) engine is powered by Open Distro for Elasticsearch, an Apache 2.0 licensed distribution of Elasticsearch. For more information about the Open Distro project, see Open Distro for Elasticsearch. For more information about the SQL engine capabilities, see SQL.

As part of this continued investment, we’re happy to announce new capabilities, including a Kibana-based SQL Workbench and a new SQL CLI that makes it even easier for Amazon ES users to use the Open Distro for Elasticsearch SQL Engine to work with their data.

SQL is the de facto standard for data and analytics and one of the most popular languages among data engineers and data analysts. Introducing SQL in Amazon ES allows you to manifest search results in a tabular format with documents represented as rows, fields as columns, and indexes as table names, respectively, in the WHERE clause. This acts as a straightforward and declarative way to represent complex DSL queries in a readable format. The newly added tools can act as a powerful yet simplified way to extract and analyze data, and can support complex analytics use cases.

Features overview

The following is a brief overview of the features of Open Distro for Elasticsearch SQL Engine on Amazon ES:

  • Query tools
    • SQL Workbench – A comprehensive and integrated visual tool to run on-demand SQL queries, translate SQL into its REST equivalent, and view and save results as text, JSON, JDBC, or CSV. The following screenshot shows a query on the SQL Workbench page.

  • SQL CLI – An interactive, standalone command line tool to run on-demand SQL queries, translate SQL into its REST equivalent, and view and save results as text, JSON, JDBC, or CSV. For following screenshot shows a query on the CLI.

  • Connectors and drivers
    • ODBC driver – The Open Database Connectivity (ODBC) driver enables connecting with business intelligence (BI) applications such as Tableau and exporting data to CSV and JSON.
    • JDBC driver – The Java Database Connectivity (JDBC) driver also allows you to connect with BI applications such as Tableau and export data to CSV and JSON.
  • Query support
    • Basic queries – You can use the SELECT clause, along with FROM, WHERE, GROUP BY, HAVING, ORDER BY, and LIMIT to search and aggregate data.
    • Complex queries – You can perform complex queries such as subquery, join, and union on more than one Elasticsearch index.
    • Metadata queries – You can query basic metadata about Elasticsearch indexes using the SHOW and DESCRIBE commands.
  • Delete support
    • Delete – You can delete all the documents or documents that satisfy predicates in the WHERE clause from search results. However, it doesn’t delete documents from the actual Elasticsearch index.
  • JSON and full-text search support
    • JSON – Support for JSON by following PartiQL specification, a SQL-compatible query language, lets you query semi-structured and nested data for any data format.
    • Full-text search support – Full-text search on millions of documents is possible by letting you specify the full range of search options using SQL commands such as match and score.
  • Functions and operator support
    • Functions and operators – Support for string functions and operators, numeric functions and operators, and date-time functions is possible by enabling fielddata in the document mapping.
  • Settings
    • Settings – You can view, configure, and modify settings to control the behavior of SQL without needing to restart or bounce the Elasticsearch cluster.
  • Interfaces
    • Endpoints – The explain endpoint allows translating SQL into Query DSL, and the cursor helps obtain a paginated response for the SQL query result.
  • Monitoring
    • Monitoring – You can obtain node-level statistics by using the stats endpoint.
  • Request and response protocols

Conclusion

Open Distro for Elasticsearch SQL Engine on Amazon ES provides a comprehensive, flexible, and user-friendly set of features to obtain search results from Amazon ES in a declarative manner using SQL. For more information about querying with SQL, see SQL Support for Amazon Elasticsearch Service.

 


About the Author

Viraj Phanse (@vrphanse) is a product management leader at Amazon Web Services for Search Services/Analytics. Prior to AWS, he was in product management/strategy and go-to-market leadership roles at Oracle, Aerospike, INSZoom and Persistent Systems. He is a Fellow and Selection Committee member at Berkeley Angel Network, and a Big Data Advisory Board Member at San Francisco State University. He has completed his M.S. in Computer Science from UCLA and MBA from UC Berkeley’s Haas School of Business.

 

 

Creating customized Vega visualizations in Amazon Elasticsearch Service

Post Syndicated from Markus Bestehorn original https://aws.amazon.com/blogs/big-data/creating-customized-vega-visualizations-in-amazon-elasticsearch-service/

Computers can easily process vast amounts of data in their raw format, such as databases or binary files, but humans require visualizations to be able to derive facts from data. The plethora of tools and services such as Kibana (as part of Amazon ES) or Amazon Quicksight to design visualizations from a data source are a testimony to this need.

Such tools often provide out-of-the-box templates for designing simple graphs from appropriately pre-processed data, but applying these to production-grade, complex visualizations can be challenging for several reasons:

  • The raw data upon which a visualization is built may contain encoded attributes that aren’t understandable for the viewer. For example, the following layered bar chart could be raw data about people where the gender is encoded in numbers, but the visualization should still have human-readable attribute values.
  • Visualizations are often built on aggregated summaries of raw data. Storing and maintaining different aggregations for each visualization may be unfeasible. For instance, if you classify raw data items in multiple dimensions (for example, classifying cars by color or engine size) and build visualizations on these different dimensions, you need to store different aggregated materialized views of the same data.
  • Different data views used in a single visualization may require an ad hoc computation over the underlying data to generate an appropriate foundational data source.

This post shows how to implement Vega visualizations included in Kibana, which is part of Amazon Elasticsearch Service (Amazon ES), using a real-world clickstream data sample. Vega visualizations are an integrated scripting mechanism of Kibana to perform on-the-fly computations on raw data to generate D3.js visualizations. For this post, we use a fully automated setup using AWS CloudFormation to show how to build a customized histogram for a web analytics use case. This example implements an ad hoc map-reduce like aggregation of the underlying data for a histogram.

Use case

For this post, we use Online Shopping Store – Web Server Logs published by Harvard Dataverse. The 3.3 GB dataset contains 10,365,152 access logs of an online shopping store. For example, see the following data sample:

207.46.13.136 - - [22/Jan/2019:03:56:19 +0330] "GET /product/14926 HTTP/1.1" 404 33617 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"

Each entry contains a source IP address (for the preceding data sample, 207.46.13.136), the HTTP request, and return code (“GET /product/14926 HTTP/1.1” 404), the size of the response in bytes (33617) and additional metadata, such a timestamp. For this use case, we assume the role of a web server administrator who wants to visualize the response message size in bytes for the traffic over a specific time period. We want to generate a histogram of the response message sizes over a given period that looks like the following screenshot.

In the preceding screenshot, the majority of the response sizes are less than 1 MB. Based on this visualization, the administrator of the online shop could identify the following:

  • Requests that result in high bandwidth use
  • Distributed denial of service (DDoS) attacks caused by repeatedly requesting pages, which causes a high amount of response traffic

The built-in Vertical / Horizontal Bar visualization options in Kibana can’t produce this histogram over this Elasticsearch index without storing the raw data. In addition to this, these visualizations should be able to instruct complex transformations and aggregations over the data, so you can generate the preceding histogram. Storing the data in a histogram-friendly way in Amazon ES and building a visualization with Vertical / Horizontal Bar components requires ETL (Extract Transform Load) of the data to a different index. Such an ETL creates unnecessary storage and compute costs. To avoid these costs and complex workflows, Vega provides a flexible approach that can execute such transformations on the fly on the server log data.

We have built some of the necessary transformations outside of our Vega code to keep the code presented in this post concise. This was done to improve the readability of this post; the complete transformation could have also been done in Vega.

Overview of solution

To avoid unnecessary costs and focus on Vega visualization creation task in this post, we use an AWS Lambda function to stream the access logs out of an Amazon Simple Storage Service (Amazon S3) bucket instead of serving them from a web server (this Lambda function contains some of the transformations that have been removed to improve the readability of this post). In a production deployment, you replace the function and S3 bucket with a real web server and the Amazon Kinesis Agent, but the rest of the architecture remains unchanged. AWS CloudFormation deploys the following architecture into your AWS account.

The Lambda function reads and transforms the access logs out of an S3 bucket to stream them into Amazon Kinesis Data Firehose, which forwards the data to Amazon ES. Kinesis Data Firehose is the easiest way to reliably load streaming data into data lakes, data stores, and analytics tools, and therefore it is suitable for this task. The available data will be automatically transferred to Amazon ES after the deployment.

Prerequisites and deploying your CloudFormation template

To follow the content of this post, including a step-by-step walkthrough to build a customized histogram with Vega visualizations on Amazon ES, you need an AWS account and access rights to deploy a CloudFormation template. The Elasticsearch cluster and other resources the template creates result in charges to your AWS account. During a test-run in eu-west-1, we measured costs of $0.50 per hour. Complete the following steps:

  1. Depending on which Region you want to use, sign in to the AWS Management Console and choose one of the following Regions to deploy the necessary resources in your account:
  1. Keep all the default parameters as is.
  2. Select the check boxes, I acknowledge that AWS CloudFormation might create IAM resources with custom names and I acknowledge that AWS CloudFormation might require the following capability: CAPABILITY_AUTO_EXPAND.
  3. Choose Create Stack to start the deployment.

The deployment takes approximately 20–25 minutes and is finished when the status switches to CREATE_COMPLETE.

Connecting to the Kibana

After you deploy the stack, complete the following steps to log in to Kibana and build the visualization inside Kibana:

  1. On the Outputs tab of your CloudFormation stack, choose the URL for KibanaLoginURL.

This opens the Kibana login prompt.

  1. Unless you changed these at the start of the CloudFormation stack deployment process, the username and password are as follows:

Amazon Cognito requires you to change the password, which Kibana prompts you for.

  1. Enter a new password and a name for yourself.

Remember this password; there is no recovery option in this setup if you need to re-authenticate.

Upon completing these steps, you should arrive at the Kibana home screen (see the following screenshot).

You’re ready to build a Vega visualization in Kibana. We guide through this process in the following sections.

Creating an index pattern and exploring the data

Kibana visualizations are based on data stored in your Elasticsearch cluster, and the data is stored in an Elasticsearch index called vega-visu-blog-index. To create an index pattern, complete the following steps:

  1. On the Kibana home screen, choose Discover.
  2. For Index Pattern, enter “vega*”.
  3. Choose Next step.

  1. For Time Filter field name, choose timestamp.
  2. Choose Create index pattern.

After a few seconds, a summary page with details about the created index pattern appears.

The index pattern is now created and you can use it for queries and visualizations. When you choose Discover, you see a page that shows your index pattern. By default, Kibana displays data of the last 15 minutes, but given that the data is historical, you have to properly select the time range. For this use case, the data spans January 1, 2019 to February 1, 2019. After you configure the time range, choose Update.

In addition to a graphical representation of the tuple count over time, Kibana shows the raw data in the following format:

size_in_bytes: { "size": 13000, "freq": 11 }, …

The following screenshot shows both the visualization and raw data.

As noted before, we executed a pre-aggregation step with the data, which counted the number of requests in the log file with a given size. The message sizes were truncated into steps of 100 bytes. Therefore, the preceding screenshot indicates that there were 11 requests with 13,000–13,099 bytes in size during the minute after January 26, 2019, on 13:59. The next section shows how to create the aggregated histogram from this data using a Vega visualization involving a series of data transformations implicitly and on-the-fly, which Amazon ES runs without changing the underlying data stored in the index.

Vega on-the-fly data transformation

To properly compute the desired histogram, you need to transform the data. Because the format and temporal granularity of the data stored in the Elasticsearch index doesn’t match the format required by the visualization interface. For this use case, we use the scripted metric aggregation pattern, which calls stored scripts written in Painless to implement the following four steps:

  1. Variable initialization
  2. Mapping documents to a key-value structure for the histogram
  3. Grouping values into arrays by keys
  4. Aggregation of array content into the histogram

To implement the first step for initializing the variables used by the other steps, choose Dev Tools on the left side of the screen. The development console allows you to define scripts that are later called by the Vega visualization. The following code for the first step is very simple because it initializes an empty array variable “state.test”:

#init_script
POST _scripts/initialize_variables
{
  "script": {
    "lang" : "painless",
    "source" : "state.test = [:]"
  }
}

Enter the preceding code into the left-hand input section of the Kibana Dev Tools page, select the code, and choose the green triangular button. Kibana acknowledges the loading of the script in the output (see the following screenshot).

To improve the readability of the rest of this section, we will show the result of each step based on the following initial input JSON:

{
  “timestamp”: “2019/01/26 12:58:00”, 
  “size_in_bytes”: [ 
    {“size”: 100, “freq”: 369},
    {“size”: 200, “freq”: 62},
    … 
]}
{
  “timestamp”: “2019/01/26 12:57:00”, 
  “size_in_bytes”: [ 
    {“size”: 100, “freq”: 386},
    {“size”: 200, “freq”: 60},
    … 
]}

The next step of the transformation maps the request size in each document of the index to their appropriate count of occurrences as key-value pairs, i.e., with the example data above, the first size_in_bytes field is transformed into “100”: 369. As in the previous step, implement this part of the transformation by entering the following code into the Kibana Dev Tools page and choosing the green button to load the script.

#map_script
POST _scripts/map_documents
{
  "script": {
    "lang": "painless",
    "source": """
      if (doc.containsKey(params.bucketField))
        for (int i=0; i<doc[params.bucketField].length; i++)
        {
          def key = doc[params.bucketField][i];
          def value = doc[params.countField][i];
          state.test[(key).toString()] = value;
        }
        """
  }
}

Using our example input data results for this step, the following output is computed:

{“100”: 369, “200”: 62, …}
{“100”: 386, “200”: 60, …}

As shown above, the script’s outputs are JSON documents. Because the desired histogram aggregates data over all documents, you need to combine or merge the data by key in the next step. Enter the following code:

#combine_script
POST _scripts/combine_documents
{
  "script": {
    "lang": "painless",
    "source": "return state.test"
  }
}

With the example intermediate output from the previous step, the following intermediate output is computed:

{
  “100”: [369, 386],
  “200”: [62, 60], …
}

Finally, you can aggregate the count value arrays returned by the combine_documents script into one scalar value by each messagesize as implemented by the aggregate_histogram_buckets script. See the following code:

#reduce_script
POST _scripts/aggregate_histogram_buckets
{
  "script": {
    "lang": "painless",
    "source": """
      Map result = [:];
      states.forEach(value ->
        value.forEach((k,v) ->
          {result.merge(k, v, (value1, value2) -> value1+value2)}
        )
      );
      return result.entrySet().stream().sorted(
        Comparator.comparingInt((entry) -> 
        Integer.parseInt(entry.key))
      ).map(
        entry -> [
          "messagesize": Integer.parseInt(entry.key),
          "messagecount": entry.value
        ]).collect(Collectors.toList())
      """
  }
}

The final output for our computation example has the following format:

{
  “messagesize”: 100,
  “messagecount”: 755,
}
{
  “messagesize”: 200,
  “messagesize”: 122
}

This concludes the implementation of the on-the-fly data transformation used for the Vega visualization of this post. The documents are transformed into buckets with two fields—messagesize and—messagecount – which contain the corresponding data for the histogram.

Creating a Vega-Lite visualization

The Vega visualization generates D3.js representations of the data using the on-the-fly transformation discussed earlier. You can create this histogram visualization using Vega or Vega-Lite visualization grammars, which are both supported by Kibana. Vega-Lite is sufficient to implement our use case. To make use of the transformation, create a new visualization in Kibana and choose Vega.

A Vega visualization is created by a JSON document that describes the content and transformations required to generate the visual output. In our use case, we use the vega-visu-blog-index index and the four transformation steps in a scripted metric aggregation operation to generate the data property which provides the main content suitable to visualize as a histogram. The second part of the JSON document specifies the graph type, axis labels, and binning to format the visualization as required for our use case. The full Vega-Lite visualization JSON is as follows:

{
  $schema: https://vega.github.io/schema/vega-lite/v2.json
  data: {
    name: table
    url: {
      index: vega-visu-blog-index
      %timefield%: timestamp
      %context%: true
      body: {
        aggs: {
          temporal_hist_agg: {
            scripted_metric: {
              init_script: {
                id: "initialize_variables"
              }
              map_script: {
                id: map_documents
                params: {
                  bucketField: size_in_bytes.size
                  countField: size_in_bytes.freq
                }
              }
              combine_script: {
                id: "combine_documents"
                }
              reduce_script: {
                id: "aggregate_histogram_buckets"
              }
            }
          }
        }
        size: 0
      }
    }
    format: {property: "aggregations.temporal_hist_agg.value"}
  }
  mark: bar
  title: {
    text: Temporal Message Size Histogram
    frame: bounds
  },
  encoding: {
    x: {
      field: messagesize
      type: ordinal
      axis: {
        title: "Message Size Bucket"
        format: "~s"
      }
      bin: {
        binned: true,
        step: 10000
      }
    }
    y: {
      field: messagecount
      type: quantitative
      axis: {
        title: "Count"
        }
    }
  }
}

Replace all text from the code pane of the Vega visualization designer with the preceding code and choose Apply changes. Kibana computes the visualization calling the four stored scripts (mentioned in the previous section) for on-the-fly data transformations and displays the desired histogram. Afterwards, you can use the visualization just like the other Kibana visualizations to create Kibana dashboards.

To use the visualization in dashboards, save it by choosing Save. Changing the parameters of the visualization automatically results in a re-computation, including the data transformation. You can test this by changing the time period.

Exploring and debugging scripted metric aggregation

The debugger included into most modern browsers allows you to view and test the transformation result of the scripted metric aggregation. To view the transformation result, open the developer tools in your browser, choose Console, and enter the following code:

VEGA_DEBUG.view.data('table')

The Vega Debugger shows the data as a tree, which you can easily explore.

The developer tools are useful when writing transformation scripts to test the functionality of the scripts and manually explore their output.

Cleaning up

To delete all resources and stop incurring costs to your AWS account, complete the following steps:

  1. On the AWS CloudFormation console, from the list of deployed stacks, choose vega-visu-blog.
  2. Choose Delete.

The process can take up to 15 minutes, and removes all the resources you deployed for following this post.

Conclusion

Although Kibana itself provides powerful built-in visualization methods, these approaches require the underlying data to have the right format. This creates issues when the same data is used for different visualizations or simply not available in an ideal format for your visualization. Because storing different aggregations or views of the same data isn’t a cost-effective approach, this post showed how to generate customized visualizations using Amazon ES, Kibana, and Vega visualizations with on-the-fly data transformations.

 


About the authors

Markus Bestehorn is a Principal Prototyping Engagement Manager at AWS. He is responsible for building business-critical prototypes with AWS customers, and is a specialist for IoT and machine learning. His “career” started as a 7-year-old when he got his hands on a computer with two 5.25” floppy disks, no hard disk, and no mouse, on which he started writing BASIC, and later C, as well as C++ programs. He holds a PhD in computer science and all currently available AWS certifications. When he’s not on the computer, he runs or climbs mountains.

 

 

Anil Sener is a Data Prototyping Architect at AWS. He builds prototypes on Big Data Analytics, Streaming, and Machine Learning, which accelerates the production journey on AWS for top EMEA customers. He has two Master Degrees in MIS and Data Science. He likes to read about history and philosophy in his free time.

 

Simplifying and modernizing home search at Compass with Amazon ES

Post Syndicated from Sakti Mishra original https://aws.amazon.com/blogs/big-data/simplifying-and-modernizing-home-search-at-compass-with-amazon-es/

Amazon Elasticsearch Service (Amazon ES) is a fully managed service that makes it easy for you to deploy, secure, and operate Elasticsearch in AWS at scale. It’s a widely popular service and different customers integrate it in their applications for different search use cases.

Compass rearchitected their search solution with AWS services, including Amazon ES, to deliver high-quality property searches and saved searches for their customers.

In this post, we learn how Compass’s search solution evolved, what challenges and benefits they found with different architectures, and how Amazon ES gives them a long-term scalable solution. We also see how Amazon Managed Streaming for Apache Kafka (Amazon MSK) helped create event-driven, real-time streaming capabilities of property listing data. You can apply this solution to similar use cases.

Overview of Amazon ES

Amazon ES makes it easy to deploy, operate, and scale Elasticsearch for log analytics, application monitoring, interactive search, and more. It’s a fully managed service that delivers the easy-to-use APIs and real-time capabilities of Elasticsearch along with the availability, scalability, and security required by real-world applications. It offers built-in integrations with other AWS services, including Amazon Kinesis, AWS Lambda, and Amazon CloudWatch, and third-party tools like Logstash and Kibana, so you can go from raw data to actionable insights quickly.

Amazon ES also has the following benefits:

  • Fully managed – Launch production-ready clusters in minutes. No more patching, versioning, and backups.
  • Access to all data – Capture, retain, correlate, and analyze your data all in one place.
  • Scalable – Resize your cluster with a few clicks or a single API call.
  • Secure – Deploy into your VPC and restrict access using security groups and AWS Identity and Access Management (IAM) policies.
  • Highly available – Replicate across Availability Zones, with monitoring and automated self-healing.
  • Tightly integrated – Seamless data ingestion, security, auditing, and orchestration.

Overview of Amazon MSK

Amazon MSK is a fully managed service that makes it easy for you to build and run applications that use Apache Kafka to process streaming data. Apache Kafka is an open-source platform for building real-time streaming data pipelines and applications. With Amazon MSK, you can use native Apache Kafka APIs to populate data lakes, stream changes to and from databases, and power machine learning and analytics applications.

Overview of Compass

Urban Compass, Inc. (Compass) operates as a global real estate technology company. The Company provides an online platform that supports buying, renting, and selling real estate assets.

In their own words, “Compass is building the first-of-its-kind modern real estate platform, pairing the industry’s top talent with technology to make the search and sell experience intelligent and seamless. Compass operates in over 24 markets, with over 2 billion in sales this year to date, with 2,300 employees and over 15,000 agents with a vision to find a place for everybody in the world.”

How Compass uses search to help customers

Search is one of Compass’s primary features, which enables its website visitors and agents to look for properties within the platform. The Compass platform has the following search components:

  • Search services – Uses Amazon ES extensively to power searches across hyperlocal real estate data (spanning thousands of attributes). This search component acquires data through Amazon MSK after it’s processed in Apache Spark and stored in Amazon Aurora PostgreSQL.
  • Agent and consumer search – A frontend built on top of search services, which works as an interface between agents, consumers, and the Compass search services. It’s built in React and enables you to seamlessly search real estate data and access hyper-local filters.
  • Saved search – As a consumer, when you run a search and save it, the saved search indexes are updated with your search parameters. When a new listing comes into the system (indexed through the listings Elasticsearch index), the Compass search identifies the saved searches that match the new listing using the percolate feature of Elasticsearch and notifies you of the new property listing.

Evolution of the consumer search

The following sections get into the details of the Compass primary search feature and how its architecture evolved over time, starting with its initial Apache Lucene architecture.

Previous architecture with Apache Lucene

Compass started implementing its search functionalities with direct integration with Lucene, where it was configured through manually provisioned AWS virtual machines and manual installation.

In the following architecture diagram, the MLS (Master Listing Service) system’s data gets pushed through a common extract, transform, and load (ETL) framework and is populated into the listing database. From there, the data gets pushed to the Lucene cluster to be queried by the frontend search REST APIs.

This architecture had different pain points that involved a lot of heavy lifting with Lucene, such as:

  • Manual maintenance and configuration required specialized knowledge
  • Challenges around disk forecasting and growth
  • Scalability and high performance achieved through manual sharding
  • Configuring new data fields was labor-intensive

New architecture with Amazon ES

To overcome these challenges, Compass started using Amazon ES.

The following architecture diagram shows the evolution of Compass’s system. Compass ran Amazon ES and Lucene in parallel, shadowing the traffic to verify and quality assurance the results in production. They ran the services side by side for two months, until they were confident that Amazon ES replicated the same results.

In addition, Compass added full text search (description search) immediately after switching to Amazon ES. As a result, listing and features are available faster to their customers, which allowed Compass to expand nationally and implement a localized search.

Enhanced architecture with Amazon MSK

Compass further enhanced its architecture with Amazon MSK, which enabled parallel processing by different teams that push transformed events to the Kafka cluster. The following diagram shows the enhanced architecture.

With the new architecture, Compass search saw some immediate wins:

  • Reduced maintenance costs because Amazon ES is a managed service and you don’t need to take on the overhead of cluster administration or management
  • Additional performance benefits because the index build time reduced from 8 hours to 1 hour
  • Ability to create or tear down clusters with just one click, which eased maintenance

Because of the Amazon ES user interface and monitoring capabilities, the Compass team could assess the usage pattern easily and perform capacity planning and cost prediction.

Implementing saved searches with the Elasticsearch percolator

The Elasticsearch percolator inverts the query-document usage pattern—you index queries instead of documents and query with documents against the indexed queries. The search results for a percolate query are the queries that match the document.

Compass used this feature to implement their saved search feature by indexing each user’s search queries. When a property listing arrives, it uses Amazon ES to retrieve the queries that match and notifies the respective customers.

The following diagram illustrates this workflow.

Before percolate, Compass reran saved search queries to inspect for new matches. Percolate allows them to notify users in a timely manner about changes in their searches. On average, Compass stores 250,000 search queries.

When new listings arrive, they submit an average of 5–10 bulk requests per minute, where each bulk request contains 1,000 documents. The maximum latency on the Amazon ES side varies from 750–2,500 milliseconds, with an 18-node cluster of m5.12xlarge instance types.

The following diagram shows the saved search architecture. Search criteria is saved in search indexes. When a new listing arrives through the MLS and Amazon MSK listing stream, it executes the percolate processor, which pushes a message to the Amazon MSK saved search match topic stream. Then it gets pulled by the saved search service, from which notifications are pushed to the end-user.

The architecture has the following benefits:

  • When you look for properties against a search criteria, it gets saved in Amazon ES. As agents add properties into the system that match the saved search criteria, you get notified that a new property has been added that matches your criteria.
  • Because of the percolate feature, you get notified as soon as a property is added into the system, which reduced the lag from 1 hour to 1 minute.
  • Before percolate, the Compass team had a batch job that pulled new property records against searches, but now they can use push-based notification.

Compass has a great growth path as their platform scales to more listings, agents, and users. They plan to develop the following features:

  • Real-time streaming
  • AI for search ranking
  • Personalized search through AI

Conclusion

In this post, we explained how Compass uses Amazon ES to bring their customers relevant results for their real estate needs. Whether you’re searching in real time for your next listing or using Compass’s saved search to monitor the market, Amazon ES delivers the results you need.

With the effort they saved from managing their Lucene infrastructure, Compass has focused on their business and engineering, which has opened up new opportunities for them.

 


About the Author

Sakti Mishra is a Data Lab Solutions Architect at AWS. He helps customers architect data analytics solutions, which gives them an accelerated path towards modernization initiatives.

Outside of work, Sakti enjoys learning new technologies, watching movies, and travel.

Using Random Cut Forests for real-time anomaly detection in Amazon Elasticsearch Service

Post Syndicated from Chris Swierczewski original https://aws.amazon.com/blogs/big-data/using-random-cut-forests-for-real-time-anomaly-detection-in-amazon-elasticsearch-service/

Anomaly detection is a rich field of machine learning. Many mathematical and statistical techniques have been used to discover outliers in data, and as a result, many algorithms have been developed for performing anomaly detection in a computational setting. In this post, we take a close look at the output and accuracy of the anomaly detection feature available in Amazon Elasticsearch Service and Open Distro for Elasticsearch, and provide insight as to why we chose Random Cut Forests (RCF) as the core anomaly detection algorithm. In particular, we:

  • Discuss the goals of anomaly detection
  • Share how to use the RCF algorithm to detect anomalies and why we chose RCF for this tool
  • Interpret the output of anomaly detection for Elasticsearch
  • Compare the results of the anomaly detector to commonly used methods

What is anomaly detection?

Human beings have excellent intuition and can detect when something is out of order. Often, an anomaly or outlier can appear so obvious you just “know it when you see it.” However, you can’t base computational approaches to anomaly detection on such intuition; you must found them on mathematical definitions of anomalies.

The mathematical definition of an anomaly is varied and typically addresses the notion of separation from normal observation. This separation can manifest in several ways via multiple definitions. One common definition is “a data point lying in a low-density region.” As you track a data source, such as total bytes transferred from a particular IP address, number of logins on a given website, or number of sales per minute of a particular product, the raw values describe some probability or density distribution. A high-density region in this value distribution is an area of the domain where a data point is highly likely to exist. A low-density region is where data tends not to appear. For more information, see Anomaly detection: A survey.

For example, the following image shows two-dimensional data with a contour map indicating the density of the data in that region.

The data point in the bottom-right corner of the image occurs in a low-density region and, therefore, is considered anomalous. This doesn’t necessarily mean that an anomaly is something bad. Rather, under this definition, you can describe an anomaly as behavior that rarely occurs or is outside the normal scope of behavior.

Random Cut Forests and anomaly thresholding

The algorithmic core of the anomaly detection feature consists of two main components:

  • A RCF model for estimating the density of an input data stream
  • A thresholding model for determining if a point should be labeled as anomalous

You can use the RCF algorithm to summarize a data stream, including efficiently estimating its data density, and convert the data into anomaly scores. Anomaly scores are positive real numbers such that the larger the number, the more anomalous the data point. For more information, see Real Time Anomaly Detection in Open Distro for Elasticsearch.

We chose RCF for this plugin for several reasons:

  • Streaming context – Elasticsearch feature queries are streaming, in that the anomaly detector only receives each new feature aggregate one at a time.
  • Expensive queries – Especially on a large cluster, each feature query may be costly in CPU and memory resources. This limits the amount of historical data we can obtain for model training and initialization.
  • Customer hardware – Our anomaly detection plugin runs on the same hardware as our customers’ Elasticsearch cluster. Therefore, we must be mindful of our plugin’s CPU and memory impact.
  • Scalable – It is preferred if you can distribute the work required to determine anomalous data across the nodes in the cluster.
  • Unlabeled data – Even for training purposes, we don’t have access to labeled data. Therefore, the algorithm must be unsupervised.

Based on these constraints and performance results from internal and publicly available benchmarks across many data domains, we chose the RCF algorithm for computing anomaly scores in data streams.

But this begs the question: How large of an anomaly score is large enough to declare the corresponding data point as an anomaly? The anomaly detector uses a thresholding model to answer this question. This thresholding model combines information from the anomaly scores observed thus far and certain mathematical properties of RCFs. This hybrid information approach allows the model to make anomaly predictions with a low false positive rate when relatively little data has been observed, and effectively adapts to the data in the long run. The model constructs an efficient sketch of the anomaly score distribution using the KLL Quantile Sketch algorithm. For more information, see Optimal Quantile Approximation in Streams.

Understanding the output

The anomaly detector outputs two values: an anomaly grade and a confidence score. The anomaly grade is a measurement of the severity of an anomaly on a scale from zero to one. A zero anomaly grade indicates that the corresponding data point is normal. Any non-zero grade means that the anomaly score output by RCF exceeds the calculated score threshold, and therefore indicates the presence of an anomaly. Using the mathematical definition introduced at the beginning of this post, the grade is inversely related to the anomaly’s density; that is, the rarer the event, the higher the corresponding anomaly grade.

The confidence score is a measurement of the probability that the anomaly detection model correctly reports an anomaly within the algorithm’s inherent error bounds. We derive the model confidence from three sources:

  • A statistical measurement of whether the RCF model has observed enough data. As the RCF model observes more data, this source of confidence approaches 100%.
  • A confidence upper bound comes from the approximation made by the distribution sketch in the thresholding model. The KLL algorithm can only predict the score threshold within particular error bounds with a certain probability.
  • A confidence measurement attributed to each node of the Elasticsearch cluster. If a node is lost, the corresponding model data is also lost, which leads to a temporary confidence drop.

The NYC Taxi dataset

We demonstrate the effectiveness of the new anomaly detector feature on the New York City taxi passenger dataset. This data contains 6 months of taxi ridership volume from New York City aggregated into 30-minute windows. Thankfully, the dataset comes with labels in the form of anomaly windows, which indicate a period of time when an anomalous event is known to occur. Example known events in this dataset are the New York City marathon, where taxi ridership uncharacteristically spiked shortly after the event ended, and the January 2015 North American blizzard, when at one point the city ordered all non-essential vehicles off the streets, which resulted in a significant drop in taxi ridership.

We compare the results of our anomaly detector to two common approaches for detecting anomalies: a rules-based approach and the Gaussian distribution method.

Rules-based approach

In a rules-based approach, you mark a data point as anomalous if it exceeds a preset, human-specified boundary. This approach requires significant domain knowledge of the incoming data and can break down if the data has any upward or downward trends.

The following graph is a plot of the NYC taxi dataset with known anomalous event periods indicated by shaded regions. The model’s anomaly detection output is shown in red below the taxi ridership values. A set of human labelers received the first month of data (1,500 data points) to define anomaly detection rules. Half of the participants responded by stating they didn’t have sufficient information to confidently define such rules. From the remaining responses, the consolidated rules for anomalies are either that the value is equal to or greater than 30,000, or the value is below 20,000 for 150 points (about three days).

In this use case, the human annotators do a good enough job in this particular range of data. However, this approach to anomaly detection doesn’t scale well and may require a large amount of training data before a human can set reasonable thresholds that don’t suffer from a high false positive or high false negative rate. Additionally, as mentioned earlier, if this data develops an upward or downward trend, our team of annotators needs to revisit these constant-value thresholds.

Gaussian distribution method

A second common approach is to fit a Gaussian distribution to the data and define an anomaly as any value that is three standard deviations away from the mean. To improve the model’s ability to adapt to new information, the distribution is typically fit on a sliding window of the observations. Here, we determine the mean and standard deviation from the 1,500 most recent data points and use these to make predictions on the current value. See the following graph.

The Gaussian model detects the clear ridership spike at the marathon but isn’t robust enough to capture the other anomalies. In general, such a model can’t capture certain kinds of temporal anomalies where, for example, there’s a sudden spike in the data or other change in behavior that is still within the normal range of values.

Anomaly detection tool

Finally, we look at the anomaly detection results from the anomaly detection tool. The taxi data is streamed into an RCF model that estimates the density of the data in real time. The RCF sends these anomaly scores to the thresholding model, which decides whether the corresponding data point is anomalous. If so, the model reports the severity of the anomaly in the anomaly grade. See the following graph.

Five out of seven of the known anomalous events are successfully detected with zero false positives. Furthermore, with our definition of anomaly grade, we can indicate which anomalies are more severe than others. For example, the NYC Marathon spike is much more severe than those of Labor Day and New Year’s Eve. Based on the definition of an anomaly in terms of data density, the behavior observed at the NYC Marathon lives in a very low-density region, whereas, by the time we see the New Year’s Eve spike, this kind of behavior is still rare but not as rare anymore.

Summary

In this post, you learned about the goals of anomaly detection and explored the details of the model and output of the anomaly detection feature, now available in Amazon ES and Open Distro for Elasticsearch. We also compared the results of the anomaly detection tool to two common models and observed considerable performance improvement.


About the Authors

Chris Swierczewski is an applied scientist in Amazon AI. He enjoys hiking and backpacking with his family.

 

 

 

 

 

Lai Jiang is a software engineer working on machine learning and Elasticsearch at Amazon Web Services. His primary interests are algorithms and math. He is an active contributor to Open Distro for Elasticsearch.

 

 

 

Moving to managed: The case for the Amazon Elasticsearch Service

Post Syndicated from Kevin Fallis original https://aws.amazon.com/blogs/big-data/moving-to-managed-the-case-for-amazon-elasticsearch-service/

Prior to joining AWS, I led a development team that built mobile advertising solutions with Elasticsearch. Elasticsearch is a popular open-source search and analytics engine for log analytics, real-time application monitoring, clickstream analysis, and (of course) search. The platform I was responsible for was essential to driving my company’s business.

My team ran a self-managed implementation of Elasticsearch on AWS. At the time, a managed Elasticsearch offering wasn’t available. We had to build the scripting and tooling to deploy our Elasticsearch clusters across three geographical regions. This included the following tasks (and more):

  • Configuring the networking, routing, and firewall rules to enable the cluster to communicate.
  • Securing Elasticsearch management APIs from unauthorized access.
  • Creating load balancers for request distribution across data nodes.
  • Creating automatic scaling groups to replace the instances if there were issues.
  • Automating the configuration.
  • Managing the upgrades for security concerns.

If I had one word to describe the experience, it would be “painful.” Deploying and managing your own Elasticsearch clusters at scale takes a lot of time and knowledge to do it properly. Perhaps most importantly, it took my engineers away from doing what they do best—innovating and producing solutions for our customers.

Amazon Elasticsearch Service (Amazon ES) launched on October 1, 2015, almost 2 years after I joined AWS. Almost 5 years later, Amazon ES is in the best position ever to provide you with a compelling feature set that enables your search-related needs. With Amazon ES, you get a fully managed service that makes it easy to deploy, operate, and scale Elasticsearch clusters securely and cost-effectively in the AWS Cloud. Amazon ES offers direct access to the Elasticsearch APIs, which makes your existing code and applications using Elasticsearch work seamlessly with the service.

Amazon ES provisions all the resources for your Elasticsearch cluster and launches it in any Region of your choice within minutes. It automatically detects and replaces failed Elasticsearch nodes, which reduces the overhead associated with self-managed infrastructures. You can scale your cluster horizontally or vertically, up to 3 PB of data, with zero downtime through a single API call or a few clicks on the AWS Management Console. With this flexibility, Amazon ES can support any workload from single-node development clusters to production-scale, multi-node clusters.

Amazon ES also provides a robust set of Kibana plugins, free of any licensing fees. Features like fine-grained access control, alerting, index state management, and SQL support are just a few of the many examples. The Amazon ES feature set originates from the needs of customers such as yourself and through open-source community initiatives such as Open Distro for Elasticsearch (ODFE).

You need to factor several considerations into your decision to move to a managed service. Obviously, you want your teams focused on doing meaningful work that propels the growth of your company. Deciding what processes you offload to a managed service versus what are best self-managed can be a challenge. Based on my experience managing Elasticsearch at my prior employer, and having worked with thousands of customers who have migrated to AWS, I consider the following sections important topics for you to review.

Workloads

Before migrating to a managed service, you might look to what others are doing in their “vertical,” whether it be in finance, telecommunications, legal, ecommerce, manufacturing, or any number of other markets. You can take comfort in knowing that thousands of customers across these verticals successfully deploy their search, log analytics, SIEM, and other workloads on Amazon ES.

Elasticsearch is by default a search engine. Compass uses Amazon ES to scale their search infrastructure and build a complete, scalable, real estate home-search solution. By using industry-leading search and analytical tools, they make every listing in the company’s catalog discoverable to consumers and help real estate professionals find, market, and sell homes faster.

With powerful tools such as aggregations and alerting, Elasticsearch is widely used for log analytics workloads to gain insights into operational activities. As Intuit moves to a cloud hosting architecture, the company is on an “observability” journey to transform the way it monitors its applications’ health. Intuit used Amazon ES to build an observability solution, which provides visibility to its operational health across the platform, from containers to serverless applications.

When it comes to security, Sophos is a worldwide leader in next-generation cybersecurity, and protects its customers from today’s most advanced cyber threats. Sophos developed a large-scale security monitoring and alerting system using Amazon ES and other AWS components because they know Amazon ES is well suited for security use cases at scale.

Whether it be finding a home, detecting security events, or assisting developers with finding issues in applications, Amazon ES supports a wide range of use cases and workloads.

Cost

Any discussion around operational best practices has to factor in cost. With Amazon ES, you can select the optimal instance type and storage option for your workload with a few clicks on the console. If you’re uncertain of your compute and storage requirements, Amazon ES has on-demand pricing with no upfront costs or long-term commitments. When you know your workload requirements, you can lock in significant cost savings with Reserved Instance pricing for Amazon ES.

Compute and infrastructure costs are just one part of the equation. At AWS, we encourage customers to evaluate their Total Cost of Ownership (TCO) when comparing solutions. As an organizational decision-maker, you have to consider all the related cost benefits when choosing to replace your self-managed environment. Some of the factors I encourage my customers to consider are:

  • How much are you paying to manage the operation of your cluster 24/7/365?
  • How much do you spend on building the operational components, such as support processes, as well as automated and manual remediation procedures for clusters in your environment?
  • What are the license costs for advanced features?
  • What costs do you pay for networking between the clusters or the DNS services to expose your offerings?
  • How much do you spend on backup processes and how quickly can you recover from any failures?

The beauty of Amazon ES is that you do not need to focus on these issues. Amazon ES provides operational teams to manage your clusters, automated hourly backups of data for 14 days for your cluster, automated remediation of events with your cluster, and incremental license-free features as one of the basic tenants of the service.

You also need to pay particular close attention to managing the cost of storing your data in Elasticsearch. In the past, to keep storage costs from getting out of control, self-managed Elasticsearch users had to rely on solutions that were complicated to manage across data tiers and in some cases didn’t give you quick access to that data. AWS solved this problem with UltraWarm, a new, low-cost storage tier. UltraWarm lets you store and interactively analyze your data, backed by Amazon Simple Storage Service (Amazon S3) using Elasticsearch and Kibana, while reducing your cost per GB by almost 90% over existing hot storage options.

Security

In my conversations with customers, their primary concern is security. One data breech can cost millions and forever damage a company’s reputation. Providing you with the tools to secure your data is a critical component of our service. For your data in Amazon ES, you can do the following:

Many customers want to have a single sign-on environment when integrating with Kibana. Amazon ES offers Amazon Cognito authentication for Kibana. You can choose to integrate identity providers such as AWS Single Sign-On, PingFederate, Okta, and others. For more information, see Integrating Third-Party SAML Identity Providers with Amazon Cognito User Pools.

Recently, Amazon ES introduced fine-grained access control (FGAC). FGAC provides granular control of your data on Amazon ES. For example, depending on who makes the request, you might want a search to return results from only one index. You might want to hide certain fields in your documents or exclude certain documents altogether. FGAC gives you the power to control who sees what data exists in your Amazon ES domain.

Compliance

Many organizations need to adhere to a number of compliance standards. Those that have experienced auditing and certification activities know that ensuring compliance is an expensive, complex, and long process. However, by using Amazon ES, you benefit from the work AWS has done to ensure compliance with a number of important standards. Amazon ES is PCI DSS, SOC, ISO, and FedRamp compliant to help you meet industry-specific or regulatory requirements. Because Amazon ES is a HIPAA-eligible service, processing, storing and transmitting PHI can help you accelerate development of these sensitive workloads.

Amazon ES is part of the services in scope of the most recent assessment. You can build solutions on top of Amazon ES with the knowledge that independent auditors acknowledge that the service meets the bar for these important industry standards.

Availability and resiliency

When you build an Elasticsearch deployment either on premises or in cloud environments, you need to think about how your implementation can survive failures. You also need to figure out how you can recover from failures when they occur. At AWS, we like to plan for the fact that things do break, such as hardware failures and disk failures, to name a few.

Unlike virtually every other technology infrastructure provider, each AWS Region has multiple Availability Zones. Each Availability Zone consists of one or more data centers, physically separated from one another, with redundant power and networking. For high availability and performance of your applications, you can deploy applications across multiple Availability Zones in the same Region for fault tolerance and low latency. Availability Zones interconnect with fast, private fiber-optic networking, which enables you to design applications that automatically fail over between Availability Zones without interruption. Availability Zones are more highly available, fault tolerant and scalable than traditional single or multiple data center infrastructures.

Amazon ES offers you the option to deploy your instances across one, two, or three AZs. If you’re running development or test workloads, pick the single-AZ option. Those running production-grade workloads should use two or three Availability Zones.

For information, see Increase availability for Amazon Elasticsearch Service by Deploying in three Availability Zones. Additionally, deploying in multiple Availability Zones with dedicated master nodes means that you get the benefit of the Amazon ES SLA.

Operations

A 24/7/365 operational team with experience managing thousands of Elasticsearch clusters around the globe monitors Amazon ES. If you need support, you can get expert guidance and assistance across technologies from AWS Support to achieve your objectives faster at lower costs. I want to underline the importance of having a single source for support for your cloud infrastructure. Amazon ES doesn’t run in isolation, and having support for your entire cloud infrastructure from a single source greatly simplifies the support process. AWS also provides you with the option to use Enterprise level support plans, where you can have a dedicated technical account manager who essentially becomes a member of your team and is committed to your success with AWS.

Using tools on Amazon ES such as alerting, which provides you with a means to take action on events in your data, and index state management, which enables you to automate activities like rolling off logs, gives you additional operation features that you don’t need to build.

When it comes to monitoring your deployments, Amazon ES provides you with a wealth of Amazon CloudWatch metrics with which you can monitor all your Amazon ES deployments within a “single pane of glass.” For more information, see Monitoring Cluster Metrics with Amazon CloudWatch.

Staying current is another important topic. To enable access to newer versions of Elasticsearch and Kibana, Amazon ES offers in-place Elasticsearch upgrades for domains that run versions 5.1 and later. Amazon ES provides access to the most stable and current versions in the open-source community as long as the distribution passes our stringent security evaluations. Our service prides itself on the fact that we offer you a version that has passed our own internal AWS security reviews.

AWS integrations and other benefits

AWS has a broad range of services that seamlessly integrate with Amazon ES. Like many customers, you may want to monitor the health and performance of your native cloud services on AWS. Most AWS services log events into Amazon CloudWatch Logs. You can configure a log group to stream data it receives to your Amazon ES domain through a CloudWatch Logs subscription.

The volume of log data can be highly variable, and you should consider buffering layers when operating at a large scale. Buffering allows you to design stability into your processes. When designing for scale, this is one of easiest ways I know to avoid overwhelming your cluster with spikey ingestion events. Amazon Kinesis Data Firehose has a direct integration with Amazon ES and offers buffering and retries as part of the service. You configure Amazon ES as a destination through a few simple settings, and data can begin streaming to your Amazon ES domain.

Increased speed and agility

When building new products and tuning existing solutions, you need to be able to experiment. As part of that experimentation, failing fast is an accepted process that gives your team the ability to try new approaches to speed up the pace of innovation. Part of that process involves using services that allow you to create environments quickly and, should the experiment fail, start over with a new approach or use different features that ultimately allow you to achieve the desired results.

With Amazon ES, you receive the benefit of being able to provision an entire Elasticsearch cluster, complete with Kibana, on the “order of minutes” in a secure, managed environment. If your testing doesn’t produce the desired results, you can change the dimensions of your cluster horizontally or vertically using different instance offerings within the service via a single API call or a few clicks on the console.

When it comes to deploying your environment, native tools such as Amazon CloudFormation provide you with deployment tooling that gives you the ability to create entire environments through configuration scripting via JSON or YAML. The AWS Command Line Interface (AWS CLI) provides command line tooling that also can spin up domains with a small set of commands. For those who want to ride the latest wave in scripting their environments, the AWS CDK has a module for Amazon ES.

Conclusion

Focusing your teams on doing important, innovative work creating products and services that differentiate your company is critical. Amazon ES is an essential tool to provide operational stability, security, and performance of your search and analytics infrastructure. When you consider the following benefits Amazon ES provides, the decision to migrate is simple:

  • Support for search, log analytics, SIEM, and other workloads
  • Innovative functionality using UltraWarm to help you manage your costs
  • Highly secure environments that address PCI and HIPAA workloads
  • Ability to offload operational processes to an experienced provider that knows how to operate Elasticsearch at scale
  • Plugins at no additional cost that provide fine-grained access, vector-based similarity algorithms, or alerting and monitoring with the ability to automate incident response.

You can get started using Amazon ES with the AWS Free Tier. This tier provides free usage of up to 750 hours per month of a t2.small.elasticsearch instance and 10 GB per month of optional EBS storage (Magnetic or General Purpose).

Over the course of the next few months, I’ll be co-hosting a series of posts that introduce migration patterns to help you move to Amazon ES. Additionally, AWS has a robust partner ecosystem and a professional services team that provide you with skilled and capable people to assist you with your migration.

 


About the Author

Kevin Fallis (@AWSCodeWarrior) is an AWS specialist search solutions architect.His passion at AWS is to help customers leverage the correct mix of AWS services to achieve success for their business goals. His after-work activities include family, DIY projects, carpentry, playing drums, and all things music.

 

Best practices for configuring your Amazon Elasticsearch Service domain

Post Syndicated from Jon Handler original https://aws.amazon.com/blogs/big-data/best-practices-for-configuring-your-amazon-elasticsearch-service-domain/

Amazon Elasticsearch Service (Amazon ES) is a fully managed service that makes it easy to deploy, secure, scale, and monitor your Elasticsearch cluster in the AWS Cloud. Elasticsearch is a distributed database solution, which can be difficult to plan for and execute. This post discusses some best practices for deploying Amazon ES domains.

The most important practice is to iterate. If you follow these best practices, you can plan for a baseline Amazon ES deployment. Elasticsearch behaves differently for every workload—its latency and throughput are largely determined by the request mix, the requests themselves, and the data or queries that you run. There is no deterministic rule that can 100% predict how your workload will behave. Plan for time to tune and refine your deployment, monitor your domain’s behavior, and adjust accordingly.

Deploying Amazon ES

Whether you deploy on the AWS Management Console, in AWS CloudFormation, or via Amazon ES APIs, you have a wealth of options to configure your domain’s hardware, high availability, and security features. This post covers best practices for choosing your data nodes and your dedicated master nodes configuration.

When you configure your Amazon ES domain, you choose the instance type and count for data and the dedicated master nodes. Elasticsearch is a distributed database that runs on a cluster of instances or nodes. These node types have different functions and require different sizing. Data nodes store the data in your indexes and process indexing and query requests. Dedicated master nodes don’t process these requests; they maintain the cluster state and orchestrate. This post focuses on instance types. For more information about instance sizing for data nodes, see Get started with Amazon Elasticsearch Service: T-shirt-size your domain. For more information about instance sizing for dedicated master nodes, see Get Started with Amazon Elasticsearch Service: Use Dedicated Master Instances to Improve Cluster Stability.

Amazon ES supports five instance classes: M, R, I, C, and T. As a best practice, use the latest generation instance type from each instance class. As of this writing, these are the M5, R5, I3, C5, and T2.

Choosing your instance type for data nodes

When choosing an instance type for your data nodes, bear in mind that these nodes carry all the data in your indexes (storage) and do all the processing for your requests (CPU). As a best practice, for heavy production workloads, choose the R5 or I3 instance type. If your emphasis is primarily on performance, the R5 typically delivers the best performance for log analytics workloads, and often for search workloads. The I3 instances are strong contenders and may suit your workload better, so you should test both. If your emphasis is on cost, the I3 instances have better cost efficiency at scale, especially if you choose to purchase reserved instances.

For an entry-level instance or a smaller workload, choose the M5s. The C5s are a specialized instance, relevant for heavy query use cases, which require more CPU work than disk or network. Use the T2 instances for development or QA workloads, but not for production. For more information about how many instances to choose, and a deeper analysis of the data handling footprint, see Get started with Amazon Elasticsearch Service: T-shirt-size your domain.

Choosing your instance type for dedicated master nodes

When choosing an instance type for your dedicated master nodes, keep in mind that these nodes are primarily CPU-bound, with some RAM and network demand as well. The C5 instances work best as dedicated masters up to about 75 data node clusters. Above that node count, you should choose R5.

Choosing Availability Zones

Amazon ES makes it easy to increase the availability of your cluster by using the Zone Awareness feature. You can choose to deploy your data and master nodes in one, two, or three Availability Zones. As a best practice, choose three Availability Zones for your production deployments.

When you choose more than one Availability Zone, Amazon ES deploys data nodes equally across the zones and makes sure that replicas go into different zones. Additionally, when you choose more than one Availability Zone, Amazon ES always deploys dedicated master nodes in three zones (if the Region supports three zones). Deploying into more than one Availability Zone gives your domain more stability and increases your availability.

Elasticsearch index and shard design

When you use Amazon ES, you send data to indexes in your cluster. An index is like a table in a relational database. Each search document is like a row, and each JSON field is like a column.

Amazon ES partitions your data into shards, with a random hash by default. You must configure the shard count, and you should use the best practices in this section.

Index patterns

For log analytics use cases, you want to control the life cycle of data in your cluster. You can do this with a rolling index pattern. Each day, you create a new index, then archive and delete the oldest index in the cluster. You define a retention period that controls how many days (indexes) of data you keep in the domain based on your analysis needs. For more information, see Index State Management.

Setting your shard counts

There are two types of shards: primary and replica. The primary shard count defines how many partitions of data Elasticsearch creates. The replica count specifies how many additional copies of the primary shards it creates. You set the primary shard count at index creation and you can’t change it (there are ways, but it’s not recommended to use the _shrink or _split API for clusters under load at scale). You also set the replica count at index creation, but you can change the replica count on the fly and Elasticsearch adjusts accordingly by creating or removing replicas.

You can set the primary and replica shard counts if you create the index manually, with a POST command. A better way for log analytics is to set an index template. See the following code:

PUT _template/<template_name>
{
    "index_patterns": ["logs-*"],
    "settings": {
        "number_of_shards": 10,
        "number_of_replicas": 1
    },
    "mappings": {
        …
    }
}

When you set a template like this, every index that matches the index_pattern has the settings and the mapping (if you specify one) applied to that index. This gives you a convenient way of managing your shard strategy for rolling indexes. If you change your template, you get your new shard count in the next indexing cycle.

You should set the number_of_shards based on your source data size, using the following guideline: primary shard count = (daily source data in bytes * 1.25) / 50 GB.

For search use cases, where you’re not using rolling indexes, use 30 GB as the divisor, targeting 30 GB shards. However, these are guidelines. Always test with your own data, indexing, and queries to find your optimal shard size.

You should try to align your shard and instance counts so that your shards distribute equally across your nodes. You do this by adjusting shard counts or data node counts so that they are evenly divisible. For example, the default settings for Elasticsearch versions 6 and below are 5 primary shards and 1 replica (a total of 10 shards). You can get even distribution by choosing 2, 5, or 10 data nodes. Although it’s important to distribute your workload evenly on your data nodes, it’s not always possible to get every index deployed equally. Use the shard size as the primary guide for shard count and make small (< 20%) adjustments, generally favoring more instances or smaller shards, based on even distribution.

Determining storage size

So far, you’ve mapped out a shard count, based on the storage needed. Now you need to make sure that you have sufficient storage and CPU resources to process your requests. First, find your overall storage need: storage needed = (daily source data in bytes * 1.25) * (number_of_replicas + 1) * number of days retention.

You multiply your unreplicated index size by the number of replicas and days of retention to determine the total storage needed. Each replica adds an additional storage need equal to the primary storage size. You add this again for every day you want to retain data in the cluster. For search use cases, set the number of days of retention to 1.

The total storage need drives a minimum on the instance type and instance based on the maximum storage that instance provides. If you’re using EBS-backed instances like the M5 or R5, you can deploy EBS volumes up to the supported limit. For more information, see Amazon Elasticsearch Service Limits.

For instances with ephemeral store, storage is limited by the instance type (for example, I3.8xlarge.elasticsearch has 7.8 TB of attached storage). If you choose EBS, you should use the general purpose, GP2, volume type. Although the service does support the io1 volume type and provisioned IOPS, you generally don’t need them. Use provisioned IOPS only in special circumstances, when metrics support it.

Take the total storage needed and divide by the maximum storage per instance of your chosen instance type to get the minimum instance count.

After you have an instance type and count, make sure you have sufficient vCPUs to process your requests. Multiply the instance count by the vCPUs that instance provides. This gives you a total count of vCPUs in the cluster. As an initial scale point, make sure that your vCPU count is 1.5 times your active shard count. An active shard is any shard for an index that is receiving substantial writes. Use the primary shard count to determine active shards for indexes that are receiving substantial writes. For log analytics, only the current index is active. For search use cases, which are read heavy, use the primary shard count.

Although 1.5 is recommended, this is highly workload-dependent. Be sure to test and monitor CPU utilization and scale accordingly.

As you work with shard and instance counts, bear in mind that Amazon ES works best when the total shard count is as small as possible—fewer than 10,000 is a good soft limit. Each instance should also have no more than 25 shards total per GB of JVM heap on that instance. For example, the R5.xlarge has 32 GB of RAM total. The service allocates half the RAM (16 GB) for the heap (the maximum heap size for any instance is 31.5 GB). You should never have more than 400 = 16 * 25 shards on any node in that cluster.

Use case

Assume you have a log analytics workload supporting Apache web logs (500 GB/day) and syslogs (500 GB/day), retained for 7 days. This post focuses on the R5 instance type as the best choice for log analytics. You use a three-Availability Zone deployment, one primary and two replicas per index. With a three-zone deployment, you have to deploy nodes in multiples of three, which drives instance count and, to some extent, shard count.

The primary shard count for each index is (500 * 1.25) / 50 GB = 12.5 shards, which you round to 15. Using 15 primaries allows additional space to grow in each shard and is divisible by three (the number of Availability Zones, and therefore the number of instances, are a multiple of 3). The total storage needed is 1,000 * 1.25 * 3 * 7 = 26.25 TB. You can provide that storage with 18x R5.xlarge.elasticsearch, 9x R5.2xlarge.elasticsearch, or 6x R5.4xlarge.elasticsearch instances (based on EBS limits of 1.5 TB, 3 TB, and 6 TB, respectively). You should pick the 4xlarge instances, on the general guideline that vertical scaling is usually higher performance than horizontal scaling (there are many exceptions to this general rule, so make sure to iterate appropriately).

Having found a minimum deployment, you now need to validate the CPU count. Each index has 15 primary shards and 2 replicas, for a total of 45 shards. The most recent indexes receive substantial write, so each has 45 active shards, giving a total of 90 active shards. You ignore the other 6 days of indexes because they are infrequently accessed. For log analytics, you can assume that your read volume is always low and drops off as the data ages. Each R5.4xlarge.elasticsearch has 16 vCPUs, for a total of 96 in your cluster. The best practice guideline is 135 = 90 * 1.5 vCPUs needed. As a starting scale point, you need to increase to 9x R5.4xlarge.elasticsearch, with 144 vCPUs. Again, testing may reveal that you’re over-provisioned (which is likely), and you may be able to reduce to six. Finally, given your data node and shard counts, provision 3x C5.large.elasticsearch dedicated master nodes.

Conclusion

This post covered some of the core best practices for deploying your Amazon ES domain. These guidelines give you a reasonable estimate of the number and type of data nodes. Stay tuned for subsequent posts that cover best practices for deploying secure domains, monitoring your domain’s performance, and ingesting data into your domain.

 


About the Author

Jon Handler (@_searchgeek) is a Principal Solutions Architect at Amazon Web Services based in Palo Alto, CA. Jon works closely with the CloudSearch and Elasticsearch teams, providing help and guidance to a broad range of customers who have search workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included four years of coding a large-scale, eCommerce search engine.

 

 

 

General Availability of UltraWarm for Amazon Elasticsearch Service

Post Syndicated from Martin Beeby original https://aws.amazon.com/blogs/aws/general-availability-of-ultrawarm-for-amazon-elasticsearch-service/

Today, we are happy to announce the general availability of UltraWarm for Amazon Elasticsearch Service.

This new low-cost storage tier provides fast, interactive analytics on up to three petabytes of log data at one-tenth of the cost of the current Amazon Elasticsearch Service storage tier.

UltraWarm, complements the existing Amazon Elasticsearch Service hot storage tier by providing less expensive storage for older and less-frequently accessed data while still ensuring that snappy, interactive experience that Amazon Elasticsearch Service customers have come to expect. Amazon Elasticsearch Service stores data in Amazon S3 while using custom, highly-optimized nodes, purpose-built on the AWS Nitro System, to cache, pre-fetch, and query that data.

There are many use cases for the Amazon Elasticsearch Service, from building a search system for your website, storing, and analyzing data from application or infrastructure logs. We think this new storage tier will work particularly well for customers that have large volumes of log data.

Amazon Elasticsearch Service is a popular service for log analytics because of its ability to ingest high volumes of log data and analyze it interactively. As more developers build applications using microservices and containers, there is an explosive growth of log data. Storing and analyzing months or even years worth of data is cost-prohibitive at scale, and this has led customers to use multiple analytics tools, or delete valuable data, missing out on essential insights that the longer-term data could yield.

AWS built UltraWarm to solve this problem and ensures that developers, DevOps engineers, and InfoSec experts can analyze recent and longer-term operational data without needing to spend days restoring data from archives to an active searchable state in an Amazon Elasticsearch Service cluster.

So let’s take a look at how you use this new storage tier by creating a new domain in the AWS Management Console.

Firstly, I go to the Amazon Elasticsearch Service console and click on the button to Create a new domain. This then takes me through a workflow to set up a new cluster, for the most part setting up a new domain with UltraWarm is identical to setting up a regular domain, I will point out the couple of things you will need to do differently.

On Step 1 of the workflow, I click on the radio button to create a Production deployment type and click Next.

I continue to fill out the configuration in Step 2. Then, near the end, I check the box to Enable UltraWarm data nodes and select the Instance type I want to use. I go with the default ultrawarm1.medium.elasticsearch and then ask for 3 of them, there is a requirement to have at least 2 nodes.

Everything else about the setup is identical to a regular Amazon Elasticsearch Service setup. After having set up the cluster, I then go to the dashboard and select my newly created domain. The dashboard confirms that my newly created domain has 3 UltraWarm data nodes, each with 1516 (GiB) free storage space.

As well as using UltraWarm on a new domain, you can also enable it for existing domains using the AWS Management Console, CLI, or SDK.

Once you have UltraWarm nodes setup, you can migrate an index from hot to warm by using the following request.

POST _ultrawarm/migration/my-index/_warm

You can then check the status of the migration by using the following request.

GET _ultrawarm/migration/my-index/_status
{
  "migration_status": {
    "index": "my-index",
    "state": "RUNNING_SHARD_RELOCATION",
    "migration_type": "HOT_TO_WARM",
    "shard_level_status": {
      "running": 0,
      "total": 5,
      "pending": 3,
      "failed": 0,
      "succeeded": 2
    }
  }
}

UltraWarm is available today on Amazon Elasticsearch Service version 6.8 and above in 22 regions globally.

Happy Searching

— Martin

Creating a searchable enterprise document repository

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/creating-a-searchable-enterprise-document-repository/

Enterprise customers frequently have repositories with thousands of documents, images and other media. While these contain valuable information for users, it’s often hard to index and search this content. One challenge is interpreting the data intelligently, and another issue is processing these files at scale.

In this blog post, I show how you can deploy a serverless application that uses machine learning to interpret your documents. The architecture includes a queueing mechanism for handling large volumes, and posting the indexing metadata to an Amazon Elasticsearch domain. This solution is scalable and cost effective, and you can modify the functionality to meet your organization’s requirements.

The overall architecture for a searchable document repository solution.

The application takes unstructured data and applies machine learning to extract context for a search service, in this case Elasticsearch. There are many business uses for this design. For example, this could help Human Resources departments in searching for skills in PDF resumes. It can also be used by Marketing departments or creative groups with a large number of images, providing a simple way to search for items within the images.

As documents and images are added to the S3 bucket, these events invoke AWS Lambda functions. This runs custom code to extract data from the files, and also calls Amazon ML services for interpretation. For example, when adding a resume, the Lambda function extracts text from the PDF file while Amazon Comprehend determines the key phrases and topics in the document. For images, it uses Amazon Rekognition to determine the contents. In both cases, once it identifies the indexing attributes, it saves the data to Elasticsearch.

The code uses the AWS Serverless Application Model (SAM), enabling you to deploy the application easily in your own AWS Account. This walkthrough creates resources covered in the AWS Free Tier but you may incur cost for large data imports. Additionally, it requires an Elasticsearch domain, which may incur cost on your AWS bill.

To set up the example application, visit the GitHub repo and follow the instructions in the README.md file.

Creating an Elasticsearch domain

This application requires an Elasticsearch development domain for testing purposes. To learn more about production configurations, see the Elasticsearch documentation.

To create the test domain:

  1. Navigate to the Amazon Elasticsearch console. Choose Create a new domain.
  2. For Deployment type, choose Development and testing. Choose Next.
  3. In the Configure Domain page:
    1. For Elasticsearch domain name, enter serverless-docrepo.
    2. Change Instance Type to t2.small.elasticsearch.
    3. Leave all the other defaults. Choose Next at the end of the page.
  4. In Network Configuration, choose Public access. This is adequate for a tutorial but in a production use-case, it’s recommended to use VPC access.
  5. Under Access Policy, in the Domain access policy dropdown, choose Custom access policy.
  6. Select IAM ARN and in the Enter principal field, enter your AWS account ID (learn how to find your AWS account ID). In the Select Action dropdown, select Allow.
  7. Under Encryption, leave HTTPS checked. Choose Next.
  8. On the Review page, review your domain configuration, and then choose Confirm.

Your domain is now being configured, and the Domain status shows Loading in the Overview tab.

After creating the Elasticsearch domain, the domain status shows ‘Loading’.

It takes 10-15 minutes to fully configure the domain. Wait until the Domain status shows Active before continuing. When the domain is ready, note the Endpoint address since you need this in the application deployment.

The Endpoint for an Elasticsearch domain.

Deploying the application

After cloning the repo to your local development machine, open a terminal window and change to the cloned directory.

  1. Run the SAM build process to create an installation package, and then deploy the application:
    sam build
    sam deploy --guided
  2. In the guided deployment process, enter unique names for the S3 buckets when prompted. For the ESdomain parameter, enter the Elasticsearch domain Endpoint from the previous section.
  3. After the deployment completes, note the Lambda function ARN shown in the output:
    Lambda function ARN output.
  4. Back in the Elasticsearch console, select the Actions dropdown and choose Modify Access Policy. Paste the Lambda function ARN as an AWS Principal in the JSON, in addition to the root user, as follows:
    Modify access policy for Elasticsearch.
  5. Choose Submit. This grants the Lambda function access to the Elasticsearch domain.

Testing the application

To test the application, you need a few test documents and images with the file types DOCX (Microsoft Word), PDF, and JPG. This walkthrough uses multiple files to illustrate the queuing process.

  1. Navigate to the S3 console and select the Documents bucket from your deployment.
  2. Choose Upload and select your sample PDF or DOCX files:
    Upload files to S3.
  3. Choose Next on the following three pages to complete the upload process. The application now analyzes these documents and adds the indexing information to Elasticsearch.

To query Elasticsearch, first you must generate an Access Key ID and Secret Access Key. For detailed steps, see this documentation on creating security credentials. Next, use Postman to create an HTTP request for the Elasticsearch domain:

  1. Download and install Postman.
  2. From Postman, enter the Elasticsearch endpoint, adding /_search?q=keyword. Replace keyword with a search term.
  3. On the Authorization tab, complete the Access Key ID and Secret Access Key fields with the credentials you created. For Service Name, enter es.Postman query with AWS authorization.
  4. Choose Send. Elasticsearch responds with document results matching your search term.REST response from Elasticsearch.

How this works

This application creates a processing pipeline between the originating Documents bucket and the Elasticsearch domain. Each document type has a custom parser, preparing the content in a Queuing bucket. It uses an Amazon SQS queue to buffer work, which is fetched by a Lambda function and analyzed with Amazon Comprehend. Finally, the indexing metadata is saved in Elasticsearch.

Serverless architecture for text extraction and classification.

  1. Documents and images are saved in the Documents bucket.
  2. Depending upon the file type, this triggers a custom parser Lambda function.
    1. For PDFs and DOCX files, the extracted text is stored in a Staging bucket. If the content is longer than 5,000 characters, it is broken into smaller files by a Lambda function, then saved in the Queued bucket.
    2. For JPG files, the parser uses Amazon Rekognition to detect the contents of the image. The labeling metadata is stored in the Queued bucket.
  3. When items are stored in the Queued bucket, this triggers a Lambda function to add the job to an SQS queue.
  4. The Analyze function is invoked when there are messages in the SQS queue. It uses Amazon Comprehend to find entities in the text. This function then stores the metadata in Elasticsearch.

S3 and Lambda both scale to handle the traffic. The Elasticsearch domain is not serverless, however, so it’s possible to overwhelm this instance with requests. There may be a large number of objects stored in the Documents bucket triggering the workflow, so the application uses SQS couple to smooth out the traffic. When large numbers of objects are processed, you see the Messages Available increase in the SQS queue:

SQS queue buffers messages to smooth out traffic.
For the Lambda function consuming messages from the SQS queue, the BatchSize configured in the SAM template controls the rate of processing. The function continues to fetch messages from the SQS queue while Messages Available is greater than zero. This can be a useful mechanism for protecting downstream services that are not serverless, or simply cannot scale to match the processing rates of the upstream application. In this case, it provides a more consistent flow of indexing data to the Elasticsearch domain.

In a production use-case, you would scale the Elasticsearch domain depending upon the load of search requests, not just the indexing traffic from this application. This tutorial uses a minimal Elasticsearch configuration, but this service is capable of supporting enterprise-scale use-cases.

Conclusion

Enterprise document repositories can be a rich source of information but can be difficult to search and index. In this blog post, I show how you can use a serverless approach to build scalable solution easily. With minimum code, we can use Amazon ML services to create the indexing metadata. By using powerful image recognition and language comprehension capabilities, this makes the metadata more useful and the search solution more accurate.

This also shows how serverless solutions can be used with existing non-serverless infrastructure, like Elasticsearch. By decoupling scalable serverless applications, you can protect downstream services from heavy traffic loads, even as Lambda scales up. Elasticsearch provides a fast, easy way to query your document repository once the serverless application has completed the indexing process.

To learn more about how to use Elasticsearch for production workloads, see the documentation on managing domains.

Announcing UltraWarm (Preview) for Amazon Elasticsearch Service

Post Syndicated from Steve Roberts original https://aws.amazon.com/blogs/aws/announcing-ultrawarm-preview-for-amazon-elasticsearch-service/

Today, we are excited to announce UltraWarm, a fully managed, low-cost, warm storage tier for Amazon Elasticsearch Service. UltraWarm is now available in preview and takes a new approach to providing hot-warm tiering in Amazon Elasticsearch Service, offering up to 900TB of storage, at almost a 90% cost reduction over existing options. UltraWarm is a seamless extension to the Amazon Elasticsearch Service experience, enabling you to query and visualize across both hot and UltraWarm data, all from your familiar Kibana interface. UltraWarm data can be queried using the same APIs and tools you use today, and also supports popular Amazon Elasticsearch Service features like encryption at rest and in flight, integrated alerting, SQL querying, and more.

A popular use case for our customers of Amazon Elasticsearch Service is to ingest and analyze high (and increasingly growing) volumes of machine-generated log data. However, those customers tell us that they want to perform real-time analysis on more of this data, so they can use it to help quickly resolve operational and security issues. Storage and analysis of months, or even years, of data has been cost prohibitive for them at scale, causing some to turn to use multiple analytics tools, while others simply delete valuable data, missing out on insights. UltraWarm, with its cost-effective storage backed by Amazon Simple Storage Service (S3), helps solve this problem, enabling customers to retain years of data for analysis.

With the launch of UltraWarm, Amazon Elasticsearch Service supports two storage tiers, hot and UltraWarm. The hot tier is used for indexing, updating, and providing the fastest access to data. UltraWarm complements the hot tier to add support for high volumes of older, less-frequently accessed, data to enable you to take advantage of a lower storage cost. As I mentioned earlier, UltraWarm stores data in S3 and uses custom, highly-optimized nodes, built on the AWS Nitro System, to cache, pre-fetch, and query that data. This all contributes to providing an interactive experience when querying and visualizing data.

The UltraWarm preview is now available to all customers in the US East (N. Virginia) and US West (Oregon) Regions. The UltraWarm tier is available with a pay-as-you-go pricing model, charging for the instance hours for your node, and utilized storage. The UltraWarm preview can be enabled on new Amazon Elasticsearch Service version 6.8 domains. To learn more, visit the technical documentation.

— Steve

 

Analyzing AWS WAF logs with Amazon ES, Amazon Athena, and Amazon QuickSight

Post Syndicated from Aaron Franco original https://aws.amazon.com/blogs/big-data/analyzing-aws-waf-logs-with-amazon-es-amazon-athena-and-amazon-quicksight/

AWS WAF now includes the ability to log all web requests inspected by the service. AWS WAF can store these logs in an Amazon S3 bucket in the same Region, but most customers deploy AWS WAF across multiple Regions—wherever they also deploy applications. When analyzing web application security, organizations need the ability to gain a holistic view across all their deployed AWS WAF Regions.

This post presents a simple approach to aggregating AWS WAF logs into a central data lake repository, which lets teams better analyze and understand their organization’s security posture. I walk through the steps to aggregate regional AWS WAF logs into a dedicated S3 bucket. I follow that up by demonstrating how you can use Amazon ES to visualize the log data. I also present an option to offload and process historical data using AWS Glue ETL. With the data collected in one place, I finally show you how you can use Amazon Athena and Amazon QuickSight to query historical data and extract business insights.

Architecture overview

The case I highlight in this post is the forensic use of the AWS WAF access logs to identify distributed denial of service (DDoS) attacks by a client IP address. This solution provides your security teams with a view of all incoming requests hitting every AWS WAF in your infrastructure.

I investigate what the IP access patterns look like over time and assess which IP addresses access the site multiple times in a short period of time. This pattern suggests that the IP address could be an attacker. With this solution, you can identify DDoS attackers for a single application, and detect DDoS patterns across your entire global IT infrastructure.

Walkthrough

This solution requires separate tasks for architecture setup, which allows you to begin receiving log files in a centralized repository, and analytics, which processes your log data into useful results.

Prerequisites

To follow along, you must have the following resources:

  • Two AWS accounts. Following AWS multi-account best practices, create two accounts:
    • A logging account
    • A resource account that hosts the web applications using AWS WAFFor more information about multi-account setup, see AWS Landing Zone. Using multiple accounts isolates your logs from your resource environments. This helps maintain the integrity of your log files and provides a central access point for auditing all application, network, and security logs.
  • The ability to launch new resources into your account. The resources might not be eligible for Free Tier usage and so might incur costs.
  • An application running with an Application Load Balancer, preferably in multiple Regions. If you do not already have one, you can launch any AWS web application reference architecture to test and implement this solution.

For this walkthrough, you can launch an Amazon ECS example from the ecs-refarch-cloudformation GitHub repo. This is a “one click to deploy” example that automatically sets up a web application with an Application Load Balancer. Launch this in two different Regions to simulate a global infrastructure. You ultimately set up a centralized bucket that both Regions log into, which your forensic analysis tools then draw from. Choose Launch Stack to launch the sample application in your Region of choice.

Setup

Architecture setup allows you to begin receiving log files in a centralized repository.

Step 1: Provide permissions

Begin this process by providing appropriate permissions for one account to access resources in another. Your resource account needs cross-account permission to access the bucket in the logging account.

  1. Create your central logging S3 bucket in the logging account and attach the following bucket policy to it under the Permissions Make a note of the bucket’s ARN. You need this information for future steps.
  2. Change RESOURCE-ACCOUNT-ID and CENTRAL-LOGGING-BUCKET-ARNto the correct values based on the actual values in your accounts:
     // JSON Document
     {
       "Version": "2012-10-17",
       "Statement": [
          {
             "Sid": "Cross Account AWS WAF Account 1",
             "Effect": "Allow",
             "Principal": {
                "AWS": "arn:aws:iam::RESOURCE-ACCOUNT-ID:root"
             },
             "Action": [
                "s3:GetObject",
                "s3:PutObject"
             ],
             "Resource": [
                "CENTRAL-LOGGING-BUCKET-ARN/*"
             ]
          }
       ]
    }

Step 2: Manage Lambda permissions

Next, the Lambda function that you create in your resource account needs permissions to access the S3 bucket in your central logging account so it can write files to that location. You already provided basic cross-account access in the previous step, but Lambda still needs the granular permissions at the resources level. Remember to grant these permissions in both Regions where you launched the application that you intend to monitor with AWS WAF.

  1. Log in to your resource account.
  2. To create an IAM role for the Lambda function, in the Lambda console, choose Policies, Create Policy.
  3. Choose JSON, and enter the following policy document. Replace YOUR-SOURCE-BUCKETand YOUR-DESTINATION-BUCKET with the relative ARNs of the buckets that you are using for this walkthrough.
    // JSON document
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListSourceAndDestinationBuckets",
                "Effect": "Allow",
                "Action": [
                    "s3:ListBucket",
                    "s3:ListBucketVersions"
                ],
                "Resource": [
                    "YOUR-SOURCE-BUCKET",
                    "YOUR-DESTINATION-BUCKET"
                ]
            },
            {
                "Sid": "SourceBucketGetObjectAccess",
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:GetObjectVersion"
                ],
                "Resource": "YOUR-SOURCE-BUCKET/*"
            },
            {
                "Sid": "DestinationBucketPutObjectAccess",
                "Effect": "Allow",
                "Action": [
                    "s3:PutObject"
                ],
                "Resource": "YOUR-DESTINATION-BUCKET/*"
            }
        ]
     }

  4. Choose Review policy, enter your policy name, and save it.
  5. With the policy created, create a new role for your Lambda function and attach the custom policy to that role. To do this, navigate back to the IAM dashboard.
  6. Select Create roleand choose Lambda as the service that uses the role. Select the custom policy that you created earlier in this step and choose Next. You can add tags if required and then name and create this new role.
  7. You must also add S3 as a trusted entity in the Trust Relationship section of the role. Choose Edit trust relationship and add amazonaws.com to the policy, as shown in the following example.

Lambda and S3 now appear as trusted entities under the Trust relationships tab, as shown in the following screenshot.

Step 3: Create a Lambda function and copy log files

Create a Lambda function in the same Region as your resource account’s S3 bucket. This function reads log files from the resource account bucket and then copies that content to the logging account’s bucket. Repeat this step for every Region where you launched the application that you intend to monitor with AWS WAF.

  1. Log in to your resource account.
  2. Navigate to Lambda in your console and choose Create Function.
  3. Choose the Author from scratch function and name it. Choose the IAM role you created in the previous step and attach it to the Lambda function.
  4. Choose Create function.
  5. This Lambda function receives a document from S3 that contains nested JSON string data. To handle this data, you must extract the JSON from this string to retrieve key names of both the document and the bucket. Your function then uses this information to copy the data to your central logging account bucket in the next step. To create this function, Copy and paste this code into the Lambda function that you created. Replace the bucket names with the names of the buckets that you created earlier. After you decide on a partitioning strategy, modify this script later.
    // Load the AWS SDK
    const aws = require('aws-sdk');
    
    // Construct the AWS S3 Object 
    const s3 = new aws.S3();
    
    //Main function
    exports.handler = (event, context, callback) => {
        console.log("Got WAF Item Event")
        var _srcBucket = event.Records[0].s3.bucket.name;
        let _key = event.Records[0].s3.object.key;
        let _keySplit = _key.split("/")
        let _objName = _keySplit[ (_keySplit.length - 1) ];
        let _destPath = _keySplit[0]+"/"+_keySplit[1]+"/YOUR-DESTINATION-BUCKET/"+_objName;
        let _sourcePath = _srcBucket + "/" + _key;
        console.log(_destPath)
        let params = { Bucket: destBucket, ACL: "bucket-owner-full-control", CopySource: _sourcePath, Key: _destPath };
        s3.copyObject(params, function(err, data) {
            if (err) {
                console.log(err, err.stack);
            } else {
                console.log("SUCCESS!");
            }
        });
        callback(null, 'All done!');
    };

Step 4: Set S3 to Lambda event triggers

This step sets up event triggers in your resource account’s S3 buckets. These triggers send the file name and location logged by AWS WAF logs to the Lambda function. The triggers also notify the Lambda function that it must move the newly arrived file into your central logging bucket. Repeat this step for every Region where you launched the application that you intend to monitor with AWS WAF.

  1. Go to the S3 dashboard and choose your S3 bucket, then choose the Properties Under Advanced settings, choose Events.
  2. Give your event a name and select PUT from the Events check boxes.
  3. Choose Lambda from the Send To option and select your Lambda function as the destination for the event.

Step 5: Add AWS WAF to the Application Load Balancer

Add an AWS WAF to the Application Load Balancer so that you can start logging events. You can optionally delete the original log file after Lambda copies it. This reduces costs, but your business and security needs might err on the side of retaining that data.

Create a separate prefix for each Region in your central logging account bucket waf-central-logs so that AWS Glue can properly partition them. For best practices of partitioning with AWS Glue, see Working with partitioned data in AWS Glue. AWS Glue ingests your data and stores it in a columnar format optimized for querying in Amazon Athena. This helps you visualize the data and investigate the potential attacks.

Repeat this step for every Region where you launched the application that you intend to monitor with AWS WAF. The procedure assumes that you already have an AWS WAF enabled that you can use for this exercise. To move forward with the next step, you need AWS WAF enabled and connected to Amazon Kinesis Data Firehose for log delivery.

Setting up and configuring AWS WAF

If you don’t already have a web ACL in place, set up and configure AWS WAF at this point. This solution handles logging data from multiple AWS WAF logs in multiple Regions from more than one account.

To do this efficiently, you should consider your partitioning strategy for the data. You can grant your security teams a comprehensive view of the network. Create each partition based on the Kinesis Data Firehose delivery stream for the specific AWS WAF associated with the Application Load Balancer. This partitioning strategy also allows the security team to view the logs by Region and by account. As a result, your S3 bucket name and prefix look similar to the following example:

s3://central-waf-logs/<account_id>/<region_name>/<kinesis_firehose_name>/...filename...

Step 6: Copying logs with Lambda code

This step updates the Lambda function to start copying log files. Keep your partitioning strategy in mind as you update the Lambda function. Repeat this step for every Region where you launched the application that you intend to monitor with AWS WAF.

To accommodate the partitioning, modify your Lambda code to match the examples in the GitHub repo.

Replace <kinesis_firehose_name> in the example code with the name of the Kinesis Data Firehose delivery stream attached to the AWS WAF. Replace <central logging bucket name> with the S3 bucket name from your central logging account.

Kinesis Data Firehose should now begin writing files to your central S3 logging bucket with the correct partitioning. To generate logs, access your web application.

Analytics

Now that Kinesis Data Firehose can write collected files into your logging account’s S3 bucket, create an Elasticsearch cluster in your logging account in the same Region as the central logging bucket. You also must create a Lambda function to handle S3 events as the central logging bucket receives new log files. This creates a connection between your central log files and your search engine. Amazon ES gives you the ability to query your logs quickly to look for potential security threats. The Lambda function loads the data into your Amazon ES cluster. Amazon ES also includes a tool named Kibana, which helps with managing data and creating visualizations.

Step 7: Create an Elasticsearch cluster

  1. In your central Logging Account, navigate to the Elasticsearch Service in the AWS Console.
  2. Select Create Cluster, enter a domain name for your cluster, and choose version 3 from the Elasticsearch version dropdown. Choose Next.In this example, don’t implement any security policies for your cluster and only use one instance. For any real-world production tasks, keep your Elasticsearch Cluster inside your VPC.
  3. For network configuration, choose Public access and choose Next.
  4. For the access policy, and this tutorial, only allow access to the domain from a specified Account ID or ARN address. In this case, use your Account ID to gain access.
  5. Choose Next and on the final screen and confirm. You generally want to create strict access policies for your domain and not allow public access. This example only uses these settings to quickly demonstrate the capabilities of AWS services. I would never recommend this in a production environment.

AWS takes a few minutes to finish and activate your Amazon ES. Once it goes live, you can see two endpoints. The Endpoint URL is the URL you use to send data to the cluster.

Step 8: Create a Lambda function to copy log files

Add an event trigger to your central logs bucket. This trigger tells your Lambda function to write the data from the log file to Amazon ES. Before you create the S3 trigger, create a Lambda function in your logging account to handle the events.

For this Lambda function, we use code from the aws-samples GitHub repository that streams data from an S3 file line by line into Amazon ES. This example uses code taken from amazon-elasticsearch-lambda-samples. Name your new Lambda function myS3toES.

  1. Copy and paste the following code into a text file named js:
    exports.handler = (event, context, callback) => {
        // get the source bucket name
        var _srcBucket = event.Records[0].s3.bucket.name;
            // get the object key of the file that landed on S3
        let _key = event.Records[0].s3.object.key;
        
        // split the key by "/"
        let _keySplit = _key.split("/")
            // get the object name
        let _objName = _keySplit[ (_keySplit.length - 1) ];
            // reset the destination path
        let _destPath = _keySplit[0]+"/"+_keySplit[1]+"/<kinesis_firehose_name>/"+_objName;
            // setup the source path
        let _sourcePath = _srcBucket + "/" + _key;
            // build the params for the copyObject request to S3
        let params = { Bucket: destBucket, ACL: "bucket-owner-full-control", CopySource: _sourcePath, Key: _destPath };
            // execute the copyObject request
        s3.copyObject(params, function(err, data) {
            if (err) {
                console.log(err, err.stack);
            } else {
                console.log("SUCCESS!");
            }
        });
        callback(null, 'All done!');
    };

  2. Copy and paste this code into a text file and name it json:
    //JSON Document
    {
      "name": "s3toesfunction",
      "version": "1.0.0",
      "description": "",
      "main": "index.js",
      "scripts": {},
      "author": "",
      "dependencies": {
        "byline": "^5.0.0",
        "clf-parser": "0.0.2",
        "path": "^0.12.7",    "stream": "0.0.2"
      }
    }

  3. Execute the following command in the folder containing these files:> npm install
  4. After the installation completes, create a .zip file that includes the js file and the node_modules folder.
  5. Log in to your logging account.
  6. Upload your .zip file to the Lambda function. For Code entry type, choose Upload a .zip file.
  7. This Lambda function needs an appropriate service role with a trust relationship to S3. Choose Edit trust relationships and add amazonaws.com and lambda.amazonaws.com as trusted entities.
  8. Set up your IAM role with the following permissions: S3 Read Only permissions and Lambda Basic Execution. To grant the role the appropriate access, assign it to the Lambda function from the Lambda Execution Role section in the console.
  9. Set Environment variables for your Lambda function so it knows where to send the data. Add an endpoint and use the endpoint URL you created in Step 7. Add an index and enter your index name. Add a value for region and detail the Region where you deployed your application.

Step 9: Create an S3 trigger

After creating the Lambda function, create the event triggers on your S3 bucket to execute that function. This completes your log delivery pipeline to Amazon ES. This is a common pipeline architecture for streaming data from S3 into Amazon S3.

  1. Log in to your central logging account.
  2. Navigate to the S3 console, select your bucket, then open the Properties pane and scroll down to Events.
  3. Choose Add notification and name your new event s3toLambdaToEs.
  4. Under Events, select the check box for PUT. Leave Prefix and Suffix
  5. Under Send to, select Lambda Function, and enter the name of the Lambda function that you created in the previous step—in this example, myS3toES.
  6. Choose Save.

With this complete, Lambda should start sending data to your Elasticsearch index whenever you access your web application.

Step 10: Configure Amazon ES

Your pipeline now automatically adds data to your Elasticsearch cluster. Next, use Kibana to visualize the AWS WAF logs in the central logging account’s S3 bucket. This is the final step in assembling your forensic investigation architecture.

Kibana provides tools to create visualizations and dashboards that help your security teams view log data. Using the log data, you can filter by IP address to see how many times an IP address has hit your firewall each month. This helps you track usage anomalies and isolate potentially malicious IP addresses. You can use this information to add web ACL rules to your firewall that adds extra protection against those IP addresses.

Kibana produces visualizations like the following screenshot.

In addition to the Number of IPs over Time visualization, you can also correlate the IP address to its country of origin. Correlation provides even more precise filtering for potential web ACL rules to protect against attackers. The visualization for that data looks like the following image.

Elasticsearch setup

To set up and visualize your AWS WAF data, follow this How to analyze AWS WAF logs using Amazon Elasticsearch Service post. With this solution, you can investigate your global dataset instead of isolated Regions.

An alternative to Amazon ES

Amazon ES is an excellent tool for forensic work because it provides high-performance search capability for large datasets. However, Amazon ES requires cluster management and complex capacity planning for future growth. To get top-notch performance from Amazon ES, you must adequately scale it. With the more straightforward data of these investigations, you could instead work with more traditional SQL queries.

Forensic data grows quickly, so using a relational database means you might quickly outgrow your capacity. Instead, take advantage of AWS serverless technologies like AWS Glue, Athena, and Amazon QuickSight. These technologies enable forensic analysis without the operational overhead you would experience with Elasticsearch or a relational database. To learn more about this option, consult posts like How to extract, transform, and load data from analytic processing using AWS Glue and Work with partitioned data in AWS Glue.

Athena query

With your forensic tools now in place, you can use Athena to query your data and analyze the results. This lets you refine the data for your Kibana visualizations, or directly load it into Amazon QuickSight for additional visualization. Use the Athena console to experiment until you have the best query for your visual needs. Having the database in your AWS Glue Catalog means you can make ad hoc queries in Athena to inspect your data.

In the Athena console, create a new Query tab and enter the following query:

# SQL Query
SELECT date_format(from_unixtime("timestamp"/1000), '%Y-%m-%d %h:%i:%s') as event_date, client_ip, country, account_id, waf_name, region FROM "paritionedlogdata"."waf_logs_transformed" where year='2018' and month='12';

Replace <your-database-name> and <your-table-name> with the appropriate values for your environment. This query converts the numerical timestamp to an actual date format using the SQL according to Presto 0.176 documentation. It should return the following results.

You can see which IP addresses hit your environment the most over any period of time. In a production environment, you would run an ETL job to re-partition this data and transform it into a columnar format optimized for queries. If you would like more information about doing that, see the Best Practices When Using Athena with AWS Glue post.

Amazon QuickSight visualization

Now that you can query your data in Athena, you can visualize the results using Amazon QuickSight. First, grant Amazon QuickSight access to the S3 bucket where your Athena query results live.

  1. In the Amazon QuickSight console, log in.
  2. Choose Admin/username, Manage QuickSight.
  3. Choose Account settings, Security & permissions.
  4. Under QuickSight access to AWS services, choose Add or remove.
  5. Choose Amazon S3, then choose Select S3 buckets.
  6. Choose the output bucket for your central AWS WAF logs. Also, choose your Athena query results bucket. The query results bucket begins with aws-athena-query-results-*.

Amazon QuickSight can now access the data sources. To set up your visualizations, follow these steps:

  1. In the QuickSight console, choose Manage data, New data set.
  2. For Source, choose Athena.
  3. Give your new dataset a name and choose Validate connection.
  4. After you validate the connection, choose Create data source.
  5. Select Use custom SQL and give your SQL query a name.
  6. Input the same query that you used earlier in Athena, and choose Confirm query.
  7. Choose Import to SPICE for quicker analytics, Visualize.

Allow Amazon QuickSight several minutes. It alerts you after completing the import.

Now that you have imported your data into your analysis, you can apply a visualization:

  1. In Amazon QuickSight, New analysis.
  2. Select the last dataset that you created earlier and choose Create analysis.
  3. At the bottom left of the screen, choose Line Chart.
  4. Drag and drop event_date to the X-Axis
  5. Drag and drop client_ip to the ValueThis should create a visualization similar to the following image.
  6. Choose the right arrow at the top left of the visualization and choose Hide “other” categories.This should modify your visualization to look like the following image.

You can also map the countries from which the requests originate, allowing you to track global access anomalies. You can do this in QuickSight by selecting the “Points on map” visualization type and choosing the country as the data point to visualize.

You can also add a count of IP addresses to see if you have any unusual access patterns originating from specific IP addresses.

Conclusion

Although Amazon ES and Amazon QuickSight offer similar final results, there are trade-offs to the technical approaches that I highlighted. If your use case requires the analysis of data in real time, then Amazon ES is more suitable for your needs. If you prefer a serverless approach that doesn’t require capacity planning or cluster management, then the solution with AWS Glue, Athena, and Amazon QuickSight is more suitable.

In this post, I described an easy way to build operational dashboards that track key metrics over time. Doing this with AWS Glue, Athena, and Amazon QuickSight relieves the heavy lifting of managing servers and infrastructure. To monitor metrics in real time instead, the Amazon ES solution provides a way to do this with little operational overhead. The key here is the adaptability of the solution: putting different services together can provide different solutions to your problems to fit your exact needs.

For more information and use cases, see the following resources:

Hopefully, you have found this post informative and the proposed solutions intriguing. As always, AWS welcomes all feedback or comment.

 


About the Authors

Aaron Franco is a solutions architect at Amazon Web Services .

 

 

 

 

 

 

 

 

Setting alerts in Amazon Elasticsearch Service

Post Syndicated from Jon Handler original https://aws.amazon.com/blogs/big-data/setting-alerts-in-amazon-elasticsearch-service/

Customers often use Amazon Elasticsearch Service for log analytics. Amazon ES lets you collect logs from your infrastructure, transform each log line into a JSON document, and send those documents to the bulk API.

A transformed log line contains many fields, each containing values. For instance, an Apache web log line includes a source IP address field, a request URL field, and a status code field (among others). Many users build dashboards—using Kibana—to monitor their infrastructure visually, surfacing application usage, bugs, or security problems evident from the data in those fields. For example, you can graph the count of HTTP 5xx status codes, then watch and react to changes. If you see a sudden jump in 5xx codes, you likely have a server issue. But with this system, you must monitor Kibana manually.

On April 8, Amazon ES launched support for event monitoring and alerting. To use this feature, you work with monitors—scheduled jobs—that have triggers, which are specific conditions that you set, telling the monitor when it should send an alert. An alert is a notification that the triggering condition occurred. When a trigger fires, the monitor takes action, sending a message to your destination.

This post uses a simulated IoT device farm to generate and send data to Amazon ES.

Simulator overview

This simulation consists of several important parts: sensors and devices.

Sensors

The core class for the simulator is the sensor. Devices have sensors that simulate different patterns of floating-point values. When called, each sensor’s report method updates and returns the value of its sensor. There are several subclasses for Sensor:

  • SineSensor: Produces a sin wave, based on the current timestamp.
  • ConstSensor: Produces a constant value. The class includes a random “fuzz” factor to drift around a particular value.
  • DriftingSensor: Allows continuous, random drift with a starting value.
  • MonotonicSensor: Increments its value by a constant delta, with random fuzz.

For this post, I used MonotonicSensor, whose value constantly increases, to force a breach in an alert that I set up.

You can identify a sensor by a universally unique identifier (UUID) and a label for the metric that it tracks. The report function for the sensor class returns a timestamp, the UUID of the sensor, the metric label, and the metric’s value at that instant.

Devices

Devices are collections of sensors. For this post, I created a collection of devices that simulate IoT devices in a field, measuring the temperature and humidity, and sending the CPU of the device. Each has a report method that recursively calls the report methods for all their sensors, returning a collection of the sensor reports. I made the code available in the Open Distro for Elasticsearch sample code repository on GitHub.

I set the CPU sensor of one device to drift constantly upward, simulating a problem in the device. You can see the intended “bad behavior” in the following line graph:

In the next sections, I set up an alert at 90% CPU so that I can catch and correct the situation.

Prerequisites

To follow along with this solution, you need an AWS account. Set up your own Amazon ES domain, to form the basis of your monitors and alerts.

Step 1: Set up your destination

When you create alerts in Amazon ES, you assign one destination or multiple. A destination is a delivery channel, where your domain sends notifications when your alerts trigger. You can use Amazon SNS, your Slack channel, or Amazon Chime as your destination. Or, you can set up a custom webhook (a URL) to receive messages. You set up headers and the message body, and Amazon ES Alerts posts the message to the destination URL.

I use SNS to receive alerts from my Amazon ES domain in this example, but Amazon ES provides many options for setting up your topic and subscription. I created the topic to receive notifications and subscribe to the topic for email delivery. Your SNS topic can have many subscriptions, supporting delivery via HTTP/S endpoint, email, Amazon SQS, AWS Lambda, and SMS.

To set up your destination, navigate to the AWS Management Console. Sign in and open the SNS console.

Choose Topics, Create Topic.

In the Create topic page, fill out values for Name and Display name. I chose sensor-alerting for both. Choose Create topic.

Now subscribe to your topic. You can do this from the topic page, as the console automatically returns you there when you complete topic creation. You can also subscribe from the Subscriptions tab in the left navigation pane. From the topic page, choose Create subscription.

On the Create Subscription page, for Protocol, choose Email. Fill in your email address in the Endpoint box and choose Create subscription. Make a note of the Topic ARN here, as you refer to it again later.

Finally, confirm your subscription by clicking the confirmation link in the email that SNS sends to you.

Step 2: Set up a role

To let Amazon ES publish alerts to your topic, create an IAM role with the proper permissions. Before you get started, copy the Topic ARN from the SNS topic page in Step 1.

Your role has two components: trusted entities and permissions for entities that assume the role. The console doesn’t support creating a role with Amazon ES as a trusted entity. Create a role with EC2 as the trusted entity and then edit the JSON trust document to change the entity.

In the AWS Management Console, open the IAM console and choose Roles, Create role.

On the Create role page, choose AWS Service and EC2. Choose Next: Permissions.

On the permissions page, choose Create policy. This brings you to a new window to create the policy. Don’t close the old tab, as you return to it in a moment.

The policy that you create in this step defines the permissions for entities that assume the role. Add a policy document that allows various entities (Amazon ES in this case) to publish to your SNS topic.

On the Create policy page, choose the JSON tab and copy-paste to replace the JSON text with the following code. Replace the sns-topic-arn in the code with the ARN for the topic that you created earlier. After you have done this, choose Review policy.

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": "sns:Publish",
    "Resource": "sns-topic-arn"
  }]
}

On the Review policy page, give your policy a name. I chose SensorAlertingPolicy in this example. Choose Create policy.

Return to the Create role window or tab. Use the refresh button to reload the policies and type the name of your policy in the search box. Select the check box next to your policy. Choose Next: Tags, then choose Next: Review. You can also add tags to make your role easier to search.

On the Review page, give your role a name. I used SensorAlertingRole in this example. Choose Create role.

To change the trusted entity for the role to Amazon ES, in the IAM console, choose Roles. Type SensorAlertingRole in the search box, and choose the link (not the check box) to view that role. Choose Trust relationships, Edit trust relationship.

Edit the Policy Document code to replace ec2.amazonaws.com with es.amazonaws.com. Your completed policy document should look like the following code example:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "es.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Choose Update Trust Policy. Make a note of your role ARN, as you refer to it again.

Step 3: Set up Amazon ES alerting

I pointed my IoT sensor simulator at my Amazon ES domain. This creates data that serves as the basis for the monitors and alerts. To do this yourself, navigate to your Kibana endpoint in your browser and choose Alerting in the left navigation pane. At the top of the window, choose Destinations, Add Destination.

In the Add Destination dialog, give your destination a name. For Type, choose SNS, and set the SNS topic ARN to the topic ARN that you created in Step 1. Set the IAM role ARN to the role ARN that you created in Step 2. Choose Create. You can set as many destinations as you like, allowing you to alert multiple people in the event of a problem.

Step 4: Set up a monitor

Monitors in Open Distro for Amazon ES allow you to specify a value to monitor. You can select the value either graphically or by specifying an Amazon ES query. You define a monitor first and then define triggers for the monitored value.

In Kibana, choose Monitors, Create Monitor.

Give your monitor a name. I named my monitor Device CPUs. You can set the frequency to one of the predefined intervals, or use a cron expression for more granular control. I chose Every 1 minute.

Scroll to the Define Monitor section of the page. Use this set of controls to specify the value to monitor. You can enter a value for Index or Indexes, Time field, and a target value. Choose Define using visual graph from the How do you want to define your monitor? list. You can also enter information for Define using extraction query, allowing you to provide a query that produces a value to monitor. For simple thresholds, the visual interface is fast and easy.

Select the Index value to monitor from the list. The list contains individual indexes. To use a wildcard, you can also type in the text box. For the value to register, you must press Enter after typing the index name (for example, “logs-*” <enter>).

Choose a value for Time field from the list. This reveals several selectors on top of a graph. Choose Count() and open the menu to see the aggregations for computing the value. Choose max(), then choose CPU for Select a field. Finally, set FOR THE LAST to 5 minute(s). Choose Create.

You can create your monitor visually or provide a query to produce the value to monitor.

I chose the logs-* index to monitor the max value of the CPU field, but this doesn’t create a trigger yet. Choose Create. This brings you to the Define Trigger page.

Step 5: Create a trigger

To create a trigger, specify the threshold value for the field that you’re monitoring. When the value of the field exceeds the threshold, the monitor enters an Active state. I created a trigger called CPU Too High, with a threshold value of 90 and a severity level of 1.

When you set the trigger conditions, set the action or actions that Amazon ES performs.

To add actions, scroll through the page. I added one action to send a message to my SNS topic—including the monitor name, trigger, severity, and the period over which the alarm has been active. You can use Mustache scripting to create a template for the message that you receive.

After you finish adding actions, choose Create at the bottom of the page.

Wrap up

When you return to the Alerting Dashboard, your alert appears in the Completed state. Alerts can exist in a variety of states. Completed signals that the monitor successfully queried your target, and that the trigger is not engaged.

To send the alert into the Active state, I sent simulated sensor data with a failing device whose CPU ramped up from 50% to 100%. When it hit 90%, I received the following email:

Conclusion

In this post, I demonstrated how Amazon ES alerting lets you monitor the critical data in your log files so that you can respond quickly when things start to go wrong. By identifying KPIs, setting thresholds, and distributing alerts to your first responders, you can improve your response time for critical issues.

If you have questions or feedback, leave them below, or reach out on Twitter!


About the Author

Jon Handler (@_searchgeek) is a Principal Solutions Architect at Amazon Web Services based in Palo Alto, CA. Jon works closely with the CloudSearch and Elasticsearch teams, providing help and guidance to a broad range of customers who have search workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included four years of coding a large-scale, eCommerce search engine.

 

 

 

 

 

Alerting, monitoring, and reporting for PCI-DSS awareness with Amazon Elasticsearch Service and AWS Lambda

Post Syndicated from Michael Coyne original https://aws.amazon.com/blogs/security/alerting-monitoring-and-reporting-for-pci-dss-awareness-with-amazon-elasticsearch-service-and-aws-lambda/

Logging account activity within your AWS infrastructure is paramount to your security posture and could even be required by compliance standards such as PCI-DSS (Payment Card Industry Security Standard). Organizations often analyze these logs to adapt to changes and respond quickly to security events. For example, if users are reporting that their resources are unable to communicate with the public internet, it would be beneficial to know if a network access list had been changed just prior to the incident. Many of our customers ship AWS CloudTrail event logs to an Amazon Elasticsearch Service cluster for this type of analysis. However, security best practices and compliance standards could require additional considerations. Common concerns include how to analyze log data without the data leaving the security constraints of your private VPC.

In this post, I’ll show you not only how to store your logs, but how to put them to work to help you meet your compliance goals. This implementation deploys an Amazon Elasticsearch Service domain with Amazon Virtual Private Cloud (Amazon VPC) support by utilizing VPC endpoints. A VPC endpoint enables you to privately connect your VPC to Amazon Elasticsearch without requiring an internet gateway, NAT device, VPN connection, or AWS Direct Connect connection. An AWS Lambda function is used to ship AWS CloudTrail event logs to the Elasticsearch cluster. A separate AWS Lambda function performs scheduled queries on log sets to look for patterns of concern. Amazon Simple Notification Service (SNS) generates automated reports based on a sample set of PCI guidelines discussed further in this post and notifies stakeholders when specific events occur. Kibana serves as the command center, providing visualizations of CloudTrail events that need to be logged based on the provided sample set of PCI-DSS compliance guidelines. The automated report and dashboard that are constructed around the sample PCI-DSS guidelines assist in event awareness regarding your security posture and should not be viewed as a de facto means of achieving certification. This solution serves as an additional tool to provide visibility in to the actions and events within your environment. Deployment is made simple with a provided AWS CloudFormation template.
 

Figure 1: Architectural diagram

Figure 1: Architectural diagram

The figure above depicts the architecture discussed in this post. An Elasticsearch cluster with VPC support is deployed within an AWS Region and Availability Zone. This creates a VPC endpoint in a private subnet within a VPC. Kibana is an Elasticsearch plugin that resides within the Elasticsearch cluster, it is accessed through a provided endpoint in the output section of the CloudFormation template. CloudTrail is enabled in the VPC and ships CloudTrail events to both an S3 bucket and CloudWatch Log Group. The CloudWatch Log Group triggers a custom Lambda function that ships the CloudTrail Event logs to the Elasticsearch domain through the VPC endpoint. An additional Lambda function is created that performs a periodic set of Elasticsearch queries and produces a report that is sent to an SNS Topic. A Windows-based EC2 instance is deployed in a public subnet so users will have the ability to view and interact with a Kibana dashboard. Access to the EC2 instance can be restricted to an allowed CIDR range through a parameter set in the CloudFormation deployment. Access to the Elasticsearch cluster and Kibana is restricted to a Security Group that is created and is associated with the EC2 instance and custom Lambda functions.

Sample PCI-DSS Guidelines

This solution provides a sample set of (10) PCI-DSS guidelines for events that need to be logged.

  • All Commands, API action taken by AWS root user
  • All failed logins at the AWS platform level
  • Action related to RDS (configuration changes)
  • Action related to enabling/disabling/changing of CloudTrail, CloudWatch logs
  • All access to S3 bucket that stores the AWS logs
  • Action related to VPCs (creation, deletion and changes)
  • Action related to changes to SGs/NACLs (creation, deletion and changes)
  • Action related to IAM users, roles, and groups (creation, deletion and changes)
  • Action related to route tables (creation, deletion and changes)
  • Action related to subnets (creation, deletion and changes)

Solution overview

In this walkthrough, you’ll create an Elasticsearch cluster within an Amazon VPC environment. You’ll ship AWS CloudTrail logs to both an Amazon S3 Bucket (to maintain an immutable copy of the logs) and to a custom AWS Lambda function that will stream the logs to the Elasticsearch cluster. You’ll also create an additional Lambda function that will run once a day and build a report of the number of CloudTrail events that occurred based on the example set of 10 PCI-DSS guidelines and then notify stakeholders via SNS. Here’s what you’ll need for this solution:

To make it easier to get started, I’ve included an AWS CloudFormation template that will automatically deploy the solution. The CloudFormation template along with additional files can be downloaded from this link. You’ll need the following resources to set it up:

  • An S3 bucket to upload and store the sample AWS Lambda code and sample Kibana dashboards. This bucket name will be requested during the CloudFormation template deployment.
  • An Amazon Virtual Private Cloud (Amazon VPC).

If you’re unfamiliar with how CloudFormation templates work, you can find more info in the CloudFormation Getting Started guide.

AWS CloudFormation deployment

The following parameters are available in this template.

ParameterDefaultDescription
Elasticsearch Domain NameName of the Amazon Elasticsearch Service domain.
Elasticsearch Version6.2Version of Elasticsearch to deploy.
Elasticsearch Instance Count3The number of data nodes to deploy in to the Elasticsearch cluster.
Elasticsearch Instance ClassThe instance class to deploy for the Elasticsearch data nodes.
Elasticsearch Instance Volume Size10The size of the volume for each Elasticsearch data node in GB.
VPC to launch intoThe VPC to launch the Amazon Elasticsearch Service cluster into.
Availability Zone to launch intoThe Availability Zone to launch the Amazon Elasticsearch Service cluster into.
Private Subnet IDThe subnet to launch the Amazon Elasticsearch Service cluster into.
Elasticsearch Security GroupA new Security Group is created that will be associated with the Amazon Elasticsearch Service cluster.
Security Group DescriptionA description for the above created Security Group.
Windows EC2 Instance Classm5.largeWindows instance for interaction with Kibana.
EC2 Key PairEC2 Key Pair to associate with the Windows EC2 instance.
Public SubnetPublic subnet to associate with the Windows EC2 instance for access.
Remote Access Allowed CIDR0.0.0.0/0The CIDR range to allow remote access (port 3389) to the EC2 instance.
S3 Bucket Name—Lambda FunctionsS3 Bucket that contains custom AWS Lambda functions.
Private SubnetPrivate subnet to associate with AWS Lambda functions that are deployed within a VPC.
CloudWatch Log Group NameThis will create a CloudWatch Log Group for the AWS CloudTrail event logs.
S3 Bucket Name—CloudTrail loggingThis will create a new Amazon S3 Bucket for logging CloudTrail events. Name must be a globally unique value.
Date range to perform queriesnow-1d(examples: now-1d, now-7d, now-90d)
Lambda Subnet CIDRCreate a Subnet CIDR to deploy AWS Lambda Elasticsearch query function in to
Availability Zone—LambdaThe availability zone to associate with the preceding AWS Lambda Subnet
Email Address[email protected]Email address for reporting to notify stakeholders via SNS. You must accept the subscription by selecting the link sent to this address before alerts will arrive.

It takes 30-45 minutes for this stack to be created. When it’s complete, the CloudFormation console will display the following resource values in the Outputs tab. These values can be referenced at any time and will be needed in the following sections.

oElasticsearchDomainEndpointElasticsearch Domain Endpoint Hostname
oKibanaEndpointKibana Endpoint Hostname
oEC2InstanceWindows EC2 Instance Name used for Kibana access
oSNSSubscriberSNS Subscriber Email Address
oElasticsearchDomainArnArn of the Elasticsearch Domain
oEC2InstancePublicIpPublic IP address of the Windows EC2 instance

Managing and testing the solution

Now that you’ve set up the environment, it’s time to configure the Kibana dashboard.

Kibana configuration

From the AWS CloudFormation output, gather information related to the Windows-based EC2 instance. Once you have retrieved that information, move on to the next steps.

Initial configuration and index pattern

  1. Log into the Windows EC2 instance via Remote Desktop Protocol (RDP) from a resource that is within the allowed CIDR range for remote access to the instance.
  2. Open a browser window and navigate to the Kibana endpoint hostname URL from the output of the AWS CloudFormation stack. Access to the Elasticsearch cluster and Kibana is restricted to the security group that is associated with the EC2 instance and custom Lambda functions during deployment.
  3. In the Kibana dashboard, select Management from the left panel and choose the link for Index Patterns.
  4. Add one index pattern containing the following: cwl-*
     
    Figure 2: Define the index pattern

    Figure 2: Define the index pattern

  5. Select Next Step.
  6. Select the Time Filter Field named @timestamp.
     
    Figure 3: Select "@timestamp"

    Figure 3: Select “@timestamp”

  7. Select Create index pattern.

At this point we’ve launched our environment and have accessed the Kibana console. Within the Kibana console, we’ve configured the index pattern for the CloudWatch logs that will contain the CloudTrail events. Next, we’ll configure visualizations and a dashboard.

Importing sample PCI DSS queries and Kibana dashboard

  1. Copy the export.json from the location you extracted the downloaded zip file to the EC2 Kibana bastion.
  2. Select Management on the left panel and choose the link for Saved Objects.
  3. Select Import in upper right corner and navigate to export.json.
  4. Select Yes, overwrite all saved objects, then select Index Pattern cwl-* and confirm all changes.
  5. Once the import completes, select PCI DSS Dashboard to see the sample dashboard and queries.

Note: You might encounter an error during the import that looks like this:
 

Figure 4: Error message

Figure 4: Error message

This simply means that your streamed logs do not have login-type events in the time period since your deployment. To correct this, you can add a field with a null event.

  1. From the left panel, select Dev Tools and copy the following JSON into the left panel of the console:
    
            POST /cwl-/default/
            {
                "userIdentity": {
                    "userName": "test"
                }
            }              
     

  2. Select the green Play triangle to execute the POST of a document with the missing field.
     
    Figure 5: Select the "Play" button

    Figure 5: Select the “Play” button

  3. Now reimport the dashboard using the steps in Importing Sample PCI DSS Queries and Kibana Dashboard. You should be able to complete the import with no errors.

At this point, you should have CloudTrail events that have been streamed to the Elasticsearch cluster, with a configured Kibana dashboard that looks similar to the following graphic:
 

Figure 6: A configured Kibana dashboard

Figure 6: A configured Kibana dashboard

Automated Reports

A custom AWS Lambda function was created during the deployment of the Amazon CloudFormation stack. This function uses the sample PCI-DSS guidelines from the Kibana dashboard to build a daily report. The Lambda function is triggered every 24 hours and performs a series of Elasticsearch time-based queries of now-1day (the last 24 hours) on the sample guidelines. The results are compiled into a message that is forwarded to Amazon Simple Notification Service (SNS), which sends a report to stakeholders based on the email address you provided in the CloudFormation deployment.

The Lambda function will be named <CloudFormation Stack Name>-ES-Query-LambdaFunction. The Lambda Function enables environment variables such as your query time window to be adjusted or additional functionality like additional Elasticsearch queries to be added to the code. The below sample report allows you to monitor any events against the sample PCI-DSS guidelines. These reports can then be further analyzed in the Kibana dashboard.


    Logging Compliance Report - Wednesday, 11. July 2018 01:06PM
    Violations for time period: 'now-1d'
    
    All Failed login attempts
    - No Alerts Found
    All Commands, API action taken by AWS root user
    - No Alerts Found
    Action related to RDS (configuration changes)
    - No Alerts Found
    Action related to enabling/disabling/changing of CloudTrail CloudWatch logs
    - 3 API calls indicating alteration of log sources detected
    All access to S3 bucket that stores the AWS logs
    - No Alerts Found
    Action related to VPCs (creation, deletion and changes)
    - No Alerts Found
    Action related to changes to SGs/NACLs (creation, deletion and changes)
    - No Alerts Found
    Action related to changes to IAM roles, users, and groups (creation, deletion and changes)
    - 2 API calls indicating creation, alteration or deletion of IAM roles, users, and groups
    Action related to changes to Route Tables (creation, deletion and changes)
    - No Alerts Found
    Action related to changes to Subnets (creation, deletion and changes)
    - No Alerts Found         

Summary

At this point, you have now created a private Elasticsearch cluster with Kibana dashboards that monitors AWS CloudTrail events on a sample set of PCI-DSS guidelines and uses Amazon SNS to send a daily report providing awareness in to your environment—all isolated securely within a VPC. In addition to CloudTrail events streaming to the Elasticsearch cluster, events are also shipped to an Amazon S3 bucket to maintain an immutable source of your log files. The provided Lambda functions can be further modified to add additional or more complex search queries and to create more customized reports for your organization. With minimal effort, you could begin sending additional log data from your instances or containers to gain even more insight as to the security state of your environment. The more data you retain, the more visibility you have into your resources and the closer you are to achieving Compliance-on-Demand.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Michael Coyne

Michael is a consultant for AWS Professional Services. He enjoys the fast-paced environment of ever-changing technology and assisting customers in solving complex issues. Away from AWS, Michael can typically be found with a guitar and spending time with his wife and two young kiddos. He holds a BS in Computer Science from WGU.

How to analyze AWS WAF logs using Amazon Elasticsearch Service

Post Syndicated from Tom Adamski original https://aws.amazon.com/blogs/security/how-to-analyze-aws-waf-logs-using-amazon-elasticsearch-service/

Log analysis is essential for understanding the effectiveness of any security solution. It can be valuable for day-to-day troubleshooting and also for your long-term understanding of how your security environment is performing.

AWS WAF is a web application firewall that helps protect your web applications from common web exploits that could affect application availability, compromise security, or consume excessive resources. AWS WAF gives you control over which traffic to allow or block to your web applications by defining customizable web security rules.

With the release of access to full AWS WAF logs, you now have the ability to view all of the logs generated by AWS WAF while it’s protecting your web applications. In addition, you can use Amazon Kinesis Data Firehose to forward the logs to Amazon Simple Storage Service (Amazon S3) for archival, and to Amazon Elasticsearch Service to analyze access to your web applications.

In this blog post, I’ll show you how you can analyze AWS WAF logs using Amazon Elasticsearch Service (Amazon ES). I’ll walk you through how to find out in near-real time which AWS WAF rules get triggered, why, and by which request. I’ll also show you how to create a historical view of your web applications’ access trends for long-term analysis.

Architecture overview

When enabled, AWS WAF sends logs to Amazon Kinesis Data Firehose. From there, you can decide what to do with them next. In my architecture, I’ll forward the relevant logs to Amazon ES for analysis and to Amazon S3 for long-term storage.
 

Figure 1: Architecture overview

Figure 1: Architecture overview

In this blog, I will show you how to enable AWS WAF logging to Amazon ES in the following steps:

  1. Create an Amazon Elasticsearch Service domain.
  2. Configure Kinesis Data Firehose for log delivery.
  3. Enable AWS WAF logs on a web access control list (ACL) to send data to Kinesis Data Firehose.
  4. Analyze AWS WAF logs in Amazon ES.

Note: The region in which you should enable Kinesis Data Firehose must match the region where your AWS WAF web ACL is deployed. If you’re enabling logging for AWS WAF deployed on Amazon CloudFront, then Kinesis Data Firehose must be created in the Northern Virginia AWS Region.

Preparing Amazon Elasticsearch Service

If you don’t have your Amazon ES environment already running on AWS, you’ll need to create one. Start by navigating to the Elasticsearch Service from your AWS Console and choosing Create a new domain. There, you will be able to define the namespace for your Elasticsearch cluster and choose the version you’d like to use. In my setup, I’ll be using awswaf-logs as my namespace and version 6.3 of Elasticsearch.
 

Figure 2: Defining Amazon ES Domain

Figure 2: Defining Amazon ES Domain

You can also decide on other parameters, like the number of nodes in your cluster or how to access the dashboards. See the Amazon Elasticsearch Service Documentation to find out more details about setting up your environment, or review this previous blog post, which goes deeper into Amazon ES setup.

When your environment is ready, you’ll be able to click on your namespace and find the link to your Kibana dashboard. Kibana is the data visualisation plugin for Elasticsearch that you’ll use to analyze your logs.
 

Figure 3: Identifying the Kibana management URL

Figure 3: Identifying the Kibana management URL

While Elasticsearch can automatically classify most of the fields from the AWS WAF logs, you need to inform Elasticsearch how to interpret fields that have specific formatting. Therefore, before you start sending logs to Elasticsearch, you should create an index pattern template so Elasticsearch can distinguish AWS WAF logs and classify the fields correctly.

The pattern template I’m using below will be applied to all indexes beginning with the string awswaf-. That’s because in my setup all AWS WAF log index files created in Elasticsearch will always begin with the character string awswaf- followed by a date.

The pattern template is defining two fields from the logs. It will indicate to Elasticsearch that the httpRequest.clientIp field is using an IP address format and that the timestamp field is represented in epoch time. All the other log fields will be classified automatically.

In the template, I’m also setting the number of Elasticsearch shards that should be used for the index. Shards help with distributing large amount of data and their number depends on the size of the index—the larger the index the more shards it should be deployed across. I’m going to use only one shard since I expect my index to be less than 30 GB. For recommendations on sizing your shards appropriately, refer to Jon Handler’s blog post on using Amazon Elasticsearch Service.


PUT  _template/awswaf-logs
{
    "index_patterns": ["awswaf-*"],
    "settings": {
    "number_of_shards": 1
    },
    "mappings": {
      "waflog": {
        "properties": {
          "httpRequest": {
            "properties": {
              "clientIp": {
                "type": "keyword",
                "fields": {
                  "keyword": {
                    "type": "ip"
                  }
                }
              }
            }
          },
          "timestamp": {
            "type": "date",
            "format": "epoch_millis"
          }
      }
    }
  }
}

I’m going to apply this template to Elasticsearch through the Kibana developer interface (from the Dev Tools tab in the Kibana dashboard), but it can also be done through an API call to the Amazon ES endpoint.
 

Figure 4: Applying the index pattern template in Kibana

Figure 4: Applying the index pattern template in Kibana

The template will be applied to all log indexes starting with the awswaf- prefix. This is the prefix name you’ll use at a later stage in the Kinesis Data Firehose configuration.

Setting up Kinesis for log delivery

With Amazon ES now ready to receive AWS WAF logs, I’ll show you how to use Amazon Kinesis Data Firehose to deliver your logs to Amazon S3 and Amazon ES. This is a prerequisite before you enable logging in AWS WAF.

First, go to the Kinesis service in the AWS Console and under the Data Firehose tab choose Create delivery stream. If the Data Firehose stream is going to be used for WAF logs, its name must start with aws-waf-logs-. I’ll enable logging for the already existing web ACL that I’m using for OWASP top 10 protection, therefore I’ll name my delivery stream aws-waf-logs-owasp.

Keep all the settings at the default until you get to the Select destination page. On this page, you’ll configure Amazon Elasticsearch Service as the destination for your logs. Make sure you enter awswaf as the index name and waflog as the type to match the index pattern template you already created in Elasticsearch earlier.
 

Figure 5: Setting up Amazon ES as Data Firehose log destination

Figure 5: Setting up Amazon ES as Data Firehose log destination

On this page, you can also set what to back up to Amazon S3—either all records or just the failed ones. You’ll then need to decide on the bucket to use, specify the prefix, and then select Next.
 

Figure 6: Setting up S3 backup in Data Firehose

Figure 6: Setting up S3 backup in Data Firehose

On the next page, you’ll have the option to update the buffer conditions—size and interval. These indicate how often Kinesis Data Firehose will send new data to Amazon ES. The lower the values you set, the closer to real time your log delivery will be. In my example, I’ve configured a buffer size of 1 MB and a buffer interval of 60 seconds.
 

Figure 7: Setting up Data Firehose buffer conditions

Figure 7: Setting up Data Firehose buffer conditions

Next, you’ll need to specify an IAM role that enables Kinesis Data Firehose to send data to Amazon S3 and Amazon ES. You can select an existing IAM role or create a new one. For more information about creating a role, see Controlling Access with Amazon Kinesis Data Firehose.

Finally, review all the settings, and complete the creation of your Kinesis Data Firehose.

Enabling AWS WAF logging

Now that you have Kinesis Data Firehose prepared, you can enable logging for an existing AWS WAF Web ACL. You can do this in the AWS Console under the AWS WAF & AWS Shield service. Go to the Web ACLs tab and select the Web ACL for which you want to start logging. Then, under the Logging tab, choose Enable Logging.
 

Figure 8: Enabling logging for AWS WAF web ACL

Figure 8: Enabling logging for AWS WAF web ACL

On the next page, specify the Kinesis Data Firehose that the logs should be delivered to. If you can’t see any options in the drop-down list, make sure that the Kinesis Data Firehose you configured earlier is deployed in the same region as the web ACL you’re enabling logging for and that its name begins with aws-waf-logs-. Remember that for CloudFront web ACL deployments, the Kinesis Data Firehose should be created in the Northern Virginia Region.

Searching through logs

As soon as you log back into your Kibana dashboard for Amazon ES, you should see a new index appear called awswaf-<date>. You’ll need to define an index pattern that will encompass all the potential awswaf log entries. Configure your index pattern description as awswaf-*, and on the next screen select timestamp as the time filter.
 

Figure 9: Configuring the Kibana index pattern

Figure 9: Configuring the Kibana index pattern

 

Figure 10: Configuring a time filter for the index pattern

Figure 10: Configuring a time filter for the index pattern

After you complete the setup of your index pattern, you’ll be able to run searches through your logs by going into the Discover tab in Kibana. The searches can be based on any fields in the AWS WAF logs. For example, you can look for specific HTTP headers, query strings, or source IP addresses to find out what action has been applied to them. To find out more about what fields are available in the AWS WAF logs, see the AWS WAF Developer Guide.

In the example search that follows, I’m looking for all of the log entries where a web ACL has blocked the traffic and the HTTP arguments were equal to name=badname.
 

Figure 11: An example ad-hoc search

Figure 11: An example ad-hoc search

This approach can help you identify specific request based on their parameters, such as a cookie, host header, query string, and understand why they are being blocked (or allowed) and by which web ACL. It is especially useful for tuning AWS WAF to match your web application.

Building a dashboard

You can also use Amazon ES to create custom graphs to look at long-term trends—and you can combine your graphs into a dashboard. For example, you may want to see the number of blocked and allowed requests by IP address to identify potentially hostile IP addresses.

To understand the breakdown of the requests per IP address, you should first set up a graph showing the number of requests that were blocked and allowed by AWS WAF per source. To do this, go to the Visualize section of your Kibana dashboard and create a new visualization. You can select from a variety of visualization options, but I’m going to use the horizontal bar chart for the example that follows.
 

Figure 12: Configuring primary bucket parameters

Figure 12: Configuring primary bucket parameters

  1. Select your index and then configure your graph options. For a horizontal bar chart, leave the Y axis to represent the count of requests, but expand the primary bucket (X-Axis) form and change the fields to show the source IP address and action:
    1. From the Aggregation dropdown, select Terms.
    2. From the Field dropdown, select httpRequest.clientIP.
    3. From the Order By dropdown, select metric: Count.
       
      Figure 13: Configuring bucket series

      Figure 13: Configuring bucket series

  2. Next, expand the Split Series section and adjust the following dropdown fields:
    1. From the Sub Aggregation dropdown, select Terms.
    2. From the Field dropdown select action.keyword.
    3. From the Order By dropdown, select metric: Count.
       
      Figure 14: Sample Kibana graphs

      Figure 14: Sample Kibana graphs

    4. Apply your changes by selecting the play button at the top of the page. You should see the results appear on the right side of the screen.
       
      Figure 15: Select the "Apply changes" button

      Figure 15: Select the “Apply changes” button

    You can combine graphs into dashboards that aggregate relevant information about AWS WAF functionality. In the sample dashboard below, I’m showing multiple visualisations focusing on different dimensions like request per country, IP address, and even which WAF rules get the most hits.
     

    Figure 16: You can combine graphs

    Figure 16: You can combine graphs

    I’ve made the sample dashboard and sample visualizations available for download. They can be imported to your Kibana instance in the Management/SavedObjects section of the dashboard.

    Conclusion

    The ability to access AWS WAF logs for any web ACL allows you to start analyzing the requests coming to your web applications. In this post, I’ve covered an example of how you can use Amazon Elasticsearch Service as a destination for AWS WAF logs and shown you how to run searches on the log data as well as build graphs and dashboards. Amazon ES is just one of the destination options for native log delivery. To learn more, please refer to the documentation on how to send data directly to Amazon S3, Amazon Redshift, and Splunk.

    If you have feedback about this blog post, submit comments in the Comments section below. If you have questions about this blog post, start a new thread on the WAF forum or contact AWS Support.

    Want more AWS Security news? Follow us on Twitter.

    Tom Adamski

    Tom is a Specialist Solutions Engineer in Networking. He helps customers across the EMEA region design their network environments on AWS. He has over 10 years of experience in the networking and security industries. In his spare time, he’s an avid surfer who’s always on the lookout for cold water surf breaks.

Viewing Amazon Elasticsearch Service Error Logs

Post Syndicated from Kevin Fallis original https://aws.amazon.com/blogs/big-data/viewing-amazon-elasticsearch-service-error-logs/

Today, Amazon Elasticsearch Service (Amazon ES) announces support for publishing error logs to Amazon CloudWatch Logs.  This new feature provides you with the ability to capture error logs so you can access information about errors and warnings raised during the operation of the service. These details can be useful for troubleshooting. You can then use this information to work with your users to identify patterns that cause error or warning scenarios on your domain.

Access to the feature is enabled as soon as your domain is created.

You can turn the logs on and off at will, paying only for the CloudWatch charges based on their usage.

Set up delivery of error logs for your domain

To enable error logs for an active domain, sign in to the AWS Management Console and choose Elasticsearch Service. On the Amazon ES console, choose your domain name in the list to open its dashboard. Then choose the Logs tab.

In this pane, you configure your Amazon ES domain to publish search slow logs, indexing slow logs, and error logs to a CloudWatch Logs log group. You can find more information on setting up slow logs in the blog post Viewing Amazon Elasticsearch Service Slow Logs on the AWS Database Blog.

Under Set up Error Logs, choose Setup.

You can choose to Create new log group or Use existing log group. We recommend naming your log group as a path, such as:

/aws/aes/domains/mydomain/application-logs/

This naming scheme makes it easier to apply a CloudWatch access policy, in which you can grant permissions to all log groups under a specific path, such as:

/aws/aes/domains

To deliver logs to your CloudWatch Logs group, you need to specify a policy for Amazon ES so it can publish to CloudWatch Logs on your behalf.  You can choose to Create a new policy or Select an existing policy. You can accept the policy as is. Or, if your log group names are paths, you can widen the Resource—for example:

arn:aws:logs:us-east-1:123456789012:log-group:/aws/aes/domains/*

You can then reuse this policy for all your domains.

Once you have saved the policy for the domain, Choose Enable, and you have completed setup. Your domain can now send error logs to CloudWatch Logs.

Now that you have enabled the publishing of error logs, you can start monitoring them.

Types of events captured

Elasticsearch uses Apache Log4j 2 and its built-in log levels (from least to most severe) of TRACE, DEBUG, INFO, WARN, ERROR, and FATAL. After you enable error logs, Amazon ES publishes log lines of WARN, ERROR, and FATAL to CloudWatch. Less severe levels (INFO, DEBUG and TRACE) are not available.

Based on this, you can expect to find details for events such as the ones highlighted in the following list.

  • Rejects based on exceeding the configured highlight.max_analyzed_offset parameter limit
  • Painless script compilation issues in a request
  • Detailed information about invalid requests and invalid query formats
  • GC cycles
  • Detailed information about write blocks
  • Issues encountered during snapshot exercises

View your log data

To see your log data, sign in to the AWS Management Console, and open the CloudWatch console. In the left navigation pane, choose the Logs tab. Find your log group in the list of groups and open the log group. Your log group name is the Name that you set when you set up logging in the Amazon ES wizard.

Within your log group, you should see a number of log streams.

Amazon ES creates es-test-log-stream during setup of error logs to ensure that it can write to CloudWatch Logs. This stream contains only a single test message.

Your application error logs arrive within 30 minutes and have long hex names, suffixed by es-application-logs to indicate the source of the log data. Choose one of these to view events based on the last event time.

You should see individual entries for each event in timestamp order. To switch from less granular detail to highly granular detail on the event log entry, you can use a toggle at the top right of the CloudWatch Logs console. The format is a timestamp, the locus, the node generating the error or warning, and text cleansed of any specifics on the cluster itself such as you see in this stack trace.

Conclusion

By enabling the error logs feature, you can gain more insight into issues with your Amazon ES domains and identify issues with domain configurations.  Additionally, you can also use the integration of CloudWatch Logs and Amazon ES to send application error logs to a different Amazon ES domain and monitor your domain’s performance.


About the Author

Kevin Fallis is an AWS solutions architect specializing in search technologies.

 

 

 

 

AWS Online Tech Talks – May and Early June 2018

Post Syndicated from Devin Watson original https://aws.amazon.com/blogs/aws/aws-online-tech-talks-may-and-early-june-2018/

AWS Online Tech Talks – May and Early June 2018  

Join us this month to learn about some of the exciting new services and solution best practices at AWS. We also have our first re:Invent 2018 webinar series, “How to re:Invent”. Sign up now to learn more, we look forward to seeing you.

Note – All sessions are free and in Pacific Time.

Tech talks featured this month:

Analytics & Big Data

May 21, 2018 | 11:00 AM – 11:45 AM PT Integrating Amazon Elasticsearch with your DevOps Tooling – Learn how you can easily integrate Amazon Elasticsearch Service into your DevOps tooling and gain valuable insight from your log data.

May 23, 2018 | 11:00 AM – 11:45 AM PTData Warehousing and Data Lake Analytics, Together – Learn how to query data across your data warehouse and data lake without moving data.

May 24, 2018 | 11:00 AM – 11:45 AM PTData Transformation Patterns in AWS – Discover how to perform common data transformations on the AWS Data Lake.

Compute

May 29, 2018 | 01:00 PM – 01:45 PM PT – Creating and Managing a WordPress Website with Amazon Lightsail – Learn about Amazon Lightsail and how you can create, run and manage your WordPress websites with Amazon’s simple compute platform.

May 30, 2018 | 01:00 PM – 01:45 PM PTAccelerating Life Sciences with HPC on AWS – Learn how you can accelerate your Life Sciences research workloads by harnessing the power of high performance computing on AWS.

Containers

May 24, 2018 | 01:00 PM – 01:45 PM PT – Building Microservices with the 12 Factor App Pattern on AWS – Learn best practices for building containerized microservices on AWS, and how traditional software design patterns evolve in the context of containers.

Databases

May 21, 2018 | 01:00 PM – 01:45 PM PTHow to Migrate from Cassandra to Amazon DynamoDB – Get the benefits, best practices and guides on how to migrate your Cassandra databases to Amazon DynamoDB.

May 23, 2018 | 01:00 PM – 01:45 PM PT5 Hacks for Optimizing MySQL in the Cloud – Learn how to optimize your MySQL databases for high availability, performance, and disaster resilience using RDS.

DevOps

May 23, 2018 | 09:00 AM – 09:45 AM PT.NET Serverless Development on AWS – Learn how to build a modern serverless application in .NET Core 2.0.

Enterprise & Hybrid

May 22, 2018 | 11:00 AM – 11:45 AM PTHybrid Cloud Customer Use Cases on AWS – Learn how customers are leveraging AWS hybrid cloud capabilities to easily extend their datacenter capacity, deliver new services and applications, and ensure business continuity and disaster recovery.

IoT

May 31, 2018 | 11:00 AM – 11:45 AM PTUsing AWS IoT for Industrial Applications – Discover how you can quickly onboard your fleet of connected devices, keep them secure, and build predictive analytics with AWS IoT.

Machine Learning

May 22, 2018 | 09:00 AM – 09:45 AM PTUsing Apache Spark with Amazon SageMaker – Discover how to use Apache Spark with Amazon SageMaker for training jobs and application integration.

May 24, 2018 | 09:00 AM – 09:45 AM PTIntroducing AWS DeepLens – Learn how AWS DeepLens provides a new way for developers to learn machine learning by pairing the physical device with a broad set of tutorials, examples, source code, and integration with familiar AWS services.

Management Tools

May 21, 2018 | 09:00 AM – 09:45 AM PTGaining Better Observability of Your VMs with Amazon CloudWatch – Learn how CloudWatch Agent makes it easy for customers like Rackspace to monitor their VMs.

Mobile

May 29, 2018 | 11:00 AM – 11:45 AM PT – Deep Dive on Amazon Pinpoint Segmentation and Endpoint Management – See how segmentation and endpoint management with Amazon Pinpoint can help you target the right audience.

Networking

May 31, 2018 | 09:00 AM – 09:45 AM PTMaking Private Connectivity the New Norm via AWS PrivateLink – See how PrivateLink enables service owners to offer private endpoints to customers outside their company.

Security, Identity, & Compliance

May 30, 2018 | 09:00 AM – 09:45 AM PT – Introducing AWS Certificate Manager Private Certificate Authority (CA) – Learn how AWS Certificate Manager (ACM) Private Certificate Authority (CA), a managed private CA service, helps you easily and securely manage the lifecycle of your private certificates.

June 1, 2018 | 09:00 AM – 09:45 AM PTIntroducing AWS Firewall Manager – Centrally configure and manage AWS WAF rules across your accounts and applications.

Serverless

May 22, 2018 | 01:00 PM – 01:45 PM PTBuilding API-Driven Microservices with Amazon API Gateway – Learn how to build a secure, scalable API for your application in our tech talk about API-driven microservices.

Storage

May 30, 2018 | 11:00 AM – 11:45 AM PTAccelerate Productivity by Computing at the Edge – Learn how AWS Snowball Edge support for compute instances helps accelerate data transfers, execute custom applications, and reduce overall storage costs.

June 1, 2018 | 11:00 AM – 11:45 AM PTLearn to Build a Cloud-Scale Website Powered by Amazon EFS – Technical deep dive where you’ll learn tips and tricks for integrating WordPress, Drupal and Magento with Amazon EFS.

 

 

 

 

AWS AppSync – Production-Ready with Six New Features

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/aws-appsync-production-ready-with-six-new-features/

If you build (or want to build) data-driven web and mobile apps and need real-time updates and the ability to work offline, you should take a look at AWS AppSync. Announced in preview form at AWS re:Invent 2017 and described in depth here, AWS AppSync is designed for use in iOS, Android, JavaScript, and React Native apps. AWS AppSync is built around GraphQL, an open, standardized query language that makes it easy for your applications to request the precise data that they need from the cloud.

I’m happy to announce that the preview period is over and that AWS AppSync is now generally available and production-ready, with six new features that will simplify and streamline your application development process:

Console Log Access – You can now see the CloudWatch Logs entries that are created when you test your GraphQL queries, mutations, and subscriptions from within the AWS AppSync Console.

Console Testing with Mock Data – You can now create and use mock context objects in the console for testing purposes.

Subscription Resolvers – You can now create resolvers for AWS AppSync subscription requests, just as you can already do for query and mutate requests.

Batch GraphQL Operations for DynamoDB – You can now make use of DynamoDB’s batch operations (BatchGetItem and BatchWriteItem) across one or more tables. in your resolver functions.

CloudWatch Support – You can now use Amazon CloudWatch Metrics and CloudWatch Logs to monitor calls to the AWS AppSync APIs.

CloudFormation Support – You can now define your schemas, data sources, and resolvers using AWS CloudFormation templates.

A Brief AppSync Review
Before diving in to the new features, let’s review the process of creating an AWS AppSync API, starting from the console. I click Create API to begin:

I enter a name for my API and (for demo purposes) choose to use the Sample schema:

The schema defines a collection of GraphQL object types. Each object type has a set of fields, with optional arguments:

If I was creating an API of my own I would enter my schema at this point. Since I am using the sample, I don’t need to do this. Either way, I click on Create to proceed:

The GraphQL schema type defines the entry points for the operations on the data. All of the data stored on behalf of a particular schema must be accessible using a path that begins at one of these entry points. The console provides me with an endpoint and key for my API:

It also provides me with guidance and a set of fully functional sample apps that I can clone:

When I clicked Create, AWS AppSync created a pair of Amazon DynamoDB tables for me. I can click Data Sources to see them:

I can also see and modify my schema, issue queries, and modify an assortment of settings for my API.

Let’s take a quick look at each new feature…

Console Log Access
The AWS AppSync Console already allows me to issue queries and to see the results, and now provides access to relevant log entries.In order to see the entries, I must enable logs (as detailed below), open up the LOGS, and check the checkbox. Here’s a simple mutation query that adds a new event. I enter the query and click the arrow to test it:

I can click VIEW IN CLOUDWATCH for a more detailed view:

To learn more, read Test and Debug Resolvers.

Console Testing with Mock Data
You can now create a context object in the console where it will be passed to one of your resolvers for testing purposes. I’ll add a testResolver item to my schema:

Then I locate it on the right-hand side of the Schema page and click Attach:

I choose a data source (this is for testing and the actual source will not be accessed), and use the Put item mapping template:

Then I click Select test context, choose Create New Context, assign a name to my test content, and click Save (as you can see, the test context contains the arguments from the query along with values to be returned for each field of the result):

After I save the new Resolver, I click Test to see the request and the response:

Subscription Resolvers
Your AWS AppSync application can monitor changes to any data source using the @aws_subscribe GraphQL schema directive and defining a Subscription type. The AWS AppSync client SDK connects to AWS AppSync using MQTT over Websockets and the application is notified after each mutation. You can now attach resolvers (which convert GraphQL payloads into the protocol needed by the underlying storage system) to your subscription fields and perform authorization checks when clients attempt to connect. This allows you to perform the same fine grained authorization routines across queries, mutations, and subscriptions.

To learn more about this feature, read Real-Time Data.

Batch GraphQL Operations
Your resolvers can now make use of DynamoDB batch operations that span one or more tables in a region. This allows you to use a list of keys in a single query, read records multiple tables, write records in bulk to multiple tables, and conditionally write or delete related records across multiple tables.

In order to use this feature the IAM role that you use to access your tables must grant access to DynamoDB’s BatchGetItem and BatchPutItem functions.

To learn more, read the DynamoDB Batch Resolvers tutorial.

CloudWatch Logs Support
You can now tell AWS AppSync to log API requests to CloudWatch Logs. Click on Settings and Enable logs, then choose the IAM role and the log level:

CloudFormation Support
You can use the following CloudFormation resource types in your templates to define AWS AppSync resources:

AWS::AppSync::GraphQLApi – Defines an AppSync API in terms of a data source (an Amazon Elasticsearch Service domain or a DynamoDB table).

AWS::AppSync::ApiKey – Defines the access key needed to access the data source.

AWS::AppSync::GraphQLSchema – Defines a GraphQL schema.

AWS::AppSync::DataSource – Defines a data source.

AWS::AppSync::Resolver – Defines a resolver by referencing a schema and a data source, and includes a mapping template for requests.

Here’s a simple schema definition in YAML form:

  AppSyncSchema:
    Type: "AWS::AppSync::GraphQLSchema"
    DependsOn:
      - AppSyncGraphQLApi
    Properties:
      ApiId: !GetAtt AppSyncGraphQLApi.ApiId
      Definition: |
        schema {
          query: Query
          mutation: Mutation
        }
        type Query {
          singlePost(id: ID!): Post
          allPosts: [Post]
        }
        type Mutation {
          putPost(id: ID!, title: String!): Post
        }
        type Post {
          id: ID!
          title: String!
        }

Available Now
These new features are available now and you can start using them today! Here are a couple of blog posts and other resources that you might find to be of interest:

Jeff;

 

 

Real-Time Hotspot Detection in Amazon Kinesis Analytics

Post Syndicated from Randall Hunt original https://aws.amazon.com/blogs/aws/real-time-hotspot-detection-in-amazon-kinesis-analytics/

Today we’re releasing a new machine learning feature in Amazon Kinesis Data Analytics for detecting “hotspots” in your streaming data. We launched Kinesis Data Analytics in August of 2016 and we’ve continued to add features since. As you may already know, Kinesis Data Analytics is a fully managed real-time processing engine for streaming data that lets you write SQL queries to derive meaning from your data and output the results to Kinesis Data Firehose, Kinesis Data Streams, or even an AWS Lambda function. The new HOTSPOT function adds to the existing machine learning capabilities in Kinesis that allow customers to leverage unsupervised streaming based machine learning algorithms. Customers don’t need to be experts in data science or machine learning to take advantage of these capabilities.

Hotspots

The HOTSPOTS function is a new Kinesis Data Analytics SQL function you can use to idenitfy relatively dense regions in your data without having to explicity build and train complicated machine learning models. You can identify subsections of your data that need immediate attention and take action programatically by streaming the hotspots out to a Kinesis Data stream, to a Firehose delivery stream, or by invoking a AWS Lambda function.

There are a ton of really cool scenarios where this could make your operations easier. Imagine a ride-share program or autonomous vehicle fleet communicating spatiotemporal data about traffic jams and congestion, or a datacenter where a number of servers start to overheat indicating an HVAC issue. HOTSPOTS is not limited to spatiotemporal data and you could apply it across many problem domains.

The function follows some simple syntax and accepts the DOUBLE, INTEGER, FLOAT, TINYINT, SMALLINT, REAL, and BIGINT data types.

The HOTSPOT function takes a cursor as input and returns a JSON string describing the hotspot. This will be easier to understand with an example.

Using Kinesis Data Analytics to Detect Hotspots

Let’s take a simple data set from NY Taxi and Limousine Commission that tracks yellow cab pickup and dropoff locations. Most of this data is already on S3 and publicly accessible at s3://nyc-tlc/. We will create a small python script to load our Kinesis Data Stream with Taxi records which will feed our Kinesis Data Analytics. Finally we’ll output all of this to a Kinesis Data Firehose connected to an Amazon Elasticsearch Service cluster for visualization with Kibana. I know from living in New York for 5 years that we’ll probably find a hotspot or two in this data.

First, we’ll create an input Kinesis stream and start sending our NYC Taxi Ride data into it. I just wrote a quick python script to read from one of the CSV files and used boto3 to push the records into Kinesis. You can put the record in whatever way works for you.

 

import csv
import json
import boto3
def chunkit(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]

kinesis = boto3.client("kinesis")
with open("taxidata2.csv") as f:
    reader = csv.DictReader(f)
    records = chunkit([{"PartitionKey": "taxis", "Data": json.dumps(row)} for row in reader], 500)
    for chunk in records:
        kinesis.put_records(StreamName="TaxiData", Records=chunk)

Next, we’ll create the Kinesis Data Analytics application and add our input stream with our taxi data as the source.

Next we’ll automatically detect the schema.

Now we’ll create a quick SQL Script to detect our hotspots and add that to the Real Time Analytics section of our application.

CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
    "pickup_longitude" DOUBLE,
    "pickup_latitude" DOUBLE,
    HOTSPOTS_RESULT VARCHAR(10000)
); 
CREATE OR REPLACE PUMP "STREAM_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM" 
    SELECT "pickup_longitude", "pickup_latitude", "HOTSPOTS_RESULT" FROM
        TABLE(HOTSPOTS(
            CURSOR(SELECT STREAM * FROM "SOURCE_SQL_STREAM_001"),
            1000,
            0.013,
            20
        )
    );


Our HOTSPOTS function takes an input stream, a window size, scan radius, and a minimum number of points to count as a hotspot. The values for these are application dependent but you can tinker with them in the console easily until you get the results you want. There are more details about the parameters themselves in the documentation. The HOTSPOTS_RESULT returns some useful JSON that would let us plot bounding boxes around our hotspots:

{
  "hotspots": [
    {
      "density": "elided",
      "minValues": [40.7915039, -74.0077401],
      "maxValues": [40.7915041, -74.0078001]
    }
  ]
}

 

When we have our desired results we can save the script and connect our application to our Amazon Elastic Search Service Firehose Delivery Stream. We can run an intermediate lambda function in the firehose to transform our record into a format more suitable for geographic work. Then we can update our mapping in Elasticsearch to index the hotspot objects as Geo-Shapes.

Finally, we can connect to Kibana and visualize the results.

Looks like Manhattan is pretty busy!

Available Now
This feature is available now in all existing regions with Kinesis Data Analytics. I think this is a really interesting new feature of Kinesis Data Analytics that can bring immediate value to many applications. Let us know what you build with it on Twitter or in the comments!

Randall