Tag Archives: Amazon Elasticsearch Service

A deep dive into high-cardinality anomaly detection in Elasticsearch

Post Syndicated from Kaituo Li original https://aws.amazon.com/blogs/big-data/a-deep-dive-into-high-cardinality-anomaly-detection-in-elasticsearch/

In May 2020, we announced the general availability of real-time anomaly detection for Elasticsearch. With that release we leveraged the Random Cut Forest (RCF) algorithm to identify anomalous behaviors in the multi-dimensional data streams generated by Elasticsearch queries. We focused on aggregation first, to enable our users to quickly and accurately detect anomalies in their data streams. However, consider the example data in the following table.

timestamp    avg. latency    region
12:00 pm     3.1             Seattle
12:00 pm     4.1             New York
12:00 pm     5.9             Berlin
12:01 pm     2.6             Seattle
12:01 pm     5.3             New York
12:01 pm     5.8             Berlin

The data consists of one data field, avg. latency, and one attribute or categorical field, region. If we want to perform anomaly detection on this data, we could take the following strategy:

  1. Separate the data by the region attribute to create a separate data stream entity for each region.
  2. Construct an anomaly detector for each entity. If the cardinality of the region attribute, that is, the number of possible values of region, is small, we can create separate anomaly detectors by filtering on each possible value of region.

But what if the cardinality of this attribute is large? Or what if the set of possible values changes over time, such as a source IP address or product ID? The existing anomaly detection tool doesn’t scale well in this situation.

We define the high-cardinality anomaly detection (HCAD) problem as performing anomaly detection on a data stream where individual entities in the stream are defined by a choice of attribute. In this use case, our goal is to perform anomaly detection on each data stream defined by a particular choice of region. That is, the Seattle region produces its own latency data stream, as do the New York and Berlin regions.

In this post, we dive into the motivation, design, and development of the HCAD capability. We begin with an in-depth description of the HCAD problem and its properties. We then share the details of our solution and the challenges and questions we encountered during our research and development. Finally, we describe the system and architecture of our solution, especially the components tackling scalability concerns.

High-cardinality anomaly detection

In this section, we elaborate on the definition of the HCAD problem. As described earlier, we can think of HCAD as a way to produce multiple data streams by defining each data stream by a particular choice of attribute. For each data stream, we want to perform streaming anomaly detection as usual—we want to detect anomalies relative to that individual data stream’s own history.

We define an individual entity by fixing a particular value of one or more attributes and aggregating values over all remaining attributes. In the language of table manipulation in SQL, each entity is defined using GROUP BY on an attribute. Specifically, a group of entities is defined by selecting one or more attributes, where each entity is given by the data fields for each particular value of those attributes. For example, applying GROUP BY to the region attribute in the preceding example data produces three data stream entities: one for Seattle, one for New York, and one for Berlin. Within each of these data streams, we want to find anomalies with respect to that data stream’s history.
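To make this concrete in Elasticsearch terms, the following query is a rough sketch of how a terms aggregation on the region field, with an average sub-aggregation over latency, yields one average-latency series per region. The index and field names (latency-logs, region, latency) are hypothetical and only for illustration.

GET latency-logs/_search
{
  "size": 0,
  "aggs": {
    "by_region": {
      "terms": { "field": "region" },
      "aggs": {
        "avg_latency": { "avg": { "field": "latency" } }
      }
    }
  }
}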

This idea of defining a data stream entity by attribute values extends to multiple attribute fields. When multiple attribute fields exist but we group by only one of those attributes, we aggregate each entity over the remaining attribute fields. For example, suppose you have network traffic data consisting of two attribute fields and one data field. The attribute fields are a source_ip address and a dest_ip address. The data field is the number of bytes_transferred in that particular network transaction between the given two IP addresses. The following table gives an example of such a dataset.

timestamp    source_ip       dest_ip          bytes_transferred
12:00 pm     192.168.1.1     192.168.1.20     138
12:00 pm     192.168.1.1     192.168.1.300    21
12:00 pm     192.168.1.20    192.168.1.300    5
12:01 pm     192.168.1.1     192.168.1.20     289
12:01 pm     192.168.1.1     192.168.1.300    10
12:02 pm     192.168.1.1     192.168.1.20     244
12:02 pm     192.168.1.1     192.168.1.300    16
12:02 pm     192.168.1.20    192.168.1.300    8

One way to define an entity is to group by both source IP address and destination IP address combinations. Under this method of defining entities, we end up with the following data streams.

entity: (source_ip, dest_ip)     12:00 pm    12:01 pm    12:02 pm    12:03 pm
(192.168.1.1, 192.168.1.20)      138         289         244
(192.168.1.1, 192.168.1.300)     21          10          16
(192.168.1.20, 192.168.1.300)    5           0           8

On the other hand, if we define an entity only by its source IP address, we aggregate bytes transferred over the possible destination IP addresses.

entity: (source_ip,)    12:00 pm    12:01 pm    12:02 pm    12:03 pm
(192.168.1.1,)          159         299         260
(192.168.1.20,)         5           0           8

HCAD is distinct from another anomaly detection technique called population analysis. The goal of population analysis is to discover entire entities with values and patterns distinct from other entities. For example, the bytes transferred data stream associated with the entity (192.168.1.1, 192.168.1.20) is much larger in value than either of the other entities. Assuming many entities exist with values in the range of 1 to 30, this entity is considered a population anomaly. An entity can be a population anomaly even if its data stream contains no anomalies relative to its own history.

Depending on the way we define entities from attributes, the number of data stream entities changes. This is an important consideration with regards to scale and density of the data streams: grouping by too many attributes may leave you with entities that have too few observations for a meaningful data stream. This is not uncommon in real-world datasets. Even in a dataset with only one attribute, real-world data tends to adhere to a power-law scaling of data density. Simply put, the majority of data stream activity occurs in a minority of entities. There is likely a long tail of sparse entities. Given this observation, if the stream aggregation window is too small, there are many missing data points in these sparse entities.

Data stream models for HCAD

We described the HCAD problem, but how do we build a machine learning solution? Furthermore, how is this solution different from the currently available non-HCAD single-stream solution? In this section, we explain our process for model selection and why we arrived at using Random Cut Forests for the high-cardinality regime. We then address scalability problems by exploring RCF’s hyperparameter space. Finally, we address certain issues that arise when dealing with sparse data streams.

Model selection

Designing an HCAD solution has several scientific challenges. Whatever algorithmic solution we arrive at must satisfy several systems constraints:

  • The algorithm must work in a streaming context: aggregated feature queries stream in from Elasticsearch, and the anomaly detection models receive each new feature aggregate only one at a time
  • The HCAD solution must respect the constraints of the customer’s hardware and should have a restricted CPU and memory impact
  • The solution should be scalable with respect to data throughput, number of entities, and number of nodes in the cluster
  • The algorithm must be unsupervised, because the goal is to classify anomalous data in a streaming context without any labeled training set

Our team identified three classes of anomaly detection model based on the relationship between number of entities and number of models:

  • 1:1 model – Each entity is given its own AD model. No data or anomaly information is shared between the models, but because the number of models scales with the number of entities, we must keep the model small to satisfy customer scaling needs.
  • N:1 model – A single AD model is responsible for detecting each entity’s anomalies. Deep learning-based AD models typically fall under this category.
  • N:K model – A subset of entities is assigned to one of several individual models. Typically, some clustering algorithm is used to determine an appropriate partition of entities by identifying common features in the data streams.

Each general class of solution has its own tradeoffs with respect to the ability to distribute across cluster nodes, scale with respect to the number of entities, and detect anomalies on benchmark datasets. After some analysis of these tradeoffs and experimentation, we decided on the 1:1 approach. Within this class of HCAD solution, there are many candidate data stream anomaly detection algorithms. We explored many of these algorithms and tested different lightweight models before deciding on Random Cut Forests. RCF works particularly well across a wide variety of data stream behaviors. This fit well with our goal of supporting as wide a range of customer use cases as possible.

Scaling Random Cut Forests

To keep memory costs down when using RCFs as our AD model, we started by exploring the algorithm’s hyperparameter space. The model has three main hyperparameters:

  • T – Number of trees
  • S – Sample size per tree
  • D – Shingle dimension

The RCF model size is O(TDS). Sample size per tree is related to expected anomaly rate, and based on our experiments with a wide variety of datasets, it was best to leave this hyperparameter at its default value of 256 from the single-stream solution. The dimensionality is a function of the customer input but also of the model’s shingle size. We discuss the role of shingle size in the next section. Primarily to satisfy the scaling and model size constraints of the HCAD system, we focused on studying the effect of the number of trees on algorithm performance.

Experiments show that 10 trees per forest gives acceptable results on benchmark datasets; a default of 100 trees is used in the single-stream solution. In the original plugin, we chose this large number of trees to ensure that the model can keep an accurate sketch of a long enough period of data samples. In doing so, we can recognize long time-scale changes to the data stream. However, we found in our benchmark high-cardinality data streams that such a large model is unnecessary and that 10 trees are often sufficient for summarizing each high-cardinality data stream’s statistics.

Our experiments measured the precision and recall on labeled data streams. Labels took the form of anomaly windows: regions in time where an anomaly is known to occur at some point inside the window. A true positive is the positive identification of such a window by the anomaly detection method. Any positively predicted point outside a window is considered a false positive. For an example labeled dataset, see Using Random Cut Forests for real-time anomaly detection in Amazon Elasticsearch Service.
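As a concrete reference for this evaluation protocol, the following Java sketch (not the plugin’s benchmark code) credits a window as a single true positive if any predicted anomaly timestamp falls inside it and counts every prediction outside all windows as a false positive.

import java.util.List;

public class WindowedEvaluation {

    // A labeled anomaly window [start, end], in the same time units as the predictions.
    static class Window {
        final long start;
        final long end;
        Window(long start, long end) { this.start = start; this.end = end; }
        boolean contains(long t) { return t >= start && t <= end; }
    }

    // A window counts as one true positive if any prediction lands inside it;
    // predictions outside every window count as false positives.
    static double[] precisionRecall(List<Long> predictions, List<Window> windows) {
        long falsePositives = predictions.stream()
                .filter(t -> windows.stream().noneMatch(w -> w.contains(t)))
                .count();
        long detectedWindows = windows.stream()
                .filter(w -> predictions.stream().anyMatch(w::contains))
                .count();
        double precision = detectedWindows + falsePositives == 0
                ? 1.0 : (double) detectedWindows / (detectedWindows + falsePositives);
        double recall = windows.isEmpty()
                ? 1.0 : (double) detectedWindows / windows.size();
        return new double[] { precision, recall };
    }
}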

Handling sparse data streams

As mentioned earlier, real-world high-cardinality datasets typically exhibit a power-law like distribution in entity activity. That is, a minority of the entities produce the majority of the data. The earlier source and destination IP address use case is an example: for many websites, the majority of traffic comes from a small collection of sources, whereas individual visitors make up a long tail of sparse activity. Under this assumption, the choice of shingle size is important in defining our entity data streams.

Shingling is a standard preprocessing technique for transforming a one-dimensional data stream x_t into a d-dimensional data stream s_t by converting subsequences of length d into d-dimensional vectors: s_t = (x_{t-d+1}, ..., x_{t-1}, x_t). The following diagram illustrates the shingling process using a shingle size of four.

These vectors, instead of the raw stream values, are then fed into the RCF model. In anomaly detection, using shingling has several benefits. First, shingling filters out small-scale noise in the data. Second, shingles allow the model to detect breaks in certain local patterns or frequency changes. That is, a shingled RCF model learns some of the local temporal behavior of your data stream.
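To make the transformation concrete, here is a minimal Java sketch of shingling, not the plugin’s implementation: it slides a window of length d over the raw stream and emits one d-dimensional vector per time step once enough history exists.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Shingler {

    // Converts a raw stream x_1..x_n into shingles s_t = (x_{t-d+1}, ..., x_t).
    static List<double[]> shingle(double[] stream, int d) {
        List<double[]> shingles = new ArrayList<>();
        for (int t = d - 1; t < stream.length; t++) {
            double[] s = new double[d];
            // Copy the last d observations ending at time t, oldest first.
            System.arraycopy(stream, t - d + 1, s, 0, d);
            shingles.add(s);
        }
        return shingles;
    }

    public static void main(String[] args) {
        double[] latency = {3.1, 2.6, 2.9, 3.0, 9.8, 3.2};
        // With a shingle size of 4, the first shingle is (3.1, 2.6, 2.9, 3.0).
        shingle(latency, 4).forEach(s -> System.out.println(Arrays.toString(s)));
    }
}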

From discussions with our customers and analysis of real-world anomalies, we realized that many customers are looking for distributional anomalies: values that are outside the normal range of values of a data stream. This is in contrast to contextual anomalies, where a data point is considered anomalous in the context of just the data stream’s local history. The following figure depicts this distinction. On the left is a plot of a data stream, and on the right is a histogram of the values attained by this stream in the time window shown. The red data point is a distributional anomaly because its value falls within a low-density regime of the value distribution. The orange data point, on the other hand, is a contextual anomaly: its value is commonly occurring within this span of time but the presence of a spike at this particular point in time is unexpected.

The use of a shingle dimension greater than one allows the RCF model to detect these contextual anomalies in addition to the distributional anomalies.

One challenge with using shingles, however, is how to handle missing data. When data is unavailable at a particular time t, the shingles at times t, t+1, …, t+d−1 cannot be constructed. This results in a delay in the model’s ability to report anomalies. We mitigate the impact of the occasional missing datum by using interpolation. However, when a data stream is sparse, it’s unlikely that any shingle can be constructed, thus turning interpolation into a prediction problem. Whether or not shingling is appropriate for your data is a function of the aggregation window used in the Elasticsearch query and the entity data density.

Scaling anomaly detection in Elasticsearch

In this section, we deep dive into the engineering challenges encountered in building the HCAD tool, particularly regarding the scalability with respect to the number of entities. We first describe the challenges we faced. Then we explain how our HCAD solution balances scalability and resource usage. Finally, we collected these ideas into a description of the overall HCAD framework.

The challenge

As described earlier, our goal was to support filtering the data by attribute or categorical fields and to create a separate model for each attribute or categorical value. After examining several real-world use cases, we determined that the HCAD plugin needed to handle millions of categorical values. Processing this many unique values was a challenging scalability issue that affected several key resources:

  • Storage – At the extreme, with 100 1-minute interval detectors and millions of entities for each detector running on our evaluation workload, we have seen the checkpoint index reach up to 170 GB in 1 day.
  • Memory – Compared to the single-stream detector, we could decrease the model size by approximately 20 times by decreasing shingle size and the number of RCF trees. But the number of entities is unbounded.
  • CPU – A single-stream detector mostly runs serial processing. During an HC detector run, multiple entities compete for CPU cycles for model update and inference. The CPU time grows linearly relative to the number of entities processed in each AD job run.

Designing for scalability and resource control

Based on these scalability issues, we chose to extend the current AD architecture because it already had these attributes:

  • Easy to scale out
  • Powerful enough to handle unpredictable scaling requirements
  • Able to control resource usage

However, meeting these challenges for HCAD required three key changes to our existing AD architecture.

First, we placed embarrassingly parallel computations on multiple nodes instead of on the coordinating node. The coordinating node acts as the start of the task workflow. It only fetches features and assigns each node in the cluster a roughly equal share of those features. Other nodes process the features, train and run local models, and write results. Therefore, increasing the number of nodes by a factor of K asymptotically increases the number of categorical values we can handle by the same factor.

Second, in a single-stream detector, the amount of memory used is proportional to the number of features and is fixed when the detector is defined. However, with the introduction of HCAD, the number of entities is not fixed and the number of active entities is likely to change. Therefore, the required memory may change continuously over the lifetime of a detector. Caching accommodates this without pre-allocating a fixed amount of memory for a detector. If enough memory exists, we create models for all entities and monitor anomalies. Otherwise, we cache the hottest entities’ models up to the amount that the cache memory can contain. For example, if our memory can host only 100 models and there are millions of entities, the maximum number of active entities in the cache is the hottest 100. We maintain a time-decayed count of each entity. The cache uses this information to measure an entity’s hotness.
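The following Java sketch illustrates the general idea of such a hotness-based cache in a simplified form; it is not the plugin’s cache implementation. Each access bumps a time-decayed count, and when the cache is full, the entity with the lowest count is evicted.

import java.util.HashMap;
import java.util.Map;

public class HotnessCache<V> {

    private final int capacity;                       // maximum number of models held in memory
    private final double decay;                       // e.g. 0.99, applied on every access
    private final Map<String, V> models = new HashMap<>();
    private final Map<String, Double> hotness = new HashMap<>();

    HotnessCache(int capacity, double decay) {
        this.capacity = capacity;
        this.decay = decay;
    }

    // Records an access for an entity and, if the cache is full, evicts the coldest entity.
    void put(String entityId, V model) {
        // Decay all counts so that old activity gradually loses weight.
        hotness.replaceAll((id, count) -> count * decay);
        hotness.merge(entityId, 1.0, Double::sum);
        if (!models.containsKey(entityId) && models.size() >= capacity) {
            String coldest = models.keySet().stream()
                    .min((a, b) -> Double.compare(hotness.getOrDefault(a, 0.0),
                                                  hotness.getOrDefault(b, 0.0)))
                    .orElseThrow();
            models.remove(coldest);                   // the least hot entity loses its model
        }
        models.put(entityId, model);
    }

    V get(String entityId) {
        return models.get(entityId);
    }
}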

Finally, we implemented various strategies for combating the extra overhead of running HC detectors:

  • Rate limiting – We limit concurrent computations and throttle bursty usage. For example, when replacing models in the cache, the cache sends get and search requests to fetch and potentially train models. If there is bursty traffic to replace models, the number of requests might exceed Elasticsearch’s get and search thread pool’s maximum queue size and cause Elasticsearch to reject all get and search requests. We therefore rate limit how quickly models can be replaced (see the sketch after this list).
  • Active cleanup – This keeps resource usage within a safe level before it’s too late. For example, we keep checkpoints for 3 days. When any checkpoint shard grows larger than 50 GB (the recommended maximum shard size), we start deleting checkpoints more aggressively.
  • Minimizing space usage – For example, in single-stream anomaly detection, we record a model’s running results during each interval. An entity’s model may take time to become ready when there is not enough historical data for training. We don’t need to record such entities’ results because we wouldn’t record anything useful other than an anomaly grade and confidence of zero. This optimization reduced the result index size by 4–8 times in one of our experiments.
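As an example of the rate limiting mentioned above, a simple way to bound concurrent model restores is a semaphore-based limiter, sketched below in Java as a simplification rather than the plugin’s actual implementation: a restore request proceeds only if a permit is available; otherwise it is skipped and retried in a later detector run.

import java.util.concurrent.Semaphore;

public class ModelRestoreLimiter {

    // Allow at most maxConcurrent checkpoint get/search requests at a time.
    private final Semaphore permits;

    ModelRestoreLimiter(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    // Returns true if the restore ran, false if it was throttled.
    boolean tryRestore(Runnable restoreTask) {
        if (!permits.tryAcquire()) {
            return false;                             // throttled; retry in a later AD job run
        }
        try {
            restoreTask.run();                        // fetch the checkpoint, possibly train the model
            return true;
        } finally {
            permits.release();
        }
    }
}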

Architecture

The following figure summarizes the HCAD architecture.

The end-to-end story of HCAD is as follows:

  1. A user wants to get alerts when an anomaly for a particular entity in the whole corpus arises (for example, high CPU usage on a host).
  2. The user creates an HCAD detector to describe the source data (index name), feature (for example, average CPU usage within an interval), and sampling frequency (for example, 1 minute).
  3. Based on the detector configuration, the AD plugin issues a query to fetch feature data for each host regularly (every 1 minute). Users don’t need to know what hosts to query for in the first place.
  4. A coordinating node infers the entities from the query result.
  5. The coordinating node distributes entities’ features to all nodes in the cluster.
  6. On each node, models are trained for the incoming entities, and anomaly grades are inferred, indicating how different the current CPU usage is from the trends that have recently been observed for the same hosts’ CPU usage.
  7. If cache memory is not enough for all incoming entities’ models, the cache admits models based on the entities’ hotness.

The Kibana workflow

In this section, we show how to use the HCAD in Kibana. Let’s imagine that we need to monitor the high or low CPU usage of our hosts. To do that, we create a detector, define its features, and choose a category field.

Creating a detector

To create and configure a detector, complete the following steps:

  1. On the navigation bar, choose Anomaly detection.
  2. Choose Create detector.

  1. Enter a name and description for the detector.

  1. Choose an index or enter an index pattern for the data source.
  2. For Timestamp field, choose a field so the detector can create a time series profile of the data.
  3. If you want the detector to ignore specific data (such as an invalid CPU usage number), you can configure a data filter.

  1. Specify time frames for detection interval and window delay.

Window delay time should be a conservative estimate. Otherwise, the detector may query for documents within an interval that has not been indexed yet. For this post, we want to have an average CPU usage per minute, and we expect the index processing time to be 1 minute at most.

Defining features

In addition to the preceding settings, we need to add features. The detector aggregates a set of values within a time interval (shingle) to compute a single value according to the feature definition.

  1. Choose Configure model.

  1. For Feature name, enter a name.
  2. Specify your aggregation functions and fields.

We provide five built-in single metric aggregations: Min, Max, Sum, Average, and Count. You can add a customized aggregation by choosing Custom expression for Find anomalies based on. For this post, we add a feature that returns the average of CPU usage values.

As mentioned earlier, you can customize the aggregation method as long as it returns a single value. For example, when a DevOps engineer wants to monitor the count of distinct IPs accessing their company’s Amazon Simple Storage Service (Amazon S3) buckets, they can define a cardinality aggregation that counts unique source IPs.
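As a sketch, the custom aggregation for that use case might look like the following cardinality aggregation; the field name source_ip is an assumption about the underlying mapping, and the exact wrapper expected by the Custom expression box can vary by plugin version.

{
  "unique_source_ips": {
    "cardinality": {
      "field": "source_ip"
    }
  }
}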

Choosing a category field

The host-cloudwatch index in our example has CPU usage per host per minute. We can define a single-stream detector to model all of the hosts’ average CPU usage together. But if each host’s CPU values have different distributions, we can split the hosts’ time series and model them separately. Giving each categorical value a separate baseline is the main change that HCAD introduces.

Previewing and starting the detector

You might want to try out different choices of detector configurations and feature definitions before finalizing them. You can use Sample anomalies for iterative experiments.

Start the detector by choosing Save and start detector. After confirming, the anomaly detector starts collecting data in real time and performing detection.

The detector starts in an initializing state.

We can use the profile API to check initialization progress (see the following code). A detector is initialized if its hottest entities’ models are fully initialized and ready to emit anomaly grade. Because the hottest entity may change, the initialization progress may go backward.

GET _opendistro/_anomaly_detection/detectors/stf6vnUB0XEggmgl3TCj/_profile/init_progress

{
    "init_progress": {
        "percentage": "92%",
        "estimated_minutes_left": 10,
        "needed_shingles": 10
    }
}

After the detector runs for a while, we can check its result on the detector’s Anomaly results tab. The following heatmap gives an overview of anomalies per entity across a timeline, by showing the hostname along the Y-axis and the timeline along the X-axis. A colored block means there is an anomaly, and a gray block means there is no anomaly.

Choosing one of the blocks shows you a more detailed view of the anomaly grade and confidence and the feature values causing the anomalies. We can observe that the detector reports anomalies between 4:30 and 4:50 because the CPU usage is approaching 100%.

The time series of the host-cloudwatch index confirms that host i-WrSNK7zgys has a CPU usage spike between 4:30 and 4:50.

We can set up alerts for the detection results. For instructions, see Anomaly Detection.

Conclusion

General-purpose anomaly detection is challenging. Earlier in 2020, we launched a tool that can find anomalies in your feature queries. In this work, we extended the anomaly detection capabilities to the high-cardinality case. We can now find anomalies within your data when the data contains attribute or categorical fields. Our solution can discover anomalies across many entities defined by these attribute values and also scale with respect to increasing or decreasing number of entities in your data. We look forward to hearing your questions, comments, and feedback.


About the Authors

Kaituo Li is an engineer in Amazon Elasticsearch Service. He has worked on distributed systems, applied machine learning, monitoring, and database storage in Amazon. Before Amazon, Kaituo was a PhD student in Computer Science at the University of Massachusetts Amherst. He likes reading, watching TV, and sports.

Chris Swierczewski is an applied scientist at AWS. He enjoys reading, weightlifting, painting, and board games.

Automating Index State Management for Amazon ES

Post Syndicated from Satya Vajrapu original https://aws.amazon.com/blogs/big-data/automating-index-state-management-for-amazon-es/

When it comes to time-series data, it’s more common to access recent data, such as the last 4 hours or 1 day, than older data. Often, application teams are tasked with maintaining multiple indexes for diverse data workloads, which creates a requirement for a custom solution to manage the index lifecycle. This becomes tedious as the indexes grow and results in housekeeping overhead.

Amazon Elasticsearch Service (Amazon ES) now enables you to automate recurring index management activities. This avoids using any additional tools to manage the index lifecycle inside Elasticsearch. With Index State Management (ISM), you can create a policy that automates these operations based on index age, size, and other conditions, all from within your Amazon ES domain.

In this post, I discuss how you can implement a sample policy to automate routine index management tasks and apply them to indexes and index patterns.

Prerequisites

Before you get started, make sure you complete the following prerequisites:

  1. Have Elasticsearch 6.8 or later (required to use ISM and UltraWarm).
  2. Set up a new Amazon ES domain with UltraWarm enabled.
  3. Make sure your user role has sufficient permissions to access the Kibana console of the Amazon ES domain. If required, validate and configure the access to your domains.

Use case

UltraWarm for Amazon ES is a new low-cost storage tier that provides fast, interactive analytics on up to three petabytes of log data at one-tenth of the cost of the current Amazon ES storage tier. Although hot storage is used for indexing and providing the fastest access, UltraWarm complements the hot storage tier by providing less expensive storage for older and less frequently accessed data, all while maintaining the same interactive analytics experience. Rather than attached storage, UltraWarm nodes use Amazon Simple Storage Service (Amazon S3) and a sophisticated caching solution to improve performance.

To demonstrate the functionality, I present a sample use case of handling time-series data. In this use case, we take a set of existing indexes that start in the hot state and migrate them to UltraWarm storage after a day. Upon migration, the data is stored in a service-managed S3 bucket as read only. We then delete the indexes after 2 days, assuming they are no longer needed.

After we create the Amazon ES domain, complete the following steps:

  1. Log in using the Kibana UI endpoint.
  2. Wait for the domain status to turn active and choose the Kibana endpoint.
  3. On Kibana’s splash page, add all the sample data listed by choosing Try our sample data and choosing Add data.
  4. After adding the data, choose Index Management (the IM icon on the left navigation pane), which opens the Index Policies page.
  5. Choose Create policy.
  6. For Name policy, enter ism-policy-sample.
  7. Replace the default policy with the following code:
    {
        "policy": {
            "description": "Lifecycle Management Policy",
            "default_state": "hot",
            "states": [
                {
                    "name": "hot",
                    "actions": [],
                    "transitions": [
                        {
                            "state_name": "warm",
                            "conditions": {
                                "min_index_age": "1d"
                            }
                        }
                    ]
                },
                {
                    "name": "warm",
                    "actions": [
                        {
                            "retry": {
                                "count": 5,
                                "backoff": "exponential",
                                "delay": "1h"
                            },
                            "warm_migration": {}
                        }
                    ],
                    "transitions": [
                        {
                            "state_name": "delete",
                            "conditions": {
                                "min_index_age": "2d"
                            }
                        }
                    ]
                },
                {
                    "name": "delete",
                    "actions": [
                        {
                            "notification": {
                                "destination": {
                                    "chime": {
                                        "url": "<CHIME_WEBHOOK_URL>"
                                    }
                                },
                                "message_template": {
                                    "source": "The index {{ctx.index}} is being deleted because of actioned policy {{ctx.policy_id}}",
                                    "lang": "mustache"
                                }
                            }
                        },
                        {
                            "delete": {}
                        }
                    ],
                    "transitions": []
                }
            ]
        }
    }
    

You can also use the ISM operations to programmatically work with policies and managed indexes. For example, to attach an ISM policy to an index at the time of creation, you invoke an API action. See the following code:

PUT index_1
{
  "settings": {
    "opendistro.index_state_management.policy_id": "ingest_policy",
    "opendistro.index_state_management.rollover_alias": "some_alias"
  },
  "aliases": {
    "some_alias": {
      "is_write_index": true
    }
  }
}

In this case, the ingest_policy is applied to index_1 with the rollover action defined in some_alias. For the complete list of ISM operations to work with policies and managed indexes, see ISM API.

  1. Choose Create. You can now see your index policy on the Index Policies page.
  2. On the Indices page, search for kibana_sample, which should list all the sample data indexes you added earlier.
  3. Select all the indexes and choose Apply policy.
  4. From the Policy ID drop-down menu, choose the policy created in the previous step.
  5. Choose Apply.

The policy is now assigned and starts managing the indexes. On the Managed Indices page, you can observe the status as Initializing.

When initialization is complete, the status changes to Running.

You can also set a refresh frequency to refresh the managed indexes’ status information.
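If you prefer to check the status programmatically instead of in Kibana, the ISM explain API returns the policy and current state metadata for a managed index. The following call assumes one of the Kibana sample data indexes added earlier; adjust the index name to your own.

GET _opendistro/_ism/explain/kibana_sample_data_logs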

Demystifying the policy

In this section, I explain the tenets of an index policy and how it’s structured.

Policies are JSON documents that define the following:

  • The states an index can be in
  • Any actions you want the plugin to take when an index enters the state
  • Conditions that must be met for an index to move or transition into a new state

The policy document begins with basic metadata like description, the default_state the index should enter, and finally a series of state definitions.

A state is the status that the managed index is currently in. A managed index can only be in one state at a time. Each state has associated actions that are run sequentially upon entering a state and transitions that are checked after all the actions are complete.

The first state is hot. In this use case, no actions are defined in this hot state; the managed indexes land in this state initially and then transition to warm. Transitions define the conditions that need to be met for a state to change (in this case, change to warm after the index crosses 24 hours). See the following code:

            {
                "name": "hot",
                "actions": [],
                "transitions": [
                    {
                        "state_name": "warm",
                        "conditions": {
                            "min_index_age": "1d"
                        }
                    }
                ]
            },

We can quickly verify the states on the console. The current state is hot and attempts the transition to warm after 1 day. The transition typically completes within an hour and is reflected under the Status column.

The warm state has actions defined to move the index to Ultrawarm storage. When the actions run successfully, the state has another transition to delete after the index ages 2 days. See the following code:

            {
                "name": "warm",
                "actions": [
                    {
                        "retry": {
                            "count": 5,
                            "backoff": "exponential",
                            "delay": "1h"
                        },
                        "warm_migration": {}
                    }
                ],
                "transitions": [
                    {
                        "state_name": "delete",
                        "conditions": {
                            "min_index_age": "2d"
                        }
                    }
                ]
            },

You can again verify the status of the managed indexes.

You can also verify this on the Amazon ES console under the UltraWarm storage usage column.

The third state of the policy document marks the indexes to delete based on the actions. This policy state assumes your index is non-critical and no longer receiving write requests; having zero replicas carries some risk of data loss.

The final delete state has two actions defined. The first action is self-explanatory; it sends a notification as defined in the message_template to the destination. See the following code:

            {
                "name": "delete",
                "actions": [
                    {
                        "notification": {
                            "destination": {
                                "chime": {
                                    "url": "<CHIME_WEBHOOK_URL>"
                                }
                            },
                            "message_template": {
                                "source": " The index {{ctx.index}} is being deleted because of actioned policy {{ctx.policy_id}}",
                                "lang": "mustache"
                            }
                        }
                    },
                    {
                        "delete": {}
                    }
                ],
                "transitions": []
            }

I have configured the notification endpoint to be on Amazon Chime <CHIME_WEBHOOK_URL>. For more information about using webhooks, see Webhooks for Amazon Chime.

You can also configure the notification to send to destinations like Slack or a webhook URL.

At this stage, I received the notification on the Chime webhook (see the following screenshot).

The following screenshot shows the index status on the console.

After the notification is successfully sent, the policy runs the next action in the state that is deleting the indexes. After this final state, the indexes no longer appear on the Managed Indices page.

Additional information on ISM policies

If you have an existing Amazon ES cluster without UltraWarm support (because of a missing prerequisite), you can use the policy states read_only and reduce_replicas to replace the warm state. The following code is the policy template for these two states:

            {
                "name": "reduce_replicas",
                "actions": [{
                  "replica_count": {
                    "number_of_replicas": 0
                  }
                }],
                "transitions": [{
                  "state_name": "read_only",
                  "conditions": {
                    "min_index_age": "2d"
                  }
                }]
            },
            {
                "name": "read_only",
                "actions": [
                    {
                        "read_only": {}
                      }
                ],
                "transitions": [
                    {
                        "state_name": "delete",
                        "conditions": {
                            "min_index_age": "3d"
                        }
                    }
                ]
            },

Summary

In this post, you learned how to use the Index State Management feature with UltraWarm for Amazon ES. The walkthrough illustrated how to manage indexes using this plugin with a sample lifecycle policy.

For more information about the ISM plugin, see Index State Management. If you need enhancements or have other feature requests, please file an issue. To get involved with the project, see Contributing Guidelines.

A big takeaway for me as I evaluated the ISM plugin in Amazon ES was that the plugin is fully compatible with and works on Open Distro for Elasticsearch. For more information, see Index State Management in Open Distro for Elasticsearch. This makes it possible to use Open Distro for Elasticsearch as an on-premises or internal solution while using the managed service for your production workloads.


About the Author

Satya Vajrapu is a DevOps Consultant with Amazon Web Services. He works with AWS customers to help design and develop various practices and tools in the DevOps toolchain.

Normalize data with Amazon Elasticsearch Service ingest pipelines

Post Syndicated from Vijay Injam original https://aws.amazon.com/blogs/big-data/normalize-data-with-amazon-elasticsearch-service-ingest-pipelines/

Amazon Elasticsearch Service (Amazon ES) is a fully managed service that makes it easy for you to deploy, secure, and run Elasticsearch cost-effectively at scale. Search and log analytics are the two most popular use cases for Amazon ES. In log analytics at scale, a common pattern is to create indexes from multiple sources. In these use cases, how can you ensure that all the incoming data follows a specific, predefined format if it’s operationally not feasible to apply checks in each data source? You can use Elasticsearch ingest pipelines to normalize all the incoming data and create indexes with the predefined format.

What’s an ingest pipeline?

An ingest pipeline lets you use some of your Amazon ES domain’s processing power to apply a set of processors to your documents during indexing. The ingest pipeline applies the processors in order, with the output of one processor feeding into the next processor in the pipe. You define a pipeline with the Elasticsearch _ingest API. The following diagram illustrates this architecture.

To find the available ingest processors in your Amazon ES domain, enter the following code:

GET _ingest/pipeline/
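Before attaching a pipeline to an index, you can also dry-run it against sample documents with the simulate API. The following is a minimal sketch with a single set processor and one test document; the index name and message are placeholders.

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "test pipeline",
    "processors": [
      {
        "set": {
          "field": "doc_received_date",
          "value": "{{_ingest.timestamp}}"
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index-2020.10.06.09",
      "_source": {
        "message": "sample log line"
      }
    }
  ]
}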

Solution overview

In this post, we discuss three log analytics use cases where data normalization is a common technique.

We create three pipelines and normalize the data for each use case. The following diagram illustrates this architecture.

Use case 1

In this first use case, your Amazon ES domain has three sources: Logstash, Fluentd, and AWS Lambda. Your Logstash source sends data to an index with the name index-YYYY.MM.DD.HH (hours at the end). When there is an error in the Fluentd source, it creates an index named index-YYYY.MM.DD (missing the hours). Your domain creates indexes for both formats, which is not what you intended.

One way to correct the index name is to calculate the hour of the ingested data and append it to the index name. If you can’t identify any pattern, or you find further issues with the index name, you need to segregate the data into a different index (for example, format_error) for further analysis.

Use case 2

If your application uses time-series data and analyzes data from fixed time windows, your data sources can sometimes send data from a prior time window. In this use case, you need to check for the incoming data and discard data that doesn’t fit in the current time window.

Use case 3

In some use cases, the value for a key can contain large strings with common prefixes. End-users typically use wildcard characters (*) with the prefix to search on these fields. If your application or Kibana dashboards contain several wildcard queries, they can increase CPU utilization and overall search latency. You can address this by identifying the prefixes from the values and creating a new field with the keyword data type. You can then use term queries on the keyword field and improve search performance.
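For instance, assuming the pipelines later in this post have extracted a prefix field such as application_group from application_type and that it is mapped as a keyword, the following two queries contrast the expensive wildcard search with the cheaper term query. The index name and the value billing are hypothetical.

GET app-logs/_search
{
  "query": {
    "wildcard": { "application_type": { "value": "billing*" } }
  }
}

GET app-logs/_search
{
  "query": {
    "term": { "application_group": "billing" }
  }
}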

Pipeline 1: pipeline_normalize_index

The default pipeline for incoming data is pipeline_normalize_index. This pipeline performs the following actions:

  • Checks if the incoming data belongs to the current date.
  • Checks if the data has any errors in the index name.
  • Segregates the data:
    • If it doesn’t find any errors, it pushes the data to pipeline_normalize_data.
    • If it finds errors, it pushes the data to pipeline_fix_index_name.

Checking the index date

In this step, you can create an ingest pipeline using a script processor, which lets you write a script and execute it within the pipeline.

Use the set processor to add _ingest.timestamp to doc_received_date and compare the index date to the document received date. The script processor lets you write a script in the Painless scripting language. You can create a script to check if the index date matches doc_received_date. The script processor lets you access the ingest document using the ctx variable. See the following code:

       "set":{
            "field":"doc_received_date",
            "value":"{{_ingest.timestamp}}"
         }
      },
      {
         "script":{
            "lang":"painless",
            "source": """
                    DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyy.MM.dd");
                    String dateandhour = ctx._index.substring(ctx._index.indexOf('-') + 1);
                    LocalDate indexdate = LocalDate.parse(dateandhour.substring(0, 10), formatter);
                    ZonedDateTime zonedDateTime = ZonedDateTime.parse(ctx.doc_received_date, DateTimeFormatter.ISO_DATE_TIME);
                    LocalDate doc_received_date = zonedDateTime.toLocalDate();
                    if (doc_received_date.isEqual(indexdate)) {
                        ctx.index_purge = "N";
                    } else {
                        ctx.index_purge = "Y";
                    }
                    if (dateandhour.length() > 10) {
                        ctx.indexformat_error = "N";
                    } else {
                        ctx.indexformat_error = "Y";
                    }
        """,

Checking for index name errors

You can use the same script processor from the previous step to check if the index name matches the format index-YYYY.MM.DD.HH or index-YYYY.MM.DD. See the following code:

if (dateandhour.length() > 10) {
                        ctx.indexformat_error = "N";
                    } else {
                        ctx.indexformat_error = "Y";
                    }

Segregating the data

If the index date doesn’t match the _ingest.timestamp, you can drop the document using the drop processor. If the index name doesn’t match the format index-YYYY.MM.DD.HH, you can segregate the data to the pipeline pipeline_fix_index_name to correct the index name before proceeding to the pipeline pipeline_normalize_data. If the script processor fails, the document is assigned to a default index, indexing_errors. If no issues are found, you proceed to the pipeline pipeline_normalize_data. See the following code:

 "pipeline":{
            "if":"ctx.indexformat_error == 'Y'",
            "name":"pipeline_fix_index_name"
         }
      },
      {
         "remove":{
            "field":[
               "doc_received_date",
               "index_purge",
               "indexformat_error"
            ],
            "ignore_missing":true
         }
      },
      {
         "pipeline":{
            "name":"pipeline_normalize_data"

The following code is an example pipeline:

PUT _ingest/pipeline/pipeline_normalize_index
{
   "description":"pipeline_normalize_index",
   "processors":[
      {
         "set":{
            "field":"doc_received_date",
            "value":"{{_ingest.timestamp}}"
         }
      },
      {
         "script":{
            "lang":"painless",
            "source": """
                    DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyy.MM.dd");
                    String dateandhour = ctx._index.substring(ctx._index.indexOf('-') + 1);
                    LocalDate indexdate = LocalDate.parse(dateandhour.substring(0, 10), formatter);
                    ZonedDateTime zonedDateTime = ZonedDateTime.parse(ctx.doc_received_date, DateTimeFormatter.ISO_DATE_TIME);
                    LocalDate doc_received_date = zonedDateTime.toLocalDate();
                    if (doc_received_date.isEqual(indexdate)) {
                        ctx.index_purge = "N";
                    } else {
                        ctx.index_purge = "Y";
                    }
                    if (dateandhour.length() > 10) {
                        ctx.indexformat_error = "N";
                    } else {
                        ctx.indexformat_error = "Y";
                    }
        """,
         "on_failure":[
            {
               "set":{
                  "field":"Amazon_es_PipelineError",
                  "value":"at Script processor - Purge older Index or Index date error"
               }
            },
            {
               "set":{
                  "field":"_index",
                  "value":"indexing_errors"
               }
            }
         ]
      },
      {
         "drop":{
            "if":"ctx.index_purge == 'Y'"
         }
      },
      {
         "pipeline":{
            "if":"ctx.indexformat_error == 'Y'",
            "name":"pipeline_fix_index_name"
         }
      },
      {
         "remove":{
            "field":[
               "doc_received_date",
               "index_purge",
               "indexformat_error"
            ],
            "ignore_missing":true
         }
      },
      {
         "pipeline":{
            "name":"pipeline_normalize_data"
         }
      }
   ]
}

Pipeline 2: pipeline_normalize_data

The pipeline pipeline_normalize_data normalizes the data. It extracts the prefix from the defined field and creates a new field. You can use the new field for term queries.

In this step, you can use a grok processor to extract prefixes from the existing fields and create a new field that you can use for term queries. The output of this pipeline creates the index. See the following code of an example pipeline:

PUT _ingest/pipeline/pipeline_normalize_data
{
  "description":"pipeline_normalize_data",
  "version":1,
  "processors":[
     {
        "grok":{
           "field":"application_type",
           "patterns":[
              "%{WORD:application_group}"
           ],
           "ignore_missing":true,
           "on_failure":[
              {
                 "set":{
                    "field":"Amazon_es_PipelineError",
                    "value":"application_type error"
                 }
              }
           ]
        }
     }
  ]
}

Pipeline 3: pipeline_fix_index_name

This pipeline fixes the index name. The indexing errors identified in pipeline_normalize_index are the incoming data points for this pipeline. pipeline_fix_index_name extracts the hour from the _ingest.timestamp and appends it to the index name.

The index name errors identified in Pipeline 1 are the data source for this pipeline. You can use the script processor to write a Painless script. The script extracts the hour (HH) from the _ingest.timestamp and appends it to _index. See the following code of the example pipeline:

PUT _ingest/pipeline/pipeline_fix_index_name
    {
      "description":"pipeline_fix_index_name",
      "processors":[
         {
            "set":{
               "field":"doc_received_date",
               "value":"{{_ingest.timestamp}}"
            }
         },
         {
            "script":{
               "lang":"painless",
               "source": 
               """
               ZonedDateTime zonedDateTime = ZonedDateTime.parse(ctx.doc_received_date, DateTimeFormatter.ISO_DATE_TIME);
               LocalDate doc_received_date = zonedDateTime.toLocalDate();
               String receiveddatehour = zonedDateTime.getHour().toString();
               if (zonedDateTime.getHour() < 10) {
                    receiveddatehour = "0" + zonedDateTime.getHour();
               }
               ctx._index = ctx._index + "." + receiveddatehour;
               """,
               "on_failure":[
                  {
                     "set":{
                        "field":"Amazon_es_PipelineError",
                        "value":"at Script processor - Purge older Index or Index date error"
                     }
                  },
                  {
                     "set":{
                        "field":"_index",
                        "value":"indexformat_errors"
                     }
                  }
               ]
            }
         },
         {
            "remove":{
               "field":[
                  "doc_received_date"
               ],
               "ignore_missing":true
            }
         },
         {
            "pipeline":{
               "name":"pipeline_normalize_data"
            }
         }
      ]
   }

Adding the default pipeline to the index template

After creating all the pipelines, add the default pipeline to the index template. See the following code:

"default_pipeline" : "pipeline_normalize_index"

Summary

You can normalize data, fix indexing errors, and segregate operational data and anomalies by using ingest pipelines. Although you can use one pipeline with several processors (depending on the use case), ingest pipelines provide an efficient way to use compute and operational resources by eliminating unwanted indexes.


About the Author

Vijay Injam is a Data Architect with Amazon Web Services.

Kevin Fallis is an AWS specialist search solutions architect. His passion at AWS is to help customers leverage the correct mix of AWS services to achieve success for their business goals. His after-work activities include family, DIY projects, carpentry, playing drums, and all things music.

Field Notes: Monitoring the Java Virtual Machine Garbage Collection on AWS Lambda

Post Syndicated from Steffen Grunwald original https://aws.amazon.com/blogs/architecture/field-notes-monitoring-the-java-virtual-machine-garbage-collection-on-aws-lambda/

When you want to optimize your Java application on AWS Lambda for performance and cost, the general steps are: build, measure, then optimize! To accomplish this, you need a solid monitoring mechanism. Amazon CloudWatch and AWS X-Ray are well suited for this task because they already provide lots of data about your AWS Lambda function. This includes overall memory consumption, initialization time, and duration of your invocations. To examine the Java Virtual Machine (JVM) memory, you require garbage collection logs from your functions. Instances of an AWS Lambda function have a short lifecycle compared to a long-running Java application server. It can be challenging to process the logs from tens or hundreds of these instances.

In this post, you learn how to emit and collect data to monitor the JVM garbage collector activity. Having this data, you can visualize out-of-memory situations of your applications in a Kibana dashboard like in the following screenshot. You gain actionable insights into your application’s memory consumption on AWS Lambda for troubleshooting and optimization.

The lifecycle of a JVM application on AWS Lambda

Let’s first revisit the lifecycle of the AWS Lambda Java runtime and its JVM:

  1. A Lambda function is invoked.
  2. AWS Lambda launches an execution context. This is a temporary runtime environment based on the configuration settings you provide, like permissions, memory size, and environment variables.
  3. AWS Lambda creates a new log stream in Amazon CloudWatch Logs for each instance of the execution context.
  4. The execution context initializes the JVM and your handler’s code.

You typically see the initialization of a fresh execution context when a Lambda function is invoked for the first time, after it has been updated, or it scales up in response to more incoming events.

AWS Lambda maintains the execution context for some time in anticipation of another Lambda function invocation. In effect, the service freezes the execution context after a Lambda function completes. It thaws the execution context when the Lambda function is invoked again if AWS Lambda chooses to reuse it.

During invocations, the JVM also maintains garbage collection as usual. Outside of invocations, the JVM and its maintenance processes like garbage collection are also frozen.

Garbage collection and indicators for your application’s health

The purpose of JVM garbage collection is to clean up objects in the JVM heap, which is the space for an application’s objects. It finds objects which are unreachable and deletes them. This frees heap space for other objects.

You can make the JVM log garbage collection activities to get insights into the health of your application. One example of this is the free heap after each garbage collection. If this metric keeps shrinking, it is an indicator of a memory leak that eventually turns into an OutOfMemoryError. If there is not enough free heap, the JVM might be too busy with garbage collection instead of running your application code. On the other hand, a heap that is too big indicates that there’s potential to decrease the memory configuration of your AWS Lambda function. This keeps garbage collection pauses low and provides a consistent response time.

The garbage collection logging can be configured via an environment variable as part of the AWS Lambda function configuration. The environment variable JAVA_TOOL_OPTIONS is considered by both the Java 8 and 11 JVMs. You use it to pass options that you would usually add to the command line when launching the JVM. The options to configure garbage collection logging and the output is specific to the Java version.

Java 11 uses the Unified Logging System (JEP 158 and JEP 271) which has been introduced in Java 9. Logging can be configured with the environment variable:

JAVA_TOOL_OPTIONS=-Xlog:gc+metaspace,gc+heap,gc:stdout:time,tags

The Serial Garbage Collector will output the logs:

[<TIMESTAMP>][gc] GC(4) Pause Full (Allocation Failure) 9M->9M(11M) 3.941ms (D)
[<TIMESTAMP>][gc,heap] GC(3) DefNew: 3063K->234K(3072K) (A)
[<TIMESTAMP>][gc,heap] GC(3) Tenured: 6313K->9127K(9152K) (B)
[<TIMESTAMP>][gc,metaspace] GC(3) Metaspace: 762K->762K(52428K) (C)
[<TIMESTAMP>][gc] GC(3) Pause Young (Allocation Failure) 9M->9M(21M) 23.559ms (D)

Prior to Java 9, including Java 8, you configure the garbage collection logging as follows:

JAVA_TOOL_OPTIONS=-XX:+PrintGCDetails -XX:+PrintGCDateStamps

The Serial garbage collector output in Java 8 is structured differently:

<TIMESTAMP>: [GC (Allocation Failure)
    <TIMESTAMP>: [DefNew: 131042K->131042K(131072K), 0.0000216 secs] (A)
    <TIMESTAMP>: [Tenured: 235683K->291057K(291076K), 0.2213687 secs] (B)
    366725K->365266K(422148K), (D)
    [Metaspace: 3943K->3943K(1056768K)], (C)
    0.2215370 secs]
    [Times: user=0.04 sys=0.02, real=0.22 secs]
<TIMESTAMP>: [Full GC (Allocation Failure)
    <TIMESTAMP>: [Tenured: 297661K->36658K(297664K), 0.0434012 secs] (B)
    431575K->36658K(431616K), (D)
    [Metaspace: 3943K->3943K(1056768K)], 0.0434680 secs] (C)
    [Times: user=0.02 sys=0.00, real=0.05 secs]

Independent of the Java version, the garbage collection activities are logged to standard out (stdout) or standard error (stderr). Logs appear in the AWS Lambda function’s log stream of Amazon CloudWatch Logs. The log contains the size of memory used for:

  • A: the young generation
  • B: the old generation
  • C: the metaspace
  • D: the entire heap

The notation is before-gc -> after-gc (committed heap). For example, 9M->9M(11M) means the heap occupied 9 MB before the collection and 9 MB afterwards, with 11 MB of committed heap. Read the JVM Garbage Collection Tuning Guide for more details.

Visualizing the logs in Amazon Elasticsearch Service

It is hard to fully understand the garbage collection log by just reading it in Amazon CloudWatch Logs. Visualizing it provides much more insight. This section describes a solution to achieve this.

Solution Overview

Java Solution Overview

Amazon CloudWatch Logs has a feature to stream CloudWatch Logs data to Amazon Elasticsearch Service via an AWS Lambda function. The AWS Lambda function for log transformation is subscribed to the log group of your application’s AWS Lambda function. The subscription filters for a pattern that matches the garbage collection log entries. The log transformation function processes the log messages and puts them into a search cluster. To make the data easy to digest for the search cluster, the function transforms and converts the messages to JSON. With the data in a search cluster, you can visualize it with Kibana dashboards.
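
To illustrate the transformation step, the following is a minimal sketch in Python, assuming the Java 11 log format produced by the configuration shown later in this post; the actual transformation function in the aws-samples repository may be implemented differently. It parses a single garbage collection summary line into the JSON field names used by the dashboard, with the account, log group, and log stream passed in from the CloudWatch Logs event.

import re

# Matches Java 11 summary lines such as:
# [2020-09-03T15:12:34.567+0000][gc] GC(4) Pause Full (Allocation Failure) 9M->9M(11M) 3.941ms
GC_LINE = re.compile(
    r"\[(?P<timestamp>[^\]]+)\]\[gc\] GC\(\d+\) "
    r"(?P<gc_type>Pause \w+)(?: \((?P<gc_cause>[^)]+)\))? "
    r"(?P<before>\d+)M->(?P<after>\d+)M\((?P<size>\d+)M\) "
    r"(?P<duration>[\d.]+)ms"
)

def transform(line, owner, log_group, log_stream):
    """Convert one garbage collection summary log line into a JSON-ready document."""
    match = GC_LINE.search(line)
    if match is None:
        return None  # not a GC summary line (for example, a gc,heap detail line); skip it
    return {
        "@timestamp": match.group("timestamp"),
        "@gc_type": match.group("gc_type"),
        "@gc_cause": match.group("gc_cause"),
        "@heap_before_gc": match.group("before"),
        "@heap_after_gc": match.group("after"),
        "@heap_size_gc": match.group("size"),
        "@gc_duration": match.group("duration"),
        "@owner": owner,
        "@log_group": log_group,
        "@log_stream": log_stream,
    }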

Get Started

To start, launch the solution architecture described above as a prepackaged application from the AWS Serverless Application Repository. It contains all resources ready to visualize the garbage collection logs for your Java 11 AWS Lambda functions in a Kibana dashboard. The search cluster consists of a single t2.small.elasticsearch instance with 10GB of EBS storage. It is protected with Amazon Cognito User Pools so you only need to add your user(s). The T2 instance types do not support encryption of data at rest.

Read the source code for the application in the aws-samples repository.

1. Spin up the application from the AWS Serverless Application Repository:

launch stack button

2. As soon as the application is deployed completely, the outputs of the AWS CloudFormation stack provide the links for the next steps. You will find two URLs in the AWS CloudFormation console called createUserUrl and kibanaUrl.

search stack

3. Use the createUserUrl link from the outputs, or navigate to the Amazon Cognito user pool in the console to create a new user in the pool.

a. Enter an email address as username and email. Enter a temporary password of your choice with at least 8 characters.

b. Leave the phone number empty and uncheck the checkbox to mark the phone number as verified.

c. If necessary, you can check the checkboxes to send an invitation to the new user or to make the user verify the email address.

d. Choose Create user.

create user dialog of Amazon Cognito User Pools

4. Access the Kibana dashboard with the kibanaUrl link from the AWS CloudFormation stack outputs, or navigate to the Kibana link displayed in the Amazon Elasticsearch Service console.

a. In Kibana, choose the Dashboard icon in the left menu bar

b. Open the Lambda GC Activity dashboard.

You can test that new events appear by using the Kibana Developer Console:

POST gc-logs-2020.09.03/_doc
{
  "@timestamp": "2020-09-03T15:12:34.567+0000",
  "@gc_type": "Pause Young",
  "@gc_cause": "Allocation Failure",
  "@heap_before_gc": "2",
  "@heap_after_gc": "1",
  "@heap_size_gc": "9",
  "@gc_duration": "5.432",
  "@owner": "123456789012",
  "@log_group": "/aws/lambda/myfunction",
  "@log_stream": "2020/09/03/[$LATEST]123456"
}

5. When you go to the Lambda GC Activity dashboard you can see the new event. You must select the right timeframe with the Show dates link.

Lambda GC activity

The dashboard consists of six tiles:

  • In the Filters tile, you can optionally select the log group and filter for a specific AWS Lambda function execution context by the name of its log stream.
  • In the GC Activity Count by Execution Context tile, you see a heatmap of all filtered execution contexts by garbage collection activity count.
  • The GC Activity Metrics tile displays a graph of the metrics for all filtered execution contexts.
  • The GC Activity Count shows the number of garbage collection activities that are currently displayed.
  • The GC Duration shows the sum of the durations of all displayed garbage collection activities.
  • The GC Activity Raw Data at the bottom displays the raw items as ingested into the search cluster for further drill-down.

Configure your AWS Lambda function for garbage collection logging

1. The application that you want to monitor needs to log garbage collection activities. Currently the solution supports logs from Java 11. Add the following environment variable to your AWS Lambda function to activate the logging.

JAVA_TOOL_OPTIONS=-Xlog:gc:stderr:time,tags

The environment variables must reflect this parameter as in the following screenshot:

environment variables

2. Go to the streamLogs function that the stack created in the AWS Lambda console, and subscribe it to the log group of the function you want to monitor.

streamlogs function

3. Select Add Trigger.

4. Select CloudWatch Logs as Trigger Configuration.

5. Input a Filter name of your choice.

6. Input "[gc" (including quotes) as the Filter pattern to match all garbage collection log entries.

7. Select the Log Group of the function you want to monitor. The following screenshot subscribes to the logs of the application’s function resize-lambda-ResizeFn-[...].

add trigger

8. Select Add.

9. Execute the AWS Lambda function you want to monitor.

10. Refresh the dashboard in Amazon Elasticsearch Service and see the data point you added manually earlier appear in the graph.

Troubleshooting examples

Let’s look at an example function and draw some useful insights from the Java garbage collection log. The following diagrams show the Sample Amazon S3 function code for Java from the AWS Lambda documentation running in a Java 11 function with 512 MB of memory.

  • An S3 event from a new uploaded image triggers this function.
  • The function loads the image from S3, resizes it, and puts the resized version to S3.
  • The file size of the example image is close to 2.8MB.
  • The application is called 100 times with a pause of 1 second.

Memory leak

To demonstrate a memory leak, the function has been changed to keep all source images in memory in a class variable. Hence, the memory usage of the function keeps growing as it processes more images:

GC activity metrics

In the diagram, the heap size drops to zero at timestamp 12:34:00. The Amazon CloudWatch Logs of the function reveal an error; the next call to your code in the same AWS Lambda execution context runs with a fresh JVM:

Java heap space: java.lang.OutOfMemoryError
java.lang.OutOfMemoryError: Java heap space
 at java.desktop/java.awt.image.DataBufferByte.<init>(Unknown Source)
[...]

The JVM crashed and was restarted because of the error. You primarily rely on the Amazon CloudWatch Logs of your function to detect errors. The garbage collection log and its visualization provide additional information for root cause analysis:

Did the JVM run out of memory because a single image to resize was too large?

Or was the memory issue growing over time?

The latter could be an indication that you have a memory leak in your code.

The heap size is too small

To demonstrate a heap that was chosen too small, the memory leak from the preceding example has been resolved, but the function was configured with 128 MB of memory. From the baseline of the heap to the maximum heap size, only approximately 5 MB are used.

GC activity metrics

This results in a high management overhead in your JVM. You should experiment with a higher memory configuration to find the optimal performance, also taking cost into account. Check out the AWS Lambda Power Tuning open-source tool to do this in an automated fashion.

Fine-tuning the initial heap size

If you review the development of the heap size at the start of an execution context, you can see that the heap size is increased continuously. Each heap size change is an expensive operation that consumes time of your function, and the heap size keeps changing over time. In this example, the garbage collector logs 502 activities, which take almost 17 seconds overall.

GC activity metrics

This on-demand scaling is useful on a local workstation where the physical memory is shared with other applications. On AWS Lambda, the configured memory is dedicated to your function, so you can use it to its full extent.

You can do so by setting the minimum and maximum heap size to a fixed value by appending the -Xms and -Xmx parameters to the environment variable we introduced before.

The heap is not the only part of the JVM that consumes memory, so you must experiment with this setting and closely monitor the performance.

Start with the heap size that you observe to be working from the garbage collection log. If you set the heap size too large, your function will not initialize at all or break unexpectedly. Remember that the ability to tweak JVM parameters might change with future service features.

Let’s set 400 MB of the 512 MB memory and examine the results:

JAVA_TOOL_OPTIONS=-Xlog:gc:stderr:time,tags -Xms400m -Xmx400m

GC activity metrics

The preceding dashboard shows that the overall garbage collection duration was reduced by about 95%. The garbage collector had 80% fewer activities.

The garbage collection log entries displayed in the dashboard reveal that exclusively minor garbage collection (Pause Young) activities were triggered instead of major garbage collections (Pause Full). This is expected because the images are immediately discarded after the download, resize, and upload operations. The effect on the overall function duration over 100 invocations is a 5% decrease on average in this specific case.

Lambda duration

Cost estimation and clean up

Costs for the processing and transformation of your function’s Amazon CloudWatch Logs are incurred when your function is called. This cost depends on your application and how often garbage collection activities are triggered. Read an estimate of the monthly cost for the search cluster. If you no longer need the garbage collection monitoring, delete the subscription filter from the log group of your AWS Lambda function(s). Also, delete the stack of the solution above in the AWS CloudFormation console to clean up resources.

Conclusion

In this post, we examined further sources of data to gain insights about the health of your Java application. We also demonstrated a pipeline to ingest, transform, and visualize this information continuously in a Kibana dashboard. As a next step, launch the application from the AWS Serverless Application Repository and subscribe it to your applications’ logs. Feel free to submit enhancements to the application in the aws-samples repository or provide feedback in the comments.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Analyzing Amazon S3 server access logs using Amazon ES

Post Syndicated from Mahesh Goyal original https://aws.amazon.com/blogs/big-data/analyzing-amazon-s3-server-access-logs-using-amazon-es/

When you use Amazon Simple Storage Service (Amazon S3) to store corporate data and host websites, you need additional logging to monitor access to your data and the performance of your application. An effective logging solution enhances security and improves the detection of security incidents. As data storage needs increase, you may rely on Amazon S3 for a broad range of use cases while simultaneously looking for ways to analyze your logs to ensure compliance, perform audits, and discover risks.

Amazon S3 lets you monitor the traffic using the server access logging feature. With server access logging, you can capture and monitor the traffic to your S3 bucket at any time, with detailed information about the source of the request. The logs are stored in the S3 bucket you own in the same Region. This addresses the security and compliance requirements of most organizations. The logs are critical for establishing baselines, analyzing access patterns, and identifying trends. For example, the logs could answer a financial organization’s question about how many requests are made to a bucket and who is making what type of access requests to the objects.

You can discover insights from server access logs through several different methods. One common option is to use Amazon Athena or Amazon Redshift Spectrum to query the log files stored in Amazon S3. However, this solution suffers from high latency as the log volume grows, and it requires further integration with Amazon QuickSight to add visualization capabilities.

You can address this by using Amazon Elasticsearch Service (Amazon ES). Amazon ES is a managed service that makes it easier to deploy, operate, and scale Elasticsearch clusters in the AWS Cloud. Elasticsearch is a popular open-source search and analytics engine for use cases such as log analytics, real-time application monitoring, and clickstream analysis. The service provides support for open-source Elasticsearch APIs, managed Kibana, and integration with other AWS services such as Amazon S3 and Amazon Kinesis for loading streaming data into Amazon ES.

This post walks you through automating ingestion of server access logs from Amazon S3 into Amazon ES using AWS Lambda and visualizing the data in Kibana.

Architecture overview

Server access logging is enabled on the source buckets, and logs are delivered to the access log bucket. The access log bucket is configured to send an event to the Lambda function when a log file is created. On an event trigger, the Lambda function reads the file, processes the access log, and sends it to Amazon ES. When the logs are available, you can use Kibana to create interactive visuals and analyze the logs over a time period.
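
The following is a simplified sketch of that Lambda function's core logic, written in Python with the requests library and HTTP basic authentication against the domain's internal user database; the function deployed by the CloudFormation template may be structured differently, and the endpoint, credentials, index name, and captured fields are placeholders. It reads the newly created log object, parses each server access log record, and bulk-indexes the documents into Amazon ES.

import json
import re
import boto3
import requests

ES_ENDPOINT = "https://search-mydomain.us-east-1.es.amazonaws.com"  # placeholder
ES_AUTH = ("adminuser01", "<primary-user-password>")                # placeholder
INDEX = "access-log"

s3 = boto3.client("s3")

# Captures a subset of the space-delimited S3 server access log fields.
LOG_LINE = re.compile(
    r'^(?P<bucket_owner>\S+) (?P<bucket>\S+) \[(?P<time>[^\]]+)\] '
    r'(?P<remote_ip>\S+) (?P<requester>\S+) (?P<request_id>\S+) '
    r'(?P<operation>\S+) (?P<key>\S+) "(?P<request_uri>[^"]*)" '
    r'(?P<http_status>\S+) (?P<error_code>\S+) (?P<bytes_sent>\S+)'
)

def handler(event, context):
    """Triggered by the access log bucket; indexes each log record into Amazon ES."""
    bulk_lines = []
    for record in event["Records"]:
        obj = s3.get_object(
            Bucket=record["s3"]["bucket"]["name"],
            Key=record["s3"]["object"]["key"],
        )
        for line in obj["Body"].read().decode("utf-8").splitlines():
            match = LOG_LINE.match(line)
            if not match:
                continue
            bulk_lines.append(json.dumps({"index": {"_index": INDEX}}))
            bulk_lines.append(json.dumps(match.groupdict()))
    if bulk_lines:
        requests.post(
            f"{ES_ENDPOINT}/_bulk",
            data="\n".join(bulk_lines) + "\n",
            headers={"Content-Type": "application/x-ndjson"},
            auth=ES_AUTH,
        )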

When designing a log analytics solution for high-frequency incoming data, you should consider buffering layers to avoid instability in the system. Buffering helps you streamline processes for unpredictable incoming log data. For such use cases, you can take advantage of managed services like Amazon Kinesis Data Streams, Amazon Kinesis Data Firehose, and Amazon Managed Streaming for Apache Kafka (Amazon MSK).

Streaming services buffer data before delivering it to Amazon ES. This helps you avoid overwhelming your cluster with spiky ingestion events. Kinesis Data Firehose can reliably load data into Amazon ES. Kinesis Data Firehose lets you choose a buffer size of 1–100 MiB and a buffer interval of 60–900 seconds when Amazon ES is selected as the destination. Kinesis Data Firehose also scales automatically to match the throughput of your data and requires no ongoing administration. For more information, see Ingest streaming data into Amazon Elasticsearch Service within the privacy of your VPC with Amazon Kinesis Data Firehose.

The following diagram illustrates the solution architecture.

Prerequisites

Before creating resources in AWS CloudFormation, you must enable server access logging on the source bucket. Open the S3 bucket properties and enable server access logging with a target (delivery) bucket. See the following screenshot.

You also need an AWS Identity and Access Management (IAM) user with sufficient permissions to interact with the AWS Management Console and related AWS services. The user must have access to create IAM roles and policies via the CloudFormation template.

Setting up the resources with AWS CloudFormation

First, deploy the CloudFormation template to create the core components of the architecture. AWS CloudFormation automates the deployment of technology and infrastructure in a safe and repeatable manner across multiple Regions and multiple accounts with the least amount of effort and time.

  1. Sign in to the console and choose the Region of the bucket storing the access log. For this post, I use us-east-1.
  2. Launch the stack:
  3. Choose Next.
  4. For Stack name, enter a name.
  5. On the Parameters page, enter the following parameters:
    1. VPC Configuration – Select any VPC that has at least two private subnets. The template deploys the Amazon ES service domain and Lambda within the VPC.
    2. Private subnets – Select two private subnets of the VPC. The route tables associated with the subnets must have a NAT gateway configuration and a VPC endpoint for Amazon S3 so that Lambda can privately connect to the bucket.
    3. Access log S3 bucket – Enter the S3 bucket where access logs are delivered. The template configures event notification on the bucket to trigger the Lambda function.
    4. Amazon ES domain name – Specify the Amazon ES domain name to be deployed through the template.
  6. Choose Next.
  7. On the next page, choose Next.
  8. Acknowledge resource creation under Capabilities and transforms and choose Create.

The stack takes about 10–15 minutes to complete. The CloudFormation stack does the following:

  • Creates an Amazon ES domain with fine-grained access control enabled on it. Fine-grained access control is configured with a primary user in the internal user database.
  • Creates an IAM role for the Lambda function with the required permissions to read from the S3 bucket and write to Amazon ES.
  • Creates the Lambda function within the same VPC as the Amazon ES elastic network interfaces (ENIs). Amazon ES places an ENI in the VPC for each of your data nodes. The communication from Lambda to the Amazon ES domain is via these ENIs.
  • Configures an object-created event notification on the access log S3 bucket to trigger the Lambda function. The function code segments are discussed in detail in this GitHub project.

You should consider several factors before you proceed with a production-grade deployment. For this post, I use one primary shard with no replicas. As a best practice, we recommend deploying your domain into three Availability Zones with at least two replicas. This configuration lets Amazon ES distribute replica shards to different Availability Zones than their corresponding primary shards and improves the availability of your domain. For more information about sizing your Amazon ES domain, see Get started with Amazon Elasticsearch Service: T-shirt-size your domain.

We recommend setting the shard count based on your estimated index size, using 50 GB as a maximum target shard size. You should also define an index template to set the primary and replica shard counts before index creation. For more information about best practices, see Best practices for configuring your Amazon Elasticsearch Service domain.
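
For illustration, the following sketch defines such a template with the legacy _template API, assuming Python with the requests library, a placeholder domain endpoint and credentials, and an access-log-* index naming pattern; adjust the pattern and shard counts to your own sizing.

import requests

ES_ENDPOINT = "https://search-mydomain.us-east-1.es.amazonaws.com"  # placeholder
ES_AUTH = ("adminuser01", "<primary-user-password>")                # placeholder

# Apply the template before the first index is created so that new
# access-log-* indexes pick up the desired shard and replica counts.
template = {
    "index_patterns": ["access-log-*"],
    "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 2,
    },
}

response = requests.put(
    f"{ES_ENDPOINT}/_template/access-log",
    json=template,
    auth=ES_AUTH,
)
print(response.json())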

For high-frequency incoming data, you can rotate indexes either per day or per week depending on the size of data being generated. You can use Index State Management to define custom management policies to automate routine tasks and apply them to indexes and index patterns.

Creating the Kibana user

With Amazon ES, you can use fine-grained access control to manage access to your data. Fine-grained access control adds multiple capabilities to give you tighter control over your data. This feature includes the ability to use roles to define granular permissions for indexes, documents, or fields and to extend Kibana with read-only views and secure multi-tenant support. For more information on granular access control, see Fine-Grained Access Control in Amazon Elasticsearch Service.

For this post, you create a fine-grained role for Kibana access and map it to a user.

  1. Navigate to Kibana and enter the primary user credentials:
    1. User name – adminuser01
    2. Password – [email protected]

To access Kibana, you must have access to the VPC. For more information about accessing Kibana, see Controlling Access to Kibana.

  1. Choose Security, Roles.
  2. For Role name, enter kibana_only_role.
  3. For Cluster-wide permissions, choose cluster_composite_ops_ro.
  4. For Index patterns, enter access-log and kibana.
  5. For Permissions: Action Groups, choose read, delete, index, and manage.
  6. Choose Save Role Definition.
  7. Choose Security, Internal User Database, and Create a New User.
  8. For Open Distro Security Roles, choose kibana_only_role (created earlier).
  9. Choose Submit.

The user kibanauser01 now has full access to Kibana and access-logs indexes. You can log in to Kibana with this user and create the visuals and dashboards.

Building dashboards

You can use Kibana to build interactive visuals and analyze the trends and combine the visuals for different use cases in a dashboard. For example, you may want to see the number of requests made to the buckets in the last two days.

  1. Log in to Kibana using kibanauser01.
  2. Create an index pattern and set the time range
  3. On the Visualize section of your Kibana dashboard, add a new visualization.
  4. Choose Vertical Bar.

You can select any time range and visual based on your requirements.

  1. Choose the index pattern and then configure your graph options.
  2. In the Metrics pane, expand Y-Axis.
  3. For Aggregation, choose Count.
  4. For Custom Label, enter Request Count.
  5. Expand the X-Axis
  6. For Aggregation, choose Terms.
  7. For Field, choose bucket.
  8. For Order By, choose metric: Request Count.
  9. Choose Apply changes.
  10. Choose Add sub-bucket and expand the Split Series
  11. For Sub Aggregation, choose Date Histogram.
  12. For Field, choose requestdatetime.
  13. For Interval, choose Daily.
  14. Apply the changes by choosing the play icon at the top of the page.

You should see the visual on the right side, similar to the following screenshot.

You can combine graphs of different use cases into a dashboard. I have built some example graphs for general use cases like the number of operations per bucket, user action breakdown for buckets, HTTP status rate, top users, and tabular formatted error details. See the following screenshots.

Cleaning up

Delete all the resources deployed through the CloudFormation template to avoid any unintended costs.

  1. Disable the access log on the source bucket.
  2. On the AWS CloudFormation console, identify the stack, and delete it.

Summary

This post detailed a solution to visualize and monitor Amazon S3 access logs using Amazon ES to ensure compliance, perform security audits, and discover risks and patterns at scale with minimal latency. To learn about best practices of Amazon ES, see Amazon Elasticsearch Service Best Practices. To learn how to analyze and create a dashboard of data stored in Amazon ES, see the AWS Security Blog.


About the Authors

Mahesh Goyal is a Data Architect in Big Data at AWS. He works with customers in their journey to the cloud with a focus on big data and data warehouses. In his spare time, Mahesh likes to listen to music and explore new food places with his family.

 

 

 

 

Power data analytics, monitoring, and search use cases with the Open Distro for Elasticsearch SQL Engine on Amazon ES

Post Syndicated from Viraj Phanse original https://aws.amazon.com/blogs/big-data/power-data-analytics-monitoring-and-search-use-cases-with-the-open-distro-for-elasticsearch-sql-engine-on-amazon-es/

Amazon Elasticsearch Service (Amazon ES) is a popular choice for log analytics, search, real-time application monitoring, clickstream analysis, and more. One commonality among these use cases is the need to write and run queries to obtain search results at lightning speed. However, doing so requires expertise in the JSON-based Elasticsearch query domain-specific language (Query DSL). Although Query DSL is powerful, it has a steep learning curve, and wasn’t designed as a human interface to easily create one-time queries and explore user data.

To solve this problem, we provided the Open Distro for Elasticsearch SQL Engine on Amazon ES, which we have been expanding since the initial release. The Structured Query Language (SQL) engine is powered by Open Distro for Elasticsearch, an Apache 2.0 licensed distribution of Elasticsearch. For more information about the Open Distro project, see Open Distro for Elasticsearch. For more information about the SQL engine capabilities, see SQL.

As part of this continued investment, we’re happy to announce new capabilities, including a Kibana-based SQL Workbench and a new SQL CLI that makes it even easier for Amazon ES users to use the Open Distro for Elasticsearch SQL Engine to work with their data.

SQL is the de facto standard for data and analytics and one of the most popular languages among data engineers and data analysts. Introducing SQL in Amazon ES allows you to manifest search results in a tabular format, with documents represented as rows, fields as columns, and indexes as table names referenced in the FROM clause. This acts as a straightforward and declarative way to represent complex DSL queries in a readable format. The newly added tools can act as a powerful yet simplified way to extract and analyze data, and can support complex analytics use cases.
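
As a quick illustration, the following sketch sends a SQL query to the Open Distro SQL plugin's _opendistro/_sql endpoint on an Amazon ES domain, assuming Python with the requests library; the domain endpoint, credentials, and the accounts index with its fields are hypothetical.

import requests

ES_ENDPOINT = "https://search-mydomain.us-east-1.es.amazonaws.com"  # placeholder
ES_AUTH = ("masteruser", "<master-user-password>")                  # placeholder

# Aggregate documents of a hypothetical index as if it were a SQL table.
query = """
SELECT state, COUNT(*) AS customers, AVG(balance) AS avg_balance
FROM accounts
WHERE age > 30
GROUP BY state
ORDER BY customers DESC
LIMIT 10
"""

response = requests.post(
    f"{ES_ENDPOINT}/_opendistro/_sql",
    json={"query": query},
    auth=ES_AUTH,
)
print(response.json())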

Features overview

The following is a brief overview of the features of Open Distro for Elasticsearch SQL Engine on Amazon ES:

  • Query tools
    • SQL Workbench – A comprehensive and integrated visual tool to run on-demand SQL queries, translate SQL into its REST equivalent, and view and save results as text, JSON, JDBC, or CSV. The following screenshot shows a query on the SQL Workbench page.

  • SQL CLI – An interactive, standalone command line tool to run on-demand SQL queries, translate SQL into its REST equivalent, and view and save results as text, JSON, JDBC, or CSV. The following screenshot shows a query on the CLI.

  • Connectors and drivers
    • ODBC driver – The Open Database Connectivity (ODBC) driver enables connecting with business intelligence (BI) applications such as Tableau and exporting data to CSV and JSON.
    • JDBC driver – The Java Database Connectivity (JDBC) driver also allows you to connect with BI applications such as Tableau and export data to CSV and JSON.
  • Query support
    • Basic queries – You can use the SELECT clause, along with FROM, WHERE, GROUP BY, HAVING, ORDER BY, and LIMIT to search and aggregate data.
    • Complex queries – You can perform complex queries such as subquery, join, and union on more than one Elasticsearch index.
    • Metadata queries – You can query basic metadata about Elasticsearch indexes using the SHOW and DESCRIBE commands.
  • Delete support
    • Delete – You can delete all the documents or documents that satisfy predicates in the WHERE clause from search results. However, it doesn’t delete documents from the actual Elasticsearch index.
  • JSON and full-text search support
    • JSON – Support for JSON by following PartiQL specification, a SQL-compatible query language, lets you query semi-structured and nested data for any data format.
    • Full-text search support – Full-text search on millions of documents is possible by letting you specify the full range of search options using SQL commands such as match and score.
  • Functions and operator support
    • Functions and operators – Support for string functions and operators, numeric functions and operators, and date-time functions is possible by enabling fielddata in the document mapping.
  • Settings
    • Settings – You can view, configure, and modify settings to control the behavior of SQL without needing to restart or bounce the Elasticsearch cluster.
  • Interfaces
    • Endpoints – The explain endpoint allows translating SQL into Query DSL, and the cursor helps obtain a paginated response for the SQL query result.
  • Monitoring
    • Monitoring – You can obtain node-level statistics by using the stats endpoint.
  • Request and response protocols

Conclusion

Open Distro for Elasticsearch SQL Engine on Amazon ES provides a comprehensive, flexible, and user-friendly set of features to obtain search results from Amazon ES in a declarative manner using SQL. For more information about querying with SQL, see SQL Support for Amazon Elasticsearch Service.

 


About the Author

Viraj Phanse (@vrphanse) is a product management leader at Amazon Web Services for Search Services/Analytics. Prior to AWS, he was in product management/strategy and go-to-market leadership roles at Oracle, Aerospike, INSZoom and Persistent Systems. He is a Fellow and Selection Committee member at Berkeley Angel Network, and a Big Data Advisory Board Member at San Francisco State University. He has completed his M.S. in Computer Science from UCLA and MBA from UC Berkeley’s Haas School of Business.

 

 

Creating customized Vega visualizations in Amazon Elasticsearch Service

Post Syndicated from Markus Bestehorn original https://aws.amazon.com/blogs/big-data/creating-customized-vega-visualizations-in-amazon-elasticsearch-service/

Computers can easily process vast amounts of data in their raw format, such as databases or binary files, but humans require visualizations to be able to derive facts from data. The plethora of tools and services such as Kibana (as part of Amazon ES) or Amazon QuickSight to design visualizations from a data source are a testament to this need.

Such tools often provide out-of-the-box templates for designing simple graphs from appropriately pre-processed data, but applying these to production-grade, complex visualizations can be challenging for several reasons:

  • The raw data upon which a visualization is built may contain encoded attributes that aren’t understandable for the viewer. For example, the following layered bar chart could be raw data about people where the gender is encoded in numbers, but the visualization should still have human-readable attribute values.
  • Visualizations are often built on aggregated summaries of raw data. Storing and maintaining different aggregations for each visualization may be unfeasible. For instance, if you classify raw data items in multiple dimensions (for example, classifying cars by color or engine size) and build visualizations on these different dimensions, you need to store different aggregated materialized views of the same data.
  • Different data views used in a single visualization may require an ad hoc computation over the underlying data to generate an appropriate foundational data source.

This post shows how to implement Vega visualizations included in Kibana, which is part of Amazon Elasticsearch Service (Amazon ES), using a real-world clickstream data sample. Vega visualizations are an integrated scripting mechanism of Kibana to perform on-the-fly computations on raw data to generate D3.js visualizations. For this post, we use a fully automated setup using AWS CloudFormation to show how to build a customized histogram for a web analytics use case. This example implements an ad hoc map-reduce like aggregation of the underlying data for a histogram.

Use case

For this post, we use Online Shopping Store – Web Server Logs published by Harvard Dataverse. The 3.3 GB dataset contains 10,365,152 access logs of an online shopping store. For example, see the following data sample:

207.46.13.136 - - [22/Jan/2019:03:56:19 +0330] "GET /product/14926 HTTP/1.1" 404 33617 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"

Each entry contains a source IP address (for the preceding data sample, 207.46.13.136), the HTTP request and return code (“GET /product/14926 HTTP/1.1” 404), the size of the response in bytes (33617), and additional metadata, such as a timestamp. For this use case, we assume the role of a web server administrator who wants to visualize the response message size in bytes for the traffic over a specific time period. We want to generate a histogram of the response message sizes over a given period that looks like the following screenshot.

In the preceding screenshot, the majority of the response sizes are less than 1 MB. Based on this visualization, the administrator of the online shop could identify the following:

  • Requests that result in high bandwidth use
  • Distributed denial of service (DDoS) attacks caused by repeatedly requesting pages, which causes a high amount of response traffic

The built-in Vertical / Horizontal Bar visualization options in Kibana can’t produce this histogram over this Elasticsearch index without re-storing the raw data. In addition, generating the preceding histogram requires complex transformations and aggregations over the data. Storing the data in a histogram-friendly way in Amazon ES and building a visualization with Vertical / Horizontal Bar components requires an ETL (extract, transform, and load) of the data to a different index. Such an ETL creates unnecessary storage and compute costs. To avoid these costs and complex workflows, Vega provides a flexible approach that can execute such transformations on the fly on the server log data.

We have built some of the necessary transformations outside of our Vega code to keep the code presented in this post concise. This was done to improve the readability of this post; the complete transformation could have also been done in Vega.

Overview of solution

To avoid unnecessary costs and focus on Vega visualization creation task in this post, we use an AWS Lambda function to stream the access logs out of an Amazon Simple Storage Service (Amazon S3) bucket instead of serving them from a web server (this Lambda function contains some of the transformations that have been removed to improve the readability of this post). In a production deployment, you replace the function and S3 bucket with a real web server and the Amazon Kinesis Agent, but the rest of the architecture remains unchanged. AWS CloudFormation deploys the following architecture into your AWS account.

The Lambda function reads and transforms the access logs out of an S3 bucket to stream them into Amazon Kinesis Data Firehose, which forwards the data to Amazon ES. Kinesis Data Firehose is the easiest way to reliably load streaming data into data lakes, data stores, and analytics tools, and therefore it is suitable for this task. The available data will be automatically transferred to Amazon ES after the deployment.

Prerequisites and deploying your CloudFormation template

To follow the content of this post, including a step-by-step walkthrough to build a customized histogram with Vega visualizations on Amazon ES, you need an AWS account and access rights to deploy a CloudFormation template. The Elasticsearch cluster and other resources the template creates result in charges to your AWS account. During a test-run in eu-west-1, we measured costs of $0.50 per hour. Complete the following steps:

  1. Depending on which Region you want to use, sign in to the AWS Management Console and choose one of the following Regions to deploy the necessary resources in your account:
  2. Keep all the default parameters as is.
  3. Select the check boxes, I acknowledge that AWS CloudFormation might create IAM resources with custom names and I acknowledge that AWS CloudFormation might require the following capability: CAPABILITY_AUTO_EXPAND.
  4. Choose Create Stack to start the deployment.

The deployment takes approximately 20–25 minutes and is finished when the status switches to CREATE_COMPLETE.

Connecting to Kibana

After you deploy the stack, complete the following steps to log in to Kibana and build the visualization inside Kibana:

  1. On the Outputs tab of your CloudFormation stack, choose the URL for KibanaLoginURL.

This opens the Kibana login prompt.

  2. Unless you changed these at the start of the CloudFormation stack deployment process, the username and password are as follows:

Amazon Cognito requires you to change the password, which Kibana prompts you for.

  3. Enter a new password and a name for yourself.

Remember this password; there is no recovery option in this setup if you need to re-authenticate.

Upon completing these steps, you should arrive at the Kibana home screen (see the following screenshot).

You’re ready to build a Vega visualization in Kibana. We guide through this process in the following sections.

Creating an index pattern and exploring the data

Kibana visualizations are based on data stored in your Elasticsearch cluster, and the data is stored in an Elasticsearch index called vega-visu-blog-index. To create an index pattern, complete the following steps:

  1. On the Kibana home screen, choose Discover.
  2. For Index Pattern, enter “vega*”.
  3. Choose Next step.

  4. For Time Filter field name, choose timestamp.
  5. Choose Create index pattern.

After a few seconds, a summary page with details about the created index pattern appears.

The index pattern is now created and you can use it for queries and visualizations. When you choose Discover, you see a page that shows your index pattern. By default, Kibana displays data of the last 15 minutes, but given that the data is historical, you have to properly select the time range. For this use case, the data spans January 1, 2019 to February 1, 2019. After you configure the time range, choose Update.

In addition to a graphical representation of the tuple count over time, Kibana shows the raw data in the following format:

size_in_bytes: { "size": 13000, "freq": 11 }, …

The following screenshot shows both the visualization and raw data.

As noted before, we executed a pre-aggregation step on the data, which counted the number of requests in the log file with a given size. The message sizes were truncated into steps of 100 bytes. Therefore, the preceding screenshot indicates that there were 11 requests with 13,000–13,099 bytes in size during the minute starting at 13:59 on January 26, 2019. The next section shows how to create the aggregated histogram from this data using a Vega visualization that performs a series of data transformations implicitly and on the fly, which Amazon ES runs without changing the underlying data stored in the index.
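
To make that stored format concrete, the following is a minimal sketch of the pre-aggregation step, assuming Python and timestamps already parsed into datetime objects; the Lambda function deployed by the CloudFormation template performs an equivalent transformation. It truncates each response size to a 100-byte step and counts the requests per minute, producing documents shaped like the one shown above.

from collections import Counter, defaultdict

def pre_aggregate(records):
    """records: iterable of (timestamp, size_in_bytes) tuples parsed from the access log.

    Returns one document per minute with the request count per 100-byte size bucket,
    matching the size_in_bytes structure stored in the vega-visu-blog-index index.
    """
    per_minute = defaultdict(Counter)
    for timestamp, size in records:
        minute = timestamp.replace(second=0, microsecond=0)
        bucket = (size // 100) * 100  # truncate the response size to steps of 100 bytes
        per_minute[minute][bucket] += 1

    return [
        {
            "timestamp": minute.strftime("%Y/%m/%d %H:%M:%S"),
            "size_in_bytes": [
                {"size": bucket, "freq": freq}
                for bucket, freq in sorted(counts.items())
            ],
        }
        for minute, counts in per_minute.items()
    ]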

Vega on-the-fly data transformation

To properly compute the desired histogram, you need to transform the data, because the format and temporal granularity of the data stored in the Elasticsearch index don’t match the format required by the visualization interface. For this use case, we use the scripted metric aggregation pattern, which calls stored scripts written in Painless to implement the following four steps:

  1. Variable initialization
  2. Mapping documents to a key-value structure for the histogram
  3. Grouping values into arrays by keys
  4. Aggregation of array content into the histogram

To implement the first step for initializing the variables used by the other steps, choose Dev Tools on the left side of the screen. The development console allows you to define scripts that are later called by the Vega visualization. The following code for the first step is very simple because it only initializes an empty map variable state.test:

#init_script
POST _scripts/initialize_variables
{
  "script": {
    "lang" : "painless",
    "source" : "state.test = [:]"
  }
}

Enter the preceding code into the left-hand input section of the Kibana Dev Tools page, select the code, and choose the green triangular button. Kibana acknowledges the loading of the script in the output (see the following screenshot).

To improve the readability of the rest of this section, we will show the result of each step based on the following initial input JSON:

{
  "timestamp": "2019/01/26 12:58:00",
  "size_in_bytes": [
    {"size": 100, "freq": 369},
    {"size": 200, "freq": 62},
    …
  ]
}
{
  "timestamp": "2019/01/26 12:57:00",
  "size_in_bytes": [
    {"size": 100, "freq": 386},
    {"size": 200, "freq": 60},
    …
  ]
}

The next step of the transformation maps the request size in each document of the index to its count of occurrences as key-value pairs; for example, with the preceding data, the first size_in_bytes entry is transformed into "100": 369. As in the previous step, implement this part of the transformation by entering the following code into the Kibana Dev Tools page and choosing the green button to load the script.

#map_script
POST _scripts/map_documents
{
  "script": {
    "lang": "painless",
    "source": """
      if (doc.containsKey(params.bucketField))
        for (int i=0; i<doc[params.bucketField].length; i++)
        {
          def key = doc[params.bucketField][i];
          def value = doc[params.countField][i];
          state.test[(key).toString()] = value;
        }
        """
  }
}

Using our example input data results for this step, the following output is computed:

{"100": 369, "200": 62, …}
{"100": 386, "200": 60, …}

As shown above, the script’s outputs are JSON documents. Because the desired histogram aggregates data over all documents, you need to combine or merge the data by key in the next step. Enter the following code:

#combine_script
POST _scripts/combine_documents
{
  "script": {
    "lang": "painless",
    "source": "return state.test"
  }
}

With the example intermediate output from the previous step, the following intermediate output is computed:

{
  "100": [369, 386],
  "200": [62, 60], …
}

Finally, you can aggregate the count value arrays returned by the combine_documents script into one scalar value per message size, as implemented by the aggregate_histogram_buckets script. See the following code:

#reduce_script
POST _scripts/aggregate_histogram_buckets
{
  "script": {
    "lang": "painless",
    "source": """
      Map result = [:];
      states.forEach(value ->
        value.forEach((k,v) ->
          {result.merge(k, v, (value1, value2) -> value1+value2)}
        )
      );
      return result.entrySet().stream().sorted(
        Comparator.comparingInt((entry) -> 
        Integer.parseInt(entry.key))
      ).map(
        entry -> [
          "messagesize": Integer.parseInt(entry.key),
          "messagecount": entry.value
        ]).collect(Collectors.toList())
      """
  }
}

The final output for our computation example has the following format:

{
  "messagesize": 100,
  "messagecount": 755
}
{
  "messagesize": 200,
  "messagecount": 122
}

This concludes the implementation of the on-the-fly data transformation used for the Vega visualization of this post. The documents are transformed into buckets with two fields, messagesize and messagecount, which contain the corresponding data for the histogram.

Creating a Vega-Lite visualization

The Vega visualization generates D3.js representations of the data using the on-the-fly transformation discussed earlier. You can create this histogram visualization using Vega or Vega-Lite visualization grammars, which are both supported by Kibana. Vega-Lite is sufficient to implement our use case. To make use of the transformation, create a new visualization in Kibana and choose Vega.

A Vega visualization is created by a JSON document that describes the content and transformations required to generate the visual output. In our use case, we use the vega-visu-blog-index index and the four transformation steps in a scripted metric aggregation operation to generate the data property which provides the main content suitable to visualize as a histogram. The second part of the JSON document specifies the graph type, axis labels, and binning to format the visualization as required for our use case. The full Vega-Lite visualization JSON is as follows:

{
  $schema: https://vega.github.io/schema/vega-lite/v2.json
  data: {
    name: table
    url: {
      index: vega-visu-blog-index
      %timefield%: timestamp
      %context%: true
      body: {
        aggs: {
          temporal_hist_agg: {
            scripted_metric: {
              init_script: {
                id: "initialize_variables"
              }
              map_script: {
                id: map_documents
                params: {
                  bucketField: size_in_bytes.size
                  countField: size_in_bytes.freq
                }
              }
              combine_script: {
                id: "combine_documents"
                }
              reduce_script: {
                id: "aggregate_histogram_buckets"
              }
            }
          }
        }
        size: 0
      }
    }
    format: {property: "aggregations.temporal_hist_agg.value"}
  }
  mark: bar
  title: {
    text: Temporal Message Size Histogram
    frame: bounds
  },
  encoding: {
    x: {
      field: messagesize
      type: ordinal
      axis: {
        title: "Message Size Bucket"
        format: "~s"
      }
      bin: {
        binned: true,
        step: 10000
      }
    }
    y: {
      field: messagecount
      type: quantitative
      axis: {
        title: "Count"
        }
    }
  }
}

Replace all text from the code pane of the Vega visualization designer with the preceding code and choose Apply changes. Kibana computes the visualization calling the four stored scripts (mentioned in the previous section) for on-the-fly data transformations and displays the desired histogram. Afterwards, you can use the visualization just like the other Kibana visualizations to create Kibana dashboards.

To use the visualization in dashboards, save it by choosing Save. Changing the parameters of the visualization automatically results in a re-computation, including the data transformation. You can test this by changing the time period.

Exploring and debugging scripted metric aggregation

The debugger included into most modern browsers allows you to view and test the transformation result of the scripted metric aggregation. To view the transformation result, open the developer tools in your browser, choose Console, and enter the following code:

VEGA_DEBUG.view.data('table')

The Vega Debugger shows the data as a tree, which you can easily explore.

The developer tools are useful when writing transformation scripts to test the functionality of the scripts and manually explore their output.

Cleaning up

To delete all resources and stop incurring costs to your AWS account, complete the following steps:

  1. On the AWS CloudFormation console, from the list of deployed stacks, choose vega-visu-blog.
  2. Choose Delete.

The process can take up to 15 minutes, and removes all the resources you deployed for following this post.

Conclusion

Although Kibana itself provides powerful built-in visualization methods, these approaches require the underlying data to have the right format. This creates issues when the same data is used for different visualizations or simply not available in an ideal format for your visualization. Because storing different aggregations or views of the same data isn’t a cost-effective approach, this post showed how to generate customized visualizations using Amazon ES, Kibana, and Vega visualizations with on-the-fly data transformations.

 


About the authors

Markus Bestehorn is a Principal Prototyping Engagement Manager at AWS. He is responsible for building business-critical prototypes with AWS customers, and is a specialist for IoT and machine learning. His “career” started as a 7-year-old when he got his hands on a computer with two 5.25” floppy disks, no hard disk, and no mouse, on which he started writing BASIC, and later C, as well as C++ programs. He holds a PhD in computer science and all currently available AWS certifications. When he’s not on the computer, he runs or climbs mountains.

 

 

Anil Sener is a Data Prototyping Architect at AWS. He builds prototypes on Big Data Analytics, Streaming, and Machine Learning, which accelerates the production journey on AWS for top EMEA customers. He has two Master Degrees in MIS and Data Science. He likes to read about history and philosophy in his free time.

 

Simplifying and modernizing home search at Compass with Amazon ES

Post Syndicated from Sakti Mishra original https://aws.amazon.com/blogs/big-data/simplifying-and-modernizing-home-search-at-compass-with-amazon-es/

Amazon Elasticsearch Service (Amazon ES) is a fully managed service that makes it easy for you to deploy, secure, and operate Elasticsearch in AWS at scale. It’s a widely popular service and different customers integrate it in their applications for different search use cases.

Compass rearchitected their search solution with AWS services, including Amazon ES, to deliver high-quality property searches and saved searches for their customers.

In this post, we learn how Compass’s search solution evolved, what challenges and benefits they found with different architectures, and how Amazon ES gives them a long-term scalable solution. We also see how Amazon Managed Streaming for Apache Kafka (Amazon MSK) helped create event-driven, real-time streaming capabilities of property listing data. You can apply this solution to similar use cases.

Overview of Amazon ES

Amazon ES makes it easy to deploy, operate, and scale Elasticsearch for log analytics, application monitoring, interactive search, and more. It’s a fully managed service that delivers the easy-to-use APIs and real-time capabilities of Elasticsearch along with the availability, scalability, and security required by real-world applications. It offers built-in integrations with other AWS services, including Amazon Kinesis, AWS Lambda, and Amazon CloudWatch, and third-party tools like Logstash and Kibana, so you can go from raw data to actionable insights quickly.

Amazon ES also has the following benefits:

  • Fully managed – Launch production-ready clusters in minutes. No more patching, versioning, and backups.
  • Access to all data – Capture, retain, correlate, and analyze your data all in one place.
  • Scalable – Resize your cluster with a few clicks or a single API call.
  • Secure – Deploy into your VPC and restrict access using security groups and AWS Identity and Access Management (IAM) policies.
  • Highly available – Replicate across Availability Zones, with monitoring and automated self-healing.
  • Tightly integrated – Seamless data ingestion, security, auditing, and orchestration.

Overview of Amazon MSK

Amazon MSK is a fully managed service that makes it easy for you to build and run applications that use Apache Kafka to process streaming data. Apache Kafka is an open-source platform for building real-time streaming data pipelines and applications. With Amazon MSK, you can use native Apache Kafka APIs to populate data lakes, stream changes to and from databases, and power machine learning and analytics applications.

Overview of Compass

Urban Compass, Inc. (Compass) operates as a global real estate technology company. The Company provides an online platform that supports buying, renting, and selling real estate assets.

In their own words, “Compass is building the first-of-its-kind modern real estate platform, pairing the industry’s top talent with technology to make the search and sell experience intelligent and seamless. Compass operates in over 24 markets, with over 2 billion in sales this year to date, with 2,300 employees and over 15,000 agents with a vision to find a place for everybody in the world.”

How Compass uses search to help customers

Search is one of Compass’s primary features, which enables its website visitors and agents to look for properties within the platform. The Compass platform has the following search components:

  • Search services – Uses Amazon ES extensively to power searches across hyperlocal real estate data (spanning thousands of attributes). This search component acquires data through Amazon MSK after it’s processed in Apache Spark and stored in Amazon Aurora PostgreSQL.
  • Agent and consumer search – A frontend built on top of search services, which works as an interface between agents, consumers, and the Compass search services. It’s built in React and enables you to seamlessly search real estate data and access hyper-local filters.
  • Saved search – As a consumer, when you run a search and save it, the saved search indexes are updated with your search parameters. When a new listing comes into the system (indexed through the listings Elasticsearch index), the Compass search identifies the saved searches that match the new listing using the percolate feature of Elasticsearch and notifies you of the new property listing.

Evolution of the consumer search

The following sections get into the details of the Compass primary search feature and how its architecture evolved over time, starting with its initial Apache Lucene architecture.

Previous architecture with Apache Lucene

Compass started implementing its search functionalities with direct integration with Lucene, where it was configured through manually provisioned AWS virtual machines and manual installation.

In the following architecture diagram, the MLS (Master Listing Service) system’s data gets pushed through a common extract, transform, and load (ETL) framework and is populated into the listing database. From there, the data gets pushed to the Lucene cluster to be queried by the frontend search REST APIs.

This architecture had different pain points that involved a lot of heavy lifting with Lucene, such as:

  • Manual maintenance and configuration required specialized knowledge
  • Challenges around disk forecasting and growth
  • Scalability and high performance achieved through manual sharding
  • Configuring new data fields was labor-intensive

New architecture with Amazon ES

To overcome these challenges, Compass started using Amazon ES.

The following architecture diagram shows the evolution of Compass’s system. Compass ran Amazon ES and Lucene in parallel, shadowing the traffic to verify and quality-assure the results in production. They ran the services side by side for two months, until they were confident that Amazon ES produced the same results.

In addition, Compass added full text search (description search) immediately after switching to Amazon ES. As a result, listings and features became available to their customers faster, which allowed Compass to expand nationally and implement a localized search.

Enhanced architecture with Amazon MSK

Compass further enhanced its architecture with Amazon MSK, which enabled parallel processing by different teams that push transformed events to the Kafka cluster. The following diagram shows the enhanced architecture.

With the new architecture, Compass search saw some immediate wins:

  • Reduced maintenance costs because Amazon ES is a managed service and you don’t need to take on the overhead of cluster administration or management
  • Additional performance benefits because the index build time was reduced from 8 hours to 1 hour
  • Ability to create or tear down clusters with just one click, which eased maintenance

Because of the Amazon ES user interface and monitoring capabilities, the Compass team could assess the usage pattern easily and perform capacity planning and cost prediction.

Implementing saved searches with the Elasticsearch percolator

The Elasticsearch percolator inverts the query-document usage pattern—you index queries instead of documents and query with documents against the indexed queries. The search results for a percolate query are the queries that match the document.
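
The following is a minimal sketch of this pattern against a generic Elasticsearch endpoint, assuming Python with the requests library; the index name, fields, endpoint, credentials, and saved query are hypothetical and much simpler than Compass's production setup.

import requests

ES = "https://search-mydomain.us-east-1.es.amazonaws.com"  # placeholder endpoint
AUTH = ("user", "<password>")                              # placeholder credentials

# 1. Create an index whose documents are queries (note the percolator field type).
requests.put(f"{ES}/saved-searches", auth=AUTH, json={
    "mappings": {
        "properties": {
            "query": {"type": "percolator"},
            "city": {"type": "keyword"},
            "price": {"type": "integer"},
        }
    }
})

# 2. Index a user's saved search as a query document.
requests.put(f"{ES}/saved-searches/_doc/user-123", auth=AUTH, json={
    "query": {
        "bool": {
            "filter": [
                {"term": {"city": "seattle"}},
                {"range": {"price": {"lte": 500000}}},
            ]
        }
    }
})

# 3. When a new listing arrives, percolate it to find all matching saved searches.
response = requests.post(f"{ES}/saved-searches/_search", auth=AUTH, json={
    "query": {
        "percolate": {
            "field": "query",
            "document": {"city": "seattle", "price": 450000},
        }
    }
})
print([hit["_id"] for hit in response.json()["hits"]["hits"]])  # e.g. ['user-123']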

Compass used this feature to implement their saved search feature by indexing each user’s search queries. When a property listing arrives, it uses Amazon ES to retrieve the queries that match and notifies the respective customers.

The following diagram illustrates this workflow.

Before percolate, Compass reran saved search queries to inspect for new matches. Percolate allows them to notify users in a timely manner about changes in their searches. On average, Compass stores 250,000 search queries.

When new listings arrive, they submit an average of 5–10 bulk requests per minute, where each bulk request contains 1,000 documents. The maximum latency on the Amazon ES side varies from 750–2,500 milliseconds, with an 18-node cluster of m5.12xlarge instance types.

The following diagram shows the saved search architecture. Search criteria are saved in search indexes. When a new listing arrives through the MLS and the Amazon MSK listing stream, the percolate processor runs and pushes a message to the Amazon MSK saved search match topic stream. The message is then pulled by the saved search service, which pushes notifications to the end user.

The architecture has the following benefits:

  • When you look for properties against specific search criteria, the criteria get saved in Amazon ES. As agents add properties into the system that match the saved search criteria, you get notified that a new property has been added that matches your criteria.
  • Because of the percolate feature, you get notified as soon as a property is added into the system, which reduced the lag from 1 hour to 1 minute.
  • Before percolate, the Compass team had a batch job that pulled new property records against searches, but now they can use push-based notification.

Compass has a great growth path as their platform scales to more listings, agents, and users. They plan to develop the following features:

  • Real-time streaming
  • AI for search ranking
  • Personalized search through AI

Conclusion

In this post, we explained how Compass uses Amazon ES to bring their customers relevant results for their real estate needs. Whether you’re searching in real time for your next listing or using Compass’s saved search to monitor the market, Amazon ES delivers the results you need.

With the effort they saved from managing their Lucene infrastructure, Compass has focused on their business and engineering, which has opened up new opportunities for them.

About the Author

Sakti Mishra is a Data Lab Solutions Architect at AWS. He helps customers architect data analytics solutions, which gives them an accelerated path towards modernization initiatives.

Outside of work, Sakti enjoys learning new technologies, watching movies, and travel.

Using Random Cut Forests for real-time anomaly detection in Amazon Elasticsearch Service

Post Syndicated from Chris Swierczewski original https://aws.amazon.com/blogs/big-data/using-random-cut-forests-for-real-time-anomaly-detection-in-amazon-elasticsearch-service/

Anomaly detection is a rich field of machine learning. Many mathematical and statistical techniques have been used to discover outliers in data, and as a result, many algorithms have been developed for performing anomaly detection in a computational setting. In this post, we take a close look at the output and accuracy of the anomaly detection feature available in Amazon Elasticsearch Service and Open Distro for Elasticsearch, and provide insight as to why we chose Random Cut Forests (RCF) as the core anomaly detection algorithm. In particular, we:

  • Discuss the goals of anomaly detection
  • Share how to use the RCF algorithm to detect anomalies and why we chose RCF for this tool
  • Interpret the output of anomaly detection for Elasticsearch
  • Compare the results of the anomaly detector to commonly used methods

What is anomaly detection?

Human beings have excellent intuition and can detect when something is out of order. Often, an anomaly or outlier can appear so obvious you just “know it when you see it.” However, you can’t base computational approaches to anomaly detection on such intuition; you must ground them in mathematical definitions of anomalies.

Mathematical definitions of an anomaly vary, but they typically address the notion of separation from normal observations. This separation can manifest in several ways via multiple definitions. One common definition is “a data point lying in a low-density region.” As you track a data source, such as total bytes transferred from a particular IP address, number of logins on a given website, or number of sales per minute of a particular product, the raw values describe some probability or density distribution. A high-density region in this value distribution is an area of the domain where a data point is highly likely to exist. A low-density region is where data tends not to appear. For more information, see Anomaly detection: A survey.

For example, the following image shows two-dimensional data with a contour map indicating the density of the data in that region.

The data point in the bottom-right corner of the image occurs in a low-density region and, therefore, is considered anomalous. This doesn’t necessarily mean that an anomaly is something bad. Rather, under this definition, you can describe an anomaly as behavior that rarely occurs or is outside the normal scope of behavior.

Random Cut Forests and anomaly thresholding

The algorithmic core of the anomaly detection feature consists of two main components:

  • A RCF model for estimating the density of an input data stream
  • A thresholding model for determining if a point should be labeled as anomalous

You can use the RCF algorithm to summarize a data stream, including efficiently estimating its data density, and convert the data into anomaly scores. Anomaly scores are positive real numbers such that the larger the number, the more anomalous the data point. For more information, see Real Time Anomaly Detection in Open Distro for Elasticsearch.

We chose RCF for this plugin for several reasons:

  • Streaming context – Elasticsearch feature queries are streaming, in that the anomaly detector receives each new feature aggregate one at a time.
  • Expensive queries – Especially on a large cluster, each feature query may be costly in CPU and memory resources. This limits the amount of historical data we can obtain for model training and initialization.
  • Customer hardware – Our anomaly detection plugin runs on the same hardware as our customers’ Elasticsearch cluster. Therefore, we must be mindful of our plugin’s CPU and memory impact.
  • Scalability – The work required to determine whether data is anomalous should be distributable across the nodes in the cluster.
  • Unlabeled data – Even for training purposes, we don’t have access to labeled data. Therefore, the algorithm must be unsupervised.

Based on these constraints and performance results from internal and publicly available benchmarks across many data domains, we chose the RCF algorithm for computing anomaly scores in data streams.

This raises the question: How large of an anomaly score is large enough to declare the corresponding data point as an anomaly? The anomaly detector uses a thresholding model to answer this question. This thresholding model combines information from the anomaly scores observed thus far and certain mathematical properties of RCFs. This hybrid information approach allows the model to make anomaly predictions with a low false positive rate when relatively little data has been observed, and effectively adapts to the data in the long run. The model constructs an efficient sketch of the anomaly score distribution using the KLL Quantile Sketch algorithm. For more information, see Optimal Quantile Approximation in Streams.
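
The exact thresholding logic lives inside the plugin, but the following sketch conveys the general idea: compare each new RCF score against a high quantile of recently observed scores. It uses a simple bounded buffer and a numpy quantile in place of a true KLL sketch, and the window and quantile values are illustrative.

from collections import deque
import numpy as np

class SimpleThresholder:
    # Illustrative stand-in for the plugin's thresholding model: flags a point
    # when its RCF anomaly score exceeds a high quantile of recent scores.
    def __init__(self, window=1000, quantile=0.995, min_scores=100):
        self.scores = deque(maxlen=window)  # bounded memory, like a sketch
        self.quantile = quantile
        self.min_scores = min_scores

    def update(self, score):
        is_anomaly, grade = False, 0.0
        if len(self.scores) >= self.min_scores:
            threshold = np.quantile(self.scores, self.quantile)
            if score > threshold:
                is_anomaly = True
                # Crude severity measure: how far the score sits above the threshold.
                grade = min(1.0, (score - threshold) / max(threshold, 1e-9))
        self.scores.append(score)
        return is_anomaly, grade

In the actual feature, the anomaly grade and confidence are derived from RCF-specific statistics rather than this simple ratio.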

Understanding the output

The anomaly detector outputs two values: an anomaly grade and a confidence score. The anomaly grade is a measurement of the severity of an anomaly on a scale from zero to one. A zero anomaly grade indicates that the corresponding data point is normal. Any non-zero grade means that the anomaly score output by RCF exceeds the calculated score threshold, and therefore indicates the presence of an anomaly. Using the mathematical definition introduced at the beginning of this post, the grade is inversely related to the anomaly’s density; that is, the rarer the event, the higher the corresponding anomaly grade.

The confidence score is a measurement of the probability that the anomaly detection model correctly reports an anomaly within the algorithm’s inherent error bounds. We derive the model confidence from three sources:

  • A statistical measurement of whether the RCF model has observed enough data. As the RCF model observes more data, this source of confidence approaches 100%.
  • A confidence upper bound comes from the approximation made by the distribution sketch in the thresholding model. The KLL algorithm can only predict the score threshold within particular error bounds with a certain probability.
  • A confidence measurement attributed to each node of the Elasticsearch cluster. If a node is lost, the corresponding model data is also lost, which leads to a temporary confidence drop.

The NYC Taxi dataset

We demonstrate the effectiveness of the new anomaly detector feature on the New York City taxi passenger dataset. The dataset contains six months of New York City taxi ridership volume aggregated into 30-minute windows. Thankfully, the dataset comes with labels in the form of anomaly windows, which indicate a period of time when an anomalous event is known to occur. Example known events in this dataset are the New York City Marathon, where taxi ridership uncharacteristically spiked shortly after the event ended, and the January 2015 North American blizzard, when at one point the city ordered all non-essential vehicles off the streets, which resulted in a significant drop in taxi ridership.

We compare the results of our anomaly detector to two common approaches for detecting anomalies: a rules-based approach and the Gaussian distribution method.

Rules-based approach

In a rules-based approach, you mark a data point as anomalous if it exceeds a preset, human-specified boundary. This approach requires significant domain knowledge of the incoming data and can break down if the data has any upward or downward trends.

The following graph is a plot of the NYC taxi dataset with known anomalous event periods indicated by shaded regions. The model’s anomaly detection output is shown in red below the taxi ridership values. A set of human labelers received the first month of data (1,500 data points) to define anomaly detection rules. Half of the participants responded by stating they didn’t have sufficient information to confidently define such rules. From the remaining responses, the consolidated rule marks a point as anomalous if the value is equal to or greater than 30,000, or if the value stays below 20,000 for 150 consecutive points (about three days).
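
For illustration, this consolidated rule is only a few lines of Python (the thresholds are taken from the annotators' responses; loading the ridership values is assumed):

def rules_based_anomalies(values, high=30_000, low=20_000, run_length=150):
    # Flag a point if its value is at least `high`, or if it is the latest of
    # `run_length` consecutive values below `low` (about three days of
    # 30-minute windows).
    flags = []
    below_run = 0
    for value in values:
        below_run = below_run + 1 if value < low else 0
        flags.append(value >= high or below_run >= run_length)
    return flags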

In this use case, the human annotators do a good enough job in this particular range of data. However, this approach to anomaly detection doesn’t scale well and may require a large amount of training data before a human can set reasonable thresholds that don’t suffer from a high false positive or high false negative rate. Additionally, as mentioned earlier, if this data develops an upward or downward trend, our team of annotators needs to revisit these constant-value thresholds.

Gaussian distribution method

A second common approach is to fit a Gaussian distribution to the data and define an anomaly as any value that is three standard deviations away from the mean. To improve the model’s ability to adapt to new information, the distribution is typically fit on a sliding window of the observations. Here, we determine the mean and standard deviation from the 1,500 most recent data points and use these to make predictions on the current value. See the following graph.

The Gaussian model detects the clear ridership spike at the marathon but isn’t robust enough to capture the other anomalies. In general, such a model can’t capture certain kinds of temporal anomalies where, for example, there’s a sudden spike in the data or other change in behavior that is still within the normal range of values.
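
For reference, this baseline is also straightforward to implement; the sketch below assumes the ridership counts are already loaded as a sequence, with the 1,500-point window and three-standard-deviation rule described above.

import numpy as np

def gaussian_anomalies(values, window=1500, k=3.0):
    # Flag a point if it falls more than k standard deviations from the mean
    # of the previous `window` observations (a rolling three-sigma rule).
    values = np.asarray(values, dtype=float)
    flags = np.zeros(len(values), dtype=bool)
    for i in range(window, len(values)):
        recent = values[i - window:i]
        mu, sigma = recent.mean(), recent.std()
        flags[i] = abs(values[i] - mu) > k * sigma
    return flags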

Anomaly detection tool

Finally, we look at the anomaly detection results from the anomaly detection tool. The taxi data is streamed into an RCF model that estimates the density of the data in real time. The RCF sends these anomaly scores to the thresholding model, which decides whether the corresponding data point is anomalous. If so, the model reports the severity of the anomaly in the anomaly grade. See the following graph.

Five out of seven of the known anomalous events are successfully detected with zero false positives. Furthermore, with our definition of anomaly grade, we can indicate which anomalies are more severe than others. For example, the NYC Marathon spike is much more severe than those of Labor Day and New Year’s Eve. Based on the definition of an anomaly in terms of data density, the behavior observed at the NYC Marathon lives in a very low-density region, whereas, by the time we see the New Year’s Eve spike, this kind of behavior is still rare but not as rare anymore.

Summary

In this post, you learned about the goals of anomaly detection and explored the details of the model and output of the anomaly detection feature, now available in Amazon ES and Open Distro for Elasticsearch. We also compared the results of the anomaly detection tool to two common models and observed considerable performance improvement.


About the Authors

Chris Swierczewski is an applied scientist in Amazon AI. He enjoys hiking and backpacking with his family.

Lai Jiang is a software engineer working on machine learning and Elasticsearch at Amazon Web Services. His primary interests are algorithms and math. He is an active contributor to Open Distro for Elasticsearch.

Moving to managed: The case for the Amazon Elasticsearch Service

Post Syndicated from Kevin Fallis original https://aws.amazon.com/blogs/big-data/moving-to-managed-the-case-for-amazon-elasticsearch-service/

Prior to joining AWS, I led a development team that built mobile advertising solutions with Elasticsearch. Elasticsearch is a popular open-source search and analytics engine for log analytics, real-time application monitoring, clickstream analysis, and (of course) search. The platform I was responsible for was essential to driving my company’s business.

My team ran a self-managed implementation of Elasticsearch on AWS. At the time, a managed Elasticsearch offering wasn’t available. We had to build the scripting and tooling to deploy our Elasticsearch clusters across three geographical regions. This included the following tasks (and more):

  • Configuring the networking, routing, and firewall rules to enable the cluster to communicate.
  • Securing Elasticsearch management APIs from unauthorized access.
  • Creating load balancers for request distribution across data nodes.
  • Creating automatic scaling groups to replace the instances if there were issues.
  • Automating the configuration.
  • Managing the upgrades for security concerns.

If I had one word to describe the experience, it would be “painful.” Deploying and managing your own Elasticsearch clusters at scale takes a lot of time and knowledge to do it properly. Perhaps most importantly, it took my engineers away from doing what they do best—innovating and producing solutions for our customers.

Amazon Elasticsearch Service (Amazon ES) launched on October 1, 2015, almost 2 years after I joined AWS. Almost 5 years later, Amazon ES is in the best position ever to provide you with a compelling feature set that enables your search-related needs. With Amazon ES, you get a fully managed service that makes it easy to deploy, operate, and scale Elasticsearch clusters securely and cost-effectively in the AWS Cloud. Amazon ES offers direct access to the Elasticsearch APIs, which makes your existing code and applications using Elasticsearch work seamlessly with the service.

Amazon ES provisions all the resources for your Elasticsearch cluster and launches it in any Region of your choice within minutes. It automatically detects and replaces failed Elasticsearch nodes, which reduces the overhead associated with self-managed infrastructures. You can scale your cluster horizontally or vertically, up to 3 PB of data, with zero downtime through a single API call or a few clicks on the AWS Management Console. With this flexibility, Amazon ES can support any workload from single-node development clusters to production-scale, multi-node clusters.

Amazon ES also provides a robust set of Kibana plugins, free of any licensing fees. Features like fine-grained access control, alerting, index state management, and SQL support are just a few of the many examples. The Amazon ES feature set originates from the needs of customers such as yourself and through open-source community initiatives such as Open Distro for Elasticsearch (ODFE).

You need to factor several considerations into your decision to move to a managed service. Obviously, you want your teams focused on doing meaningful work that propels the growth of your company. Deciding which processes to offload to a managed service and which are best self-managed can be a challenge. Based on my experience managing Elasticsearch at my prior employer, and having worked with thousands of customers who have migrated to AWS, I consider the following sections important topics for you to review.

Workloads

Before migrating to a managed service, you might look to what others are doing in their “vertical,” whether it be in finance, telecommunications, legal, ecommerce, manufacturing, or any number of other markets. You can take comfort in knowing that thousands of customers across these verticals successfully deploy their search, log analytics, SIEM, and other workloads on Amazon ES.

Elasticsearch is, first and foremost, a search engine. Compass uses Amazon ES to scale their search infrastructure and build a complete, scalable, real estate home-search solution. By using industry-leading search and analytical tools, they make every listing in the company’s catalog discoverable to consumers and help real estate professionals find, market, and sell homes faster.

With powerful tools such as aggregations and alerting, Elasticsearch is widely used for log analytics workloads to gain insights into operational activities. As Intuit moves to a cloud hosting architecture, the company is on an “observability” journey to transform the way it monitors its applications’ health. Intuit used Amazon ES to build an observability solution, which provides visibility to its operational health across the platform, from containers to serverless applications.

When it comes to security, Sophos is a worldwide leader in next-generation cybersecurity, and protects its customers from today’s most advanced cyber threats. Sophos developed a large-scale security monitoring and alerting system using Amazon ES and other AWS components because they know Amazon ES is well suited for security use cases at scale.

Whether it be finding a home, detecting security events, or assisting developers with finding issues in applications, Amazon ES supports a wide range of use cases and workloads.

Cost

Any discussion around operational best practices has to factor in cost. With Amazon ES, you can select the optimal instance type and storage option for your workload with a few clicks on the console. If you’re uncertain of your compute and storage requirements, Amazon ES has on-demand pricing with no upfront costs or long-term commitments. When you know your workload requirements, you can lock in significant cost savings with Reserved Instance pricing for Amazon ES.

Compute and infrastructure costs are just one part of the equation. At AWS, we encourage customers to evaluate their Total Cost of Ownership (TCO) when comparing solutions. As an organizational decision-maker, you have to consider all the related cost benefits when choosing to replace your self-managed environment. Some of the factors I encourage my customers to consider are:

  • How much are you paying to manage the operation of your cluster 24/7/365?
  • How much do you spend on building the operational components, such as support processes, as well as automated and manual remediation procedures for clusters in your environment?
  • What are the license costs for advanced features?
  • What costs do you pay for networking between the clusters or the DNS services to expose your offerings?
  • How much do you spend on backup processes and how quickly can you recover from any failures?

The beauty of Amazon ES is that you do not need to focus on these issues. Amazon ES provides operational teams to manage your clusters, automated hourly backups of your data retained for 14 days, automated remediation of cluster events, and incremental, license-free features as one of the basic tenets of the service.

You also need to pay particularly close attention to managing the cost of storing your data in Elasticsearch. In the past, to keep storage costs from getting out of control, self-managed Elasticsearch users had to rely on solutions that were complicated to manage across data tiers and in some cases didn’t give them quick access to that data. AWS solved this problem with UltraWarm, a new, low-cost storage tier. UltraWarm lets you store and interactively analyze your data, backed by Amazon Simple Storage Service (Amazon S3) using Elasticsearch and Kibana, while reducing your cost per GB by almost 90% over existing hot storage options.

Security

In my conversations with customers, their primary concern is security. One data breach can cost millions and forever damage a company’s reputation. Providing you with the tools to secure your data is a critical component of our service, and Amazon ES gives you several ways to protect the data in your domains.

Many customers want to have a single sign-on environment when integrating with Kibana. Amazon ES offers Amazon Cognito authentication for Kibana. You can choose to integrate identity providers such as AWS Single Sign-On, PingFederate, Okta, and others. For more information, see Integrating Third-Party SAML Identity Providers with Amazon Cognito User Pools.

Recently, Amazon ES introduced fine-grained access control (FGAC). FGAC provides granular control of your data on Amazon ES. For example, depending on who makes the request, you might want a search to return results from only one index. You might want to hide certain fields in your documents or exclude certain documents altogether. FGAC gives you the power to control who sees which data in your Amazon ES domain.

Compliance

Many organizations need to adhere to a number of compliance standards. Those that have experienced auditing and certification activities know that ensuring compliance is an expensive, complex, and long process. However, by using Amazon ES, you benefit from the work AWS has done to ensure compliance with a number of important standards. Amazon ES is PCI DSS, SOC, ISO, and FedRAMP compliant to help you meet industry-specific or regulatory requirements. Because Amazon ES is a HIPAA-eligible service, you can process, store, and transmit PHI, which helps you accelerate the development of these sensitive workloads.

Amazon ES is part of the services in scope of the most recent assessment. You can build solutions on top of Amazon ES with the knowledge that independent auditors acknowledge that the service meets the bar for these important industry standards.

Availability and resiliency

When you build an Elasticsearch deployment either on premises or in cloud environments, you need to think about how your implementation can survive failures. You also need to figure out how you can recover from failures when they occur. At AWS, we like to plan for the fact that things do break, such as hardware failures and disk failures, to name a few.

Unlike virtually every other technology infrastructure provider, each AWS Region has multiple Availability Zones. Each Availability Zone consists of one or more data centers, physically separated from one another, with redundant power and networking. For high availability and performance of your applications, you can deploy applications across multiple Availability Zones in the same Region for fault tolerance and low latency. Availability Zones interconnect with fast, private fiber-optic networking, which enables you to design applications that automatically fail over between Availability Zones without interruption. Availability Zones are more highly available, fault tolerant and scalable than traditional single or multiple data center infrastructures.

Amazon ES offers you the option to deploy your instances across one, two, or three AZs. If you’re running development or test workloads, pick the single-AZ option. Those running production-grade workloads should use two or three Availability Zones.

For more information, see Increase availability for Amazon Elasticsearch Service by Deploying in three Availability Zones. Additionally, deploying in multiple Availability Zones with dedicated master nodes means that you get the benefit of the Amazon ES SLA.

Operations

A 24/7/365 operational team with experience managing thousands of Elasticsearch clusters around the globe monitors Amazon ES. If you need support, you can get expert guidance and assistance across technologies from AWS Support to achieve your objectives faster at lower costs. I want to underline the importance of having a single source for support for your cloud infrastructure. Amazon ES doesn’t run in isolation, and having support for your entire cloud infrastructure from a single source greatly simplifies the support process. AWS also provides you with the option to use Enterprise level support plans, where you can have a dedicated technical account manager who essentially becomes a member of your team and is committed to your success with AWS.

Using tools on Amazon ES such as alerting, which provides you with a means to take action on events in your data, and index state management, which enables you to automate activities like rolling off logs, gives you additional operation features that you don’t need to build.

When it comes to monitoring your deployments, Amazon ES provides you with a wealth of Amazon CloudWatch metrics with which you can monitor all your Amazon ES deployments within a “single pane of glass.” For more information, see Monitoring Cluster Metrics with Amazon CloudWatch.

Staying current is another important topic. To enable access to newer versions of Elasticsearch and Kibana, Amazon ES offers in-place Elasticsearch upgrades for domains that run versions 5.1 and later. Amazon ES provides access to the most stable and current versions in the open-source community as long as the distribution passes our stringent security evaluations. Our service prides itself on the fact that we offer you a version that has passed our own internal AWS security reviews.

AWS integrations and other benefits

AWS has a broad range of services that seamlessly integrate with Amazon ES. Like many customers, you may want to monitor the health and performance of your native cloud services on AWS. Most AWS services log events into Amazon CloudWatch Logs. You can configure a log group to stream data it receives to your Amazon ES domain through a CloudWatch Logs subscription.

The volume of log data can be highly variable, and you should consider buffering layers when operating at a large scale. Buffering allows you to design stability into your processes. When designing for scale, this is one of the easiest ways I know to avoid overwhelming your cluster with spiky ingestion events. Amazon Kinesis Data Firehose has a direct integration with Amazon ES and offers buffering and retries as part of the service. You configure Amazon ES as a destination through a few simple settings, and data can begin streaming to your Amazon ES domain.
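
For example, a delivery stream that buffers records and writes them to an Amazon ES domain can be created with a single SDK call. The following boto3 sketch shows the shape of that call; the role, domain, and backup bucket ARNs are placeholders, and you should confirm the buffering values against your own throughput.

import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

# Placeholders: substitute your own IAM role, domain, and backup bucket ARNs.
firehose.create_delivery_stream(
    DeliveryStreamName="logs-to-es",
    DeliveryStreamType="DirectPut",
    ElasticsearchDestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-to-es",
        "DomainARN": "arn:aws:es:us-east-1:123456789012:domain/my-domain",
        "IndexName": "app-logs",
        "IndexRotationPeriod": "OneDay",
        # Buffer up to 5 MiB or 60 seconds of records before each delivery.
        "BufferingHints": {"IntervalInSeconds": 60, "SizeInMBs": 5},
        "RetryOptions": {"DurationInSeconds": 300},
        "S3BackupMode": "FailedDocumentsOnly",
        "S3Configuration": {
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-to-es",
            "BucketARN": "arn:aws:s3:::my-firehose-backup-bucket",
        },
    },
)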

Increased speed and agility

When building new products and tuning existing solutions, you need to be able to experiment. As part of that experimentation, failing fast is an accepted process that gives your team the ability to try new approaches to speed up the pace of innovation. Part of that process involves using services that allow you to create environments quickly and, should the experiment fail, start over with a new approach or use different features that ultimately allow you to achieve the desired results.

With Amazon ES, you receive the benefit of being able to provision an entire Elasticsearch cluster, complete with Kibana, on the “order of minutes” in a secure, managed environment. If your testing doesn’t produce the desired results, you can change the dimensions of your cluster horizontally or vertically using different instance offerings within the service via a single API call or a few clicks on the console.

When it comes to deploying your environment, native tools such as Amazon CloudFormation provide you with deployment tooling that gives you the ability to create entire environments through configuration scripting via JSON or YAML. The AWS Command Line Interface (AWS CLI) provides command line tooling that also can spin up domains with a small set of commands. For those who want to ride the latest wave in scripting their environments, the AWS CDK has a module for Amazon ES.
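
The same domain configuration is also available through the AWS SDKs. As a rough sketch with boto3 (the domain name, version, instance sizes, and volume size are illustrative; check the API reference for the full set of options):

import boto3

es = boto3.client("es", region_name="us-east-1")

# Illustrative three-AZ domain with dedicated masters and EBS storage.
es.create_elasticsearch_domain(
    DomainName="my-search-domain",
    ElasticsearchVersion="7.4",
    ElasticsearchClusterConfig={
        "InstanceType": "r5.large.elasticsearch",
        "InstanceCount": 3,
        "DedicatedMasterEnabled": True,
        "DedicatedMasterType": "c5.large.elasticsearch",
        "DedicatedMasterCount": 3,
        "ZoneAwarenessEnabled": True,
        "ZoneAwarenessConfig": {"AvailabilityZoneCount": 3},
    },
    EBSOptions={"EBSEnabled": True, "VolumeType": "gp2", "VolumeSize": 100},
)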

Conclusion

Focusing your teams on doing important, innovative work creating products and services that differentiate your company is critical. Amazon ES is an essential tool to provide operational stability, security, and performance of your search and analytics infrastructure. When you consider the following benefits Amazon ES provides, the decision to migrate is simple:

  • Support for search, log analytics, SIEM, and other workloads
  • Innovative functionality using UltraWarm to help you manage your costs
  • Highly secure environments that address PCI and HIPAA workloads
  • Ability to offload operational processes to an experienced provider that knows how to operate Elasticsearch at scale
  • Plugins at no additional cost that provide fine-grained access control, vector-based similarity algorithms, or alerting and monitoring with the ability to automate incident response

You can get started using Amazon ES with the AWS Free Tier. This tier provides free usage of up to 750 hours per month of a t2.small.elasticsearch instance and 10 GB per month of optional EBS storage (Magnetic or General Purpose).

Over the course of the next few months, I’ll be co-hosting a series of posts that introduce migration patterns to help you move to Amazon ES. Additionally, AWS has a robust partner ecosystem and a professional services team that provide you with skilled and capable people to assist you with your migration.

About the Author

Kevin Fallis (@AWSCodeWarrior) is an AWS specialist search solutions architect.His passion at AWS is to help customers leverage the correct mix of AWS services to achieve success for their business goals. His after-work activities include family, DIY projects, carpentry, playing drums, and all things music.


Best practices for configuring your Amazon Elasticsearch Service domain

Post Syndicated from Jon Handler original https://aws.amazon.com/blogs/big-data/best-practices-for-configuring-your-amazon-elasticsearch-service-domain/

Amazon Elasticsearch Service (Amazon ES) is a fully managed service that makes it easy to deploy, secure, scale, and monitor your Elasticsearch cluster in the AWS Cloud. Elasticsearch is a distributed database solution, which can be difficult to plan for and execute. This post discusses some best practices for deploying Amazon ES domains.

The most important practice is to iterate. If you follow these best practices, you can plan for a baseline Amazon ES deployment. Elasticsearch behaves differently for every workload—its latency and throughput are largely determined by the request mix, the requests themselves, and the data or queries that you run. There is no deterministic rule that can 100% predict how your workload will behave. Plan for time to tune and refine your deployment, monitor your domain’s behavior, and adjust accordingly.

Deploying Amazon ES

Whether you deploy on the AWS Management Console, in AWS CloudFormation, or via Amazon ES APIs, you have a wealth of options to configure your domain’s hardware, high availability, and security features. This post covers best practices for choosing your data nodes and your dedicated master nodes configuration.

When you configure your Amazon ES domain, you choose the instance type and count for data and the dedicated master nodes. Elasticsearch is a distributed database that runs on a cluster of instances or nodes. These node types have different functions and require different sizing. Data nodes store the data in your indexes and process indexing and query requests. Dedicated master nodes don’t process these requests; they maintain the cluster state and orchestrate. This post focuses on instance types. For more information about instance sizing for data nodes, see Get started with Amazon Elasticsearch Service: T-shirt-size your domain. For more information about instance sizing for dedicated master nodes, see Get Started with Amazon Elasticsearch Service: Use Dedicated Master Instances to Improve Cluster Stability.

Amazon ES supports five instance classes: M, R, I, C, and T. As a best practice, use the latest generation instance type from each instance class. As of this writing, these are the M5, R5, I3, C5, and T2.

Choosing your instance type for data nodes

When choosing an instance type for your data nodes, bear in mind that these nodes carry all the data in your indexes (storage) and do all the processing for your requests (CPU). As a best practice, for heavy production workloads, choose the R5 or I3 instance type. If your emphasis is primarily on performance, the R5 typically delivers the best performance for log analytics workloads, and often for search workloads. The I3 instances are strong contenders and may suit your workload better, so you should test both. If your emphasis is on cost, the I3 instances have better cost efficiency at scale, especially if you choose to purchase reserved instances.

For an entry-level instance or a smaller workload, choose the M5s. The C5s are a specialized instance, relevant for heavy query use cases, which require more CPU work than disk or network. Use the T2 instances for development or QA workloads, but not for production. For more information about how many instances to choose, and a deeper analysis of the data handling footprint, see Get started with Amazon Elasticsearch Service: T-shirt-size your domain.

Choosing your instance type for dedicated master nodes

When choosing an instance type for your dedicated master nodes, keep in mind that these nodes are primarily CPU-bound, with some RAM and network demand as well. The C5 instances work best as dedicated masters for clusters of up to about 75 data nodes. Above that node count, you should choose R5.

Choosing Availability Zones

Amazon ES makes it easy to increase the availability of your cluster by using the Zone Awareness feature. You can choose to deploy your data and master nodes in one, two, or three Availability Zones. As a best practice, choose three Availability Zones for your production deployments.

When you choose more than one Availability Zone, Amazon ES deploys data nodes equally across the zones and makes sure that replicas go into different zones. Additionally, when you choose more than one Availability Zone, Amazon ES always deploys dedicated master nodes in three zones (if the Region supports three zones). Deploying into more than one Availability Zone gives your domain more stability and increases your availability.

Elasticsearch index and shard design

When you use Amazon ES, you send data to indexes in your cluster. An index is like a table in a relational database. Each search document is like a row, and each JSON field is like a column.

Amazon ES partitions your data into shards, with a random hash by default. You must configure the shard count, and you should use the best practices in this section.

Index patterns

For log analytics use cases, you want to control the life cycle of data in your cluster. You can do this with a rolling index pattern. Each day, you create a new index, then archive and delete the oldest index in the cluster. You define a retention period that controls how many days (indexes) of data you keep in the domain based on your analysis needs. For more information, see Index State Management.
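
As a sketch of what such a policy can look like (the endpoint, policy name, and 7-day retention are illustrative, and request signing is omitted), you can define an ISM policy that deletes indexes once they reach your retention age:

import requests

ES = "https://search-mydomain.us-east-1.es.amazonaws.com"  # placeholder endpoint

# Illustrative ISM policy: keep an index in the "hot" state for 7 days, then delete it.
policy = {
    "policy": {
        "description": "Delete rolling log indexes after 7 days",
        "default_state": "hot",
        "states": [
            {
                "name": "hot",
                "actions": [],
                "transitions": [
                    {"state_name": "delete", "conditions": {"min_index_age": "7d"}}
                ]
            },
            {
                "name": "delete",
                "actions": [{"delete": {}}],
                "transitions": []
            }
        ]
    }
}

requests.put(f"{ES}/_opendistro/_ism/policies/delete_after_7d", json=policy)

You then attach the policy to your rolling indexes so that each new index picks it up; see the ISM documentation for the attachment options available in your Elasticsearch version.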

Setting your shard counts

There are two types of shards: primary and replica. The primary shard count defines how many partitions of data Elasticsearch creates. The replica count specifies how many additional copies of the primary shards it creates. You set the primary shard count at index creation and can’t change it afterward (the _shrink and _split APIs can change it, but using them on clusters under load at scale isn’t recommended). You also set the replica count at index creation, but you can change the replica count on the fly and Elasticsearch adjusts accordingly by creating or removing replicas.

You can set the primary and replica shard counts if you create the index manually, with a POST command. A better way for log analytics is to set an index template. See the following code:

PUT _template/<template_name>
{
    "index_patterns": ["logs-*"],
    "settings": {
        "number_of_shards": 10,
        "number_of_replicas": 1
    },
    "mappings": {
        …
    }
}

When you set a template like this, every index that matches the index_pattern has the settings and the mapping (if you specify one) applied to that index. This gives you a convenient way of managing your shard strategy for rolling indexes. If you change your template, you get your new shard count in the next indexing cycle.

You should set the number_of_shards based on your source data size, using the following guideline: primary shard count = (daily source data in bytes * 1.25) / 50 GB.

For search use cases, where you’re not using rolling indexes, use 30 GB as the divisor, targeting 30 GB shards. However, these are guidelines. Always test with your own data, indexing, and queries to find your optimal shard size.

You should try to align your shard and instance counts so that your shards distribute equally across your nodes. You do this by adjusting shard counts or data node counts so that they are evenly divisible. For example, the default settings for Elasticsearch versions 6 and below are 5 primary shards and 1 replica (a total of 10 shards). You can get even distribution by choosing 2, 5, or 10 data nodes. Although it’s important to distribute your workload evenly on your data nodes, it’s not always possible to get every index deployed equally. Use the shard size as the primary guide for shard count and make small (< 20%) adjustments, generally favoring more instances or smaller shards, based on even distribution.

Determining storage size

So far, you’ve mapped out a shard count, based on the storage needed. Now you need to make sure that you have sufficient storage and CPU resources to process your requests. First, find your overall storage need: storage needed = (daily source data in bytes * 1.25) * (number_of_replicas + 1) * number of days retention.

You multiply your unreplicated index size by the number of replicas and days of retention to determine the total storage needed. Each replica adds an additional storage need equal to the primary storage size. You add this again for every day you want to retain data in the cluster. For search use cases, set the number of days of retention to 1.

The total storage need drives a minimum on the instance type and instance based on the maximum storage that instance provides. If you’re using EBS-backed instances like the M5 or R5, you can deploy EBS volumes up to the supported limit. For more information, see Amazon Elasticsearch Service Limits.

For instances with ephemeral store, storage is limited by the instance type (for example, I3.8xlarge.elasticsearch has 7.8 TB of attached storage). If you choose EBS, you should use the general purpose, GP2, volume type. Although the service does support the io1 volume type and provisioned IOPS, you generally don’t need them. Use provisioned IOPS only in special circumstances, when metrics support it.

Take the total storage needed and divide by the maximum storage per instance of your chosen instance type to get the minimum instance count.

After you have an instance type and count, make sure you have sufficient vCPUs to process your requests. Multiply the instance count by the vCPUs that instance provides. This gives you a total count of vCPUs in the cluster. As an initial scale point, make sure that your vCPU count is 1.5 times your active shard count. An active shard is any shard for an index that is receiving substantial writes. Use the primary shard count to determine active shards for indexes that are receiving substantial writes. For log analytics, only the current index is active. For search use cases, which are read heavy, use the primary shard count.

Although 1.5 is recommended, this is highly workload-dependent. Be sure to test and monitor CPU utilization and scale accordingly.

As you work with shard and instance counts, bear in mind that Amazon ES works best when the total shard count is as small as possible (fewer than 10,000 is a good soft limit). Each instance should also have no more than 25 shards per GB of JVM heap on that instance. For example, the R5.xlarge has 32 GB of RAM total. The service allocates half the RAM (16 GB) for the heap (the maximum heap size for any instance is 31.5 GB). You should never have more than 16 * 25 = 400 shards on any node in that cluster.
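
The following sketch pulls these guidelines together in one place. The formulas are the rules of thumb from this section, so treat the output as a starting point to iterate on rather than a hard answer, and round shard counts up for even distribution across your nodes and Availability Zones.

import math

def size_domain(daily_source_gb, replicas, retention_days,
                per_instance_storage_gb, per_instance_vcpus,
                target_shard_gb=50):
    # target_shard_gb: ~50 GB for rolling log indexes, ~30 GB for search use cases.
    index_gb = daily_source_gb * 1.25                       # source data plus indexing overhead
    primary_shards = math.ceil(index_gb / target_shard_gb)
    storage_gb = index_gb * (replicas + 1) * retention_days
    instances_for_storage = math.ceil(storage_gb / per_instance_storage_gb)

    # Active shards: the indexes receiving substantial writes (the current
    # rolling index), counting primaries and replicas.
    active_shards = primary_shards * (replicas + 1)
    vcpus_needed = math.ceil(active_shards * 1.5)
    instances_for_cpu = math.ceil(vcpus_needed / per_instance_vcpus)

    return {
        "primary_shards_per_index": primary_shards,
        "total_storage_gb": storage_gb,
        "minimum_instances": max(instances_for_storage, instances_for_cpu),
    }

# Example: 500 GB/day for one rolling index, 1 replica, 7-day retention,
# on instances with 3 TB of EBS and 8 vCPUs each (values are illustrative).
print(size_domain(500, replicas=1, retention_days=7,
                  per_instance_storage_gb=3072, per_instance_vcpus=8))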

Use case

Assume you have a log analytics workload supporting Apache web logs (500 GB/day) and syslogs (500 GB/day), retained for 7 days. This post focuses on the R5 instance type as the best choice for log analytics. You use a three-Availability Zone deployment, one primary and two replicas per index. With a three-zone deployment, you have to deploy nodes in multiples of three, which drives instance count and, to some extent, shard count.

The primary shard count for each index is (500 * 1.25) / 50 GB = 12.5 shards, which you round to 15. Using 15 primaries allows additional space to grow in each shard and is divisible by three (the number of Availability Zones, and therefore the number of instances, are a multiple of 3). The total storage needed is 1,000 * 1.25 * 3 * 7 = 26.25 TB. You can provide that storage with 18x R5.xlarge.elasticsearch, 9x R5.2xlarge.elasticsearch, or 6x R5.4xlarge.elasticsearch instances (based on EBS limits of 1.5 TB, 3 TB, and 6 TB, respectively). You should pick the 4xlarge instances, on the general guideline that vertical scaling is usually higher performance than horizontal scaling (there are many exceptions to this general rule, so make sure to iterate appropriately).

Having found a minimum deployment, you now need to validate the CPU count. Each index has 15 primary shards and 2 replicas, for a total of 45 shards. The most recent indexes receive substantial writes, so each has 45 active shards, giving a total of 90 active shards. You ignore the other 6 days of indexes because they are infrequently accessed. For log analytics, you can assume that your read volume is always low and drops off as the data ages. Each R5.4xlarge.elasticsearch has 16 vCPUs, for a total of 96 in your cluster. The best practice guideline calls for 90 * 1.5 = 135 vCPUs. As a starting scale point, you need to increase to 9x R5.4xlarge.elasticsearch, with 144 vCPUs. Again, testing may reveal that you’re over-provisioned (which is likely), and you may be able to reduce to six. Finally, given your data node and shard counts, provision 3x C5.large.elasticsearch dedicated master nodes.

Conclusion

This post covered some of the core best practices for deploying your Amazon ES domain. These guidelines give you a reasonable estimate of the number and type of data nodes. Stay tuned for subsequent posts that cover best practices for deploying secure domains, monitoring your domain’s performance, and ingesting data into your domain.

About the Author

Jon Handler (@_searchgeek) is a Principal Solutions Architect at Amazon Web Services based in Palo Alto, CA. Jon works closely with the CloudSearch and Elasticsearch teams, providing help and guidance to a broad range of customers who have search workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included four years of coding a large-scale, eCommerce search engine.

General Availability of UltraWarm for Amazon Elasticsearch Service

Post Syndicated from Martin Beeby original https://aws.amazon.com/blogs/aws/general-availability-of-ultrawarm-for-amazon-elasticsearch-service/

Today, we are happy to announce the general availability of UltraWarm for Amazon Elasticsearch Service.

This new low-cost storage tier provides fast, interactive analytics on up to three petabytes of log data at one-tenth of the cost of the current Amazon Elasticsearch Service storage tier.

UltraWarm complements the existing Amazon Elasticsearch Service hot storage tier by providing less expensive storage for older and less-frequently accessed data, while still ensuring the snappy, interactive experience that Amazon Elasticsearch Service customers have come to expect. UltraWarm stores data in Amazon S3 while using custom, highly optimized nodes, purpose-built on the AWS Nitro System, to cache, pre-fetch, and query that data.

There are many use cases for Amazon Elasticsearch Service, from building a search system for your website to storing and analyzing data from application or infrastructure logs. We think this new storage tier will work particularly well for customers that have large volumes of log data.

Amazon Elasticsearch Service is a popular service for log analytics because of its ability to ingest high volumes of log data and analyze it interactively. As more developers build applications using microservices and containers, there is explosive growth in log data. Storing and analyzing months’ or even years’ worth of data is cost-prohibitive at scale, and this has led customers to use multiple analytics tools, or delete valuable data, missing out on essential insights that the longer-term data could yield.

AWS built UltraWarm to solve this problem, ensuring that developers, DevOps engineers, and InfoSec experts can analyze recent and longer-term operational data without needing to spend days restoring data from archives to an active searchable state in an Amazon Elasticsearch Service cluster.

So let’s take a look at how you use this new storage tier by creating a new domain in the AWS Management Console.

First, I go to the Amazon Elasticsearch Service console and choose Create a new domain. This takes me through a workflow to set up a new cluster. For the most part, setting up a new domain with UltraWarm is identical to setting up a regular domain; I’ll point out the couple of things you need to do differently.

On Step 1 of the workflow, I click on the radio button to create a Production deployment type and click Next.

I continue to fill out the configuration in Step 2. Then, near the end, I check the box to Enable UltraWarm data nodes and select the Instance type I want to use. I go with the default ultrawarm1.medium.elasticsearch and ask for three of them; at least two UltraWarm nodes are required.

Everything else about the setup is identical to a regular Amazon Elasticsearch Service setup. After having set up the cluster, I then go to the dashboard and select my newly created domain. The dashboard confirms that my newly created domain has 3 UltraWarm data nodes, each with 1516 (GiB) free storage space.

As well as using UltraWarm on a new domain, you can also enable it for existing domains using the AWS Management Console, CLI, or SDK.
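
For example, with the AWS SDK for Python, enabling warm storage on an existing domain looks roughly like the following (the domain name and warm node count are illustrative; confirm the parameter names against the current API reference).

import boto3

es = boto3.client("es", region_name="us-east-1")

# Illustrative: add three ultrawarm1.medium UltraWarm nodes to an existing domain.
es.update_elasticsearch_domain_config(
    DomainName="my-domain",
    ElasticsearchClusterConfig={
        "WarmEnabled": True,
        "WarmType": "ultrawarm1.medium.elasticsearch",
        "WarmCount": 3,
    },
)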

Once you have UltraWarm nodes set up, you can migrate an index from hot to warm by using the following request.

POST _ultrawarm/migration/my-index/_warm

You can then check the status of the migration by using the following request.

GET _ultrawarm/migration/my-index/_status
{
  "migration_status": {
    "index": "my-index",
    "state": "RUNNING_SHARD_RELOCATION",
    "migration_type": "HOT_TO_WARM",
    "shard_level_status": {
      "running": 0,
      "total": 5,
      "pending": 3,
      "failed": 0,
      "succeeded": 2
    }
  }
}

UltraWarm is available today on Amazon Elasticsearch Service version 6.8 and above in 22 regions globally.

Happy Searching

— Martin

Creating a searchable enterprise document repository

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/creating-a-searchable-enterprise-document-repository/

Enterprise customers frequently have repositories with thousands of documents, images and other media. While these contain valuable information for users, it’s often hard to index and search this content. One challenge is interpreting the data intelligently, and another issue is processing these files at scale.

In this blog post, I show how you can deploy a serverless application that uses machine learning to interpret your documents. The architecture includes a queueing mechanism for handling large volumes, and posting the indexing metadata to an Amazon Elasticsearch domain. This solution is scalable and cost effective, and you can modify the functionality to meet your organization’s requirements.

The overall architecture for a searchable document repository solution.

The application takes unstructured data and applies machine learning to extract context for a search service, in this case Elasticsearch. There are many business uses for this design. For example, this could help Human Resources departments in searching for skills in PDF resumes. It can also be used by Marketing departments or creative groups with a large number of images, providing a simple way to search for items within the images.

As documents and images are added to the S3 bucket, these events invoke AWS Lambda functions. This runs custom code to extract data from the files, and also calls Amazon ML services for interpretation. For example, when adding a resume, the Lambda function extracts text from the PDF file while Amazon Comprehend determines the key phrases and topics in the document. For images, it uses Amazon Rekognition to determine the contents. In both cases, once it identifies the indexing attributes, it saves the data to Elasticsearch.
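
The GitHub repo contains the full implementation; the condensed sketch below only illustrates the general shape of one of these functions, with assumed bucket, index, and endpoint names and with request signing to Elasticsearch omitted.

import json
import urllib.parse

import boto3
import requests  # assumed to be packaged with the function; SigV4 signing omitted

s3 = boto3.client("s3")
comprehend = boto3.client("comprehend")
rekognition = boto3.client("rekognition")
ES_ENDPOINT = "https://search-docrepo-xxxx.us-east-1.es.amazonaws.com"  # placeholder

def handler(event, context):
    # Condensed, illustrative version of the parser/analyze functions in the repo.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])

    if key.lower().endswith((".jpg", ".jpeg")):
        # Images: label the contents with Amazon Rekognition.
        labels = rekognition.detect_labels(
            Image={"S3Object": {"Bucket": bucket, "Name": key}}, MaxLabels=10
        )["Labels"]
        attributes = {"labels": [label["Name"] for label in labels]}
    else:
        # Text extracted earlier in the pipeline: find key phrases with Comprehend.
        text = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        phrases = comprehend.detect_key_phrases(Text=text[:4500], LanguageCode="en")
        attributes = {"key_phrases": [p["Text"] for p in phrases["KeyPhrases"]]}

    # Save the indexing metadata to Elasticsearch.
    document = {"source_key": key, **attributes}
    requests.post(f"{ES_ENDPOINT}/documents/_doc", json=document)
    return {"statusCode": 200, "body": json.dumps(document)}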

The code uses the AWS Serverless Application Model (SAM), enabling you to deploy the application easily in your own AWS Account. This walkthrough creates resources covered in the AWS Free Tier but you may incur cost for large data imports. Additionally, it requires an Elasticsearch domain, which may incur cost on your AWS bill.

To set up the example application, visit the GitHub repo and follow the instructions in the README.md file.

Creating an Elasticsearch domain

This application requires an Elasticsearch development domain for testing purposes. To learn more about production configurations, see the Elasticsearch documentation.

To create the test domain:

  1. Navigate to the Amazon Elasticsearch console. Choose Create a new domain.
  2. For Deployment type, choose Development and testing. Choose Next.
  3. In the Configure Domain page:
    1. For Elasticsearch domain name, enter serverless-docrepo.
    2. Change Instance Type to t2.small.elasticsearch.
    3. Leave all the other defaults. Choose Next at the end of the page.
  4. In Network Configuration, choose Public access. This is adequate for a tutorial but in a production use-case, it’s recommended to use VPC access.
  5. Under Access Policy, in the Domain access policy dropdown, choose Custom access policy.
  6. Select IAM ARN and in the Enter principal field, enter your AWS account ID (learn how to find your AWS account ID). In the Select Action dropdown, select Allow.
  7. Under Encryption, leave HTTPS checked. Choose Next.
  8. On the Review page, review your domain configuration, and then choose Confirm.

Your domain is now being configured, and the Domain status shows Loading in the Overview tab.

After creating the Elasticsearch domain, the domain status shows ‘Loading’.

It takes 10-15 minutes to fully configure the domain. Wait until the Domain status shows Active before continuing. When the domain is ready, note the Endpoint address since you need this in the application deployment.

The Endpoint for an Elasticsearch domain.

Deploying the application

After cloning the repo to your local development machine, open a terminal window and change to the cloned directory.

  1. Run the SAM build process to create an installation package, and then deploy the application:
    sam build
    sam deploy --guided
  2. In the guided deployment process, enter unique names for the S3 buckets when prompted. For the ESdomain parameter, enter the Elasticsearch domain Endpoint from the previous section.
  3. After the deployment completes, note the Lambda function ARN shown in the output:
    Lambda function ARN output.
  4. Back in the Elasticsearch console, select the Actions dropdown and choose Modify Access Policy. Paste the Lambda function ARN as an AWS Principal in the JSON, in addition to the root user, as follows:
    Modify access policy for Elasticsearch.
  5. Choose Submit. This grants the Lambda function access to the Elasticsearch domain.

Testing the application

To test the application, you need a few test documents and images with the file types DOCX (Microsoft Word), PDF, and JPG. This walkthrough uses multiple files to illustrate the queuing process.

  1. Navigate to the S3 console and select the Documents bucket from your deployment.
  2. Choose Upload and select your sample PDF or DOCX files:
    Upload files to S3.
  3. Choose Next on the following three pages to complete the upload process. The application now analyzes these documents and adds the indexing information to Elasticsearch.

To query Elasticsearch, first you must generate an Access Key ID and Secret Access Key. For detailed steps, see this documentation on creating security credentials. Next, use Postman to create an HTTP request for the Elasticsearch domain:

  1. Download and install Postman.
  2. From Postman, enter the Elasticsearch endpoint, adding /_search?q=keyword. Replace keyword with a search term.
  3. On the Authorization tab, complete the Access Key ID and Secret Access Key fields with the credentials you created. For Service Name, enter es.
    Postman query with AWS authorization.
  4. Choose Send. Elasticsearch responds with document results matching your search term.
    REST response from Elasticsearch.

How this works

This application creates a processing pipeline between the originating Documents bucket and the Elasticsearch domain. Each document type has a custom parser, preparing the content in a Queuing bucket. It uses an Amazon SQS queue to buffer work, which is fetched by a Lambda function and analyzed with Amazon Comprehend. Finally, the indexing metadata is saved in Elasticsearch.

Serverless architecture for text extraction and classification.

  1. Documents and images are saved in the Documents bucket.
  2. Depending upon the file type, this triggers a custom parser Lambda function.
    1. For PDFs and DOCX files, the extracted text is stored in a Staging bucket. If the content is longer than 5,000 characters, it is broken into smaller files by a Lambda function, then saved in the Queued bucket.
    2. For JPG files, the parser uses Amazon Rekognition to detect the contents of the image. The labeling metadata is stored in the Queued bucket.
  3. When items are stored in the Queued bucket, this triggers a Lambda function to add the job to an SQS queue.
  4. The Analyze function is invoked when there are messages in the SQS queue. It uses Amazon Comprehend to find entities in the text. This function then stores the metadata in Elasticsearch.

S3 and Lambda both scale to handle the traffic. The Elasticsearch domain is not serverless, however, so it’s possible to overwhelm this instance with requests. There may be a large number of objects stored in the Documents bucket triggering the workflow, so the application uses an SQS queue to decouple the stages and smooth out the traffic. When large numbers of objects are processed, you see the Messages Available increase in the SQS queue:

SQS queue buffers messages to smooth out traffic.
For the Lambda function consuming messages from the SQS queue, the BatchSize configured in the SAM template controls the rate of processing. The function continues to fetch messages from the SQS queue while Messages Available is greater than zero. This can be a useful mechanism for protecting downstream services that are not serverless, or simply cannot scale to match the processing rates of the upstream application. In this case, it provides a more consistent flow of indexing data to the Elasticsearch domain.

In a production use-case, you would scale the Elasticsearch domain depending upon the load of search requests, not just the indexing traffic from this application. This tutorial uses a minimal Elasticsearch configuration, but this service is capable of supporting enterprise-scale use-cases.

Conclusion

Enterprise document repositories can be a rich source of information but can be difficult to search and index. In this blog post, I show how you can use a serverless approach to build a scalable solution easily. With minimal code, we can use Amazon ML services to create the indexing metadata. By using powerful image recognition and language comprehension capabilities, this makes the metadata more useful and the search solution more accurate.

This also shows how serverless solutions can be used with existing non-serverless infrastructure, like Elasticsearch. By decoupling the scalable serverless application from downstream services, you can protect those services from heavy traffic loads, even as Lambda scales up. Elasticsearch provides a fast, easy way to query your document repository once the serverless application has completed the indexing process.

To learn more about how to use Elasticsearch for production workloads, see the documentation on managing domains.

Announcing UltraWarm (Preview) for Amazon Elasticsearch Service

Post Syndicated from Steve Roberts original https://aws.amazon.com/blogs/aws/announcing-ultrawarm-preview-for-amazon-elasticsearch-service/

Today, we are excited to announce UltraWarm, a fully managed, low-cost, warm storage tier for Amazon Elasticsearch Service. UltraWarm is now available in preview and takes a new approach to providing hot-warm tiering in Amazon Elasticsearch Service, offering up to 900TB of storage, at almost a 90% cost reduction over existing options. UltraWarm is a seamless extension to the Amazon Elasticsearch Service experience, enabling you to query and visualize across both hot and UltraWarm data, all from your familiar Kibana interface. UltraWarm data can be queried using the same APIs and tools you use today, and also supports popular Amazon Elasticsearch Service features like encryption at rest and in flight, integrated alerting, SQL querying, and more.

A popular use case for our customers of Amazon Elasticsearch Service is to ingest and analyze high (and growing) volumes of machine-generated log data. However, those customers tell us that they want to perform real-time analysis on more of this data, so they can use it to help quickly resolve operational and security issues. Storage and analysis of months, or even years, of data has been cost prohibitive for them at scale, causing some to turn to multiple analytics tools, while others simply delete valuable data, missing out on insights. UltraWarm, with its cost-effective storage backed by Amazon Simple Storage Service (S3), helps solve this problem, enabling customers to retain years of data for analysis.

With the launch of UltraWarm, Amazon Elasticsearch Service supports two storage tiers, hot and UltraWarm. The hot tier is used for indexing, updating, and providing the fastest access to data. UltraWarm complements the hot tier to add support for high volumes of older, less-frequently accessed, data to enable you to take advantage of a lower storage cost. As I mentioned earlier, UltraWarm stores data in S3 and uses custom, highly-optimized nodes, built on the AWS Nitro System, to cache, pre-fetch, and query that data. This all contributes to providing an interactive experience when querying and visualizing data.

The UltraWarm preview is now available to all customers in the US East (N. Virginia) and US West (Oregon) Regions. The UltraWarm tier is available with a pay-as-you-go pricing model, charging for the instance hours for your node, and utilized storage. The UltraWarm preview can be enabled on new Amazon Elasticsearch Service version 6.8 domains. To learn more, visit the technical documentation.

— Steve

 

Analyzing AWS WAF logs with Amazon ES, Amazon Athena, and Amazon QuickSight

Post Syndicated from Aaron Franco original https://aws.amazon.com/blogs/big-data/analyzing-aws-waf-logs-with-amazon-es-amazon-athena-and-amazon-quicksight/

AWS WAF now includes the ability to log all web requests inspected by the service. AWS WAF can store these logs in an Amazon S3 bucket in the same Region, but most customers deploy AWS WAF across multiple Regions—wherever they also deploy applications. When analyzing web application security, organizations need the ability to gain a holistic view across all their deployed AWS WAF Regions.

This post presents a simple approach to aggregating AWS WAF logs into a central data lake repository, which lets teams better analyze and understand their organization’s security posture. I walk through the steps to aggregate regional AWS WAF logs into a dedicated S3 bucket. I follow that up by demonstrating how you can use Amazon ES to visualize the log data. I also present an option to offload and process historical data using AWS Glue ETL. With the data collected in one place, I finally show you how you can use Amazon Athena and Amazon QuickSight to query historical data and extract business insights.

Architecture overview

The case I highlight in this post is the forensic use of the AWS WAF access logs to identify distributed denial of service (DDoS) attacks by a client IP address. This solution provides your security teams with a view of all incoming requests hitting every AWS WAF in your infrastructure.

I investigate what the IP access patterns look like over time and assess which IP addresses access the site multiple times in a short period of time. This pattern suggests that the IP address could be an attacker. With this solution, you can identify DDoS attackers for a single application, and detect DDoS patterns across your entire global IT infrastructure.

Walkthrough

This solution requires separate tasks for architecture setup, which allows you to begin receiving log files in a centralized repository, and analytics, which processes your log data into useful results.

Prerequisites

To follow along, you must have the following resources:

  • Two AWS accounts. Following AWS multi-account best practices, create two accounts:
    • A logging account
    • A resource account that hosts the web applications using AWS WAF
    For more information about multi-account setup, see AWS Landing Zone. Using multiple accounts isolates your logs from your resource environments. This helps maintain the integrity of your log files and provides a central access point for auditing all application, network, and security logs.
  • The ability to launch new resources into your account. The resources might not be eligible for Free Tier usage and so might incur costs.
  • An application running with an Application Load Balancer, preferably in multiple Regions. If you do not already have one, you can launch any AWS web application reference architecture to test and implement this solution.

For this walkthrough, you can launch an Amazon ECS example from the ecs-refarch-cloudformation GitHub repo. This is a “one click to deploy” example that automatically sets up a web application with an Application Load Balancer. Launch this in two different Regions to simulate a global infrastructure. You ultimately set up a centralized bucket that both Regions log into, which your forensic analysis tools then draw from. Choose Launch Stack to launch the sample application in your Region of choice.

Setup

Architecture setup allows you to begin receiving log files in a centralized repository.

Step 1: Provide permissions

Begin this process by providing appropriate permissions for one account to access resources in another. Your resource account needs cross-account permission to access the bucket in the logging account.

  1. Create your central logging S3 bucket in the logging account and attach the following bucket policy to it under the Permissions tab. Make a note of the bucket’s ARN; you need this information for future steps.
  2. Change RESOURCE-ACCOUNT-ID and CENTRAL-LOGGING-BUCKET-ARN to the correct values for your accounts:
     // JSON Document
     {
       "Version": "2012-10-17",
       "Statement": [
          {
             "Sid": "Cross Account AWS WAF Account 1",
             "Effect": "Allow",
             "Principal": {
                "AWS": "arn:aws:iam::RESOURCE-ACCOUNT-ID:root"
             },
             "Action": [
                "s3:GetObject",
                "s3:PutObject"
             ],
             "Resource": [
                "CENTRAL-LOGGING-BUCKET-ARN/*"
             ]
          }
       ]
    }

Step 2: Manage Lambda permissions

Next, the Lambda function that you create in your resource account needs permissions to access the S3 bucket in your central logging account so it can write files to that location. You already provided basic cross-account access in the previous step, but Lambda still needs the granular permissions at the resources level. Remember to grant these permissions in both Regions where you launched the application that you intend to monitor with AWS WAF.

  1. Log in to your resource account.
  2. To create an IAM role for the Lambda function, in the IAM console, choose Policies, Create Policy.
  3. Choose JSON, and enter the following policy document. Replace YOUR-SOURCE-BUCKET and YOUR-DESTINATION-BUCKET with the ARNs of the buckets that you are using for this walkthrough.
    // JSON document
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListSourceAndDestinationBuckets",
                "Effect": "Allow",
                "Action": [
                    "s3:ListBucket",
                    "s3:ListBucketVersions"
                ],
                "Resource": [
                    "YOUR-SOURCE-BUCKET",
                    "YOUR-DESTINATION-BUCKET"
                ]
            },
            {
                "Sid": "SourceBucketGetObjectAccess",
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:GetObjectVersion"
                ],
                "Resource": "YOUR-SOURCE-BUCKET/*"
            },
            {
                "Sid": "DestinationBucketPutObjectAccess",
                "Effect": "Allow",
                "Action": [
                    "s3:PutObject"
                ],
                "Resource": "YOUR-DESTINATION-BUCKET/*"
            }
        ]
     }

  4. Choose Review policy, enter your policy name, and save it.
  5. With the policy created, create a new role for your Lambda function and attach the custom policy to that role. To do this, navigate back to the IAM dashboard.
  6. Select Create role and choose Lambda as the service that uses the role. Select the custom policy that you created earlier in this step and choose Next. You can add tags if required, and then name and create this new role.
  7. You must also add S3 as a trusted entity in the Trust Relationship section of the role. Choose Edit trust relationship and add s3.amazonaws.com to the policy, as shown in the following example.
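A trust policy along the following lines allows both Lambda and S3 to assume the role. This is a sketch of the expected result after the edit, not copied from the original post.

    // JSON document
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Service": [
                        "lambda.amazonaws.com",
                        "s3.amazonaws.com"
                    ]
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }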

Lambda and S3 now appear as trusted entities under the Trust relationships tab, as shown in the following screenshot.

Step 3: Create a Lambda function and copy log files

Create a Lambda function in the same Region as your resource account’s S3 bucket. This function reads log files from the resource account bucket and then copies that content to the logging account’s bucket. Repeat this step for every Region where you launched the application that you intend to monitor with AWS WAF.

  1. Log in to your resource account.
  2. Navigate to Lambda in your console and choose Create Function.
  3. Choose Author from scratch and name the function. Choose the IAM role that you created in the previous step and attach it to the Lambda function.
  4. Choose Create function.
  5. This Lambda function receives an event from S3 that contains nested JSON data. The function extracts the bucket name and object key from this event, then uses that information to copy the data to your central logging account bucket in the next step. To create this function, copy and paste the following code into the Lambda function that you created. Replace the bucket name placeholders with the names of the buckets that you created earlier. You modify this script later, after you decide on a partitioning strategy.
    // Load the AWS SDK
    const aws = require('aws-sdk');
    
    // Construct the AWS S3 Object 
    const s3 = new aws.S3();
    
    // Central logging bucket to copy objects into (replace with your bucket name)
    const destBucket = "YOUR-DESTINATION-BUCKET";
    
    //Main function
    exports.handler = (event, context, callback) => {
        console.log("Got WAF Item Event")
        var _srcBucket = event.Records[0].s3.bucket.name;
        let _key = event.Records[0].s3.object.key;
        let _keySplit = _key.split("/")
        let _objName = _keySplit[ (_keySplit.length - 1) ];
        let _destPath = _keySplit[0]+"/"+_keySplit[1]+"/YOUR-DESTINATION-BUCKET/"+_objName;
        let _sourcePath = _srcBucket + "/" + _key;
        console.log(_destPath)
        let params = { Bucket: destBucket, ACL: "bucket-owner-full-control", CopySource: _sourcePath, Key: _destPath };
        s3.copyObject(params, function(err, data) {
            if (err) {
                console.log(err, err.stack);
            } else {
                console.log("SUCCESS!");
            }
        });
        callback(null, 'All done!');
    };

Step 4: Set S3 to Lambda event triggers

This step sets up event triggers in your resource account’s S3 buckets. These triggers send the file name and location logged by AWS WAF logs to the Lambda function. The triggers also notify the Lambda function that it must move the newly arrived file into your central logging bucket. Repeat this step for every Region where you launched the application that you intend to monitor with AWS WAF.

  1. Go to the S3 dashboard and choose your S3 bucket, then choose the Properties tab. Under Advanced settings, choose Events.
  2. Give your event a name and select PUT from the Events check boxes.
  3. Choose Lambda from the Send To option and select your Lambda function as the destination for the event.
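The console steps above are equivalent to applying a notification configuration like the following to the bucket (for example, with aws s3api put-bucket-notification-configuration). This is only a sketch of the resulting configuration; the configuration ID and function ARN are placeholders.

    // JSON document - illustrative S3 notification configuration
    {
        "LambdaFunctionConfigurations": [
            {
                "Id": "waf-logs-to-central-bucket",
                "LambdaFunctionArn": "arn:aws:lambda:REGION:RESOURCE-ACCOUNT-ID:function:YOUR-FUNCTION-NAME",
                "Events": ["s3:ObjectCreated:Put"]
            }
        ]
    }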

Step 5: Add AWS WAF to the Application Load Balancer

Add an AWS WAF to the Application Load Balancer so that you can start logging events. You can optionally delete the original log file after Lambda copies it. This reduces costs, but your business and security needs might err on the side of retaining that data.

Create a separate prefix for each Region in your central logging account bucket waf-central-logs so that AWS Glue can properly partition them. For best practices of partitioning with AWS Glue, see Working with partitioned data in AWS Glue. AWS Glue ingests your data and stores it in a columnar format optimized for querying in Amazon Athena. This helps you visualize the data and investigate the potential attacks.

Repeat this step for every Region where you launched the application that you intend to monitor with AWS WAF. The procedure assumes that you already have an AWS WAF enabled that you can use for this exercise. To move forward with the next step, you need AWS WAF enabled and connected to Amazon Kinesis Data Firehose for log delivery.

Setting up and configuring AWS WAF

If you don’t already have a web ACL in place, set up and configure AWS WAF at this point. This solution handles logging data from multiple AWS WAF logs in multiple Regions from more than one account.

To do this efficiently, you should consider your partitioning strategy for the data. You can grant your security teams a comprehensive view of the network. Create each partition based on the Kinesis Data Firehose delivery stream for the specific AWS WAF associated with the Application Load Balancer. This partitioning strategy also allows the security team to view the logs by Region and by account. As a result, your S3 bucket name and prefix look similar to the following example:

s3://central-waf-logs/<account_id>/<region_name>/<kinesis_firehose_name>/...filename...

Step 6: Copying logs with Lambda code

This step updates the Lambda function to start copying log files. Keep your partitioning strategy in mind as you update the Lambda function. Repeat this step for every Region where you launched the application that you intend to monitor with AWS WAF.

To accommodate the partitioning, modify your Lambda code to match the examples in the GitHub repo.

Replace <kinesis_firehose_name> in the example code with the name of the Kinesis Data Firehose delivery stream attached to the AWS WAF. Replace <central logging bucket name> with the S3 bucket name from your central logging account.

Kinesis Data Firehose should now begin writing files to your central S3 logging bucket with the correct partitioning. To generate logs, access your web application.

Analytics

Now that Kinesis Data Firehose can write collected files into your logging account’s S3 bucket, create an Elasticsearch cluster in your logging account in the same Region as the central logging bucket. You also must create a Lambda function to handle S3 events as the central logging bucket receives new log files. This creates a connection between your central log files and your search engine. Amazon ES gives you the ability to query your logs quickly to look for potential security threats. The Lambda function loads the data into your Amazon ES cluster. Amazon ES also includes a tool named Kibana, which helps with managing data and creating visualizations.

Step 7: Create an Elasticsearch cluster

  1. In your central Logging Account, navigate to the Elasticsearch Service in the AWS Console.
  2. Select Create Cluster, enter a domain name for your cluster, and choose a recent Elasticsearch version from the dropdown. Choose Next. In this example, don’t implement any security policies for your cluster, and use only one instance. For any real-world production tasks, keep your Elasticsearch cluster inside your VPC.
  3. For network configuration, choose Public access and choose Next.
  4. For the access policy in this tutorial, allow access to the domain only from a specified account ID or ARN. In this case, use your account ID to gain access.
  5. Choose Next, and on the final screen, confirm your settings. You generally want to create strict access policies for your domain and not allow public access. This example uses these settings only to quickly demonstrate the capabilities of AWS services; never use them in a production environment.

AWS takes a few minutes to create and activate your Amazon ES domain. Once it goes live, you can see two endpoints. The domain endpoint URL is the URL you use to send data to the cluster.

Step 8: Create a Lambda function to copy log files

Add an event trigger to your central logs bucket. This trigger tells your Lambda function to write the data from the log file to Amazon ES. Before you create the S3 trigger, create a Lambda function in your logging account to handle the events.

For this Lambda function, we use code from the aws-samples GitHub repository that streams data from an S3 file line by line into Amazon ES. This example uses code taken from amazon-elasticsearch-lambda-samples. Name your new Lambda function myS3toES.

  1. Copy and paste the following code into a text file named index.js:
    // Load the AWS SDK and construct the S3 service object
    const aws = require('aws-sdk');
    const s3 = new aws.S3();
    
    // Destination bucket for the copy (replace with your central logging bucket name)
    const destBucket = "<central logging bucket name>";
    
    exports.handler = (event, context, callback) => {
        // get the source bucket name
        var _srcBucket = event.Records[0].s3.bucket.name;
            // get the object key of the file that landed on S3
        let _key = event.Records[0].s3.object.key;
        
        // split the key by "/"
        let _keySplit = _key.split("/")
            // get the object name
        let _objName = _keySplit[ (_keySplit.length - 1) ];
            // reset the destination path
        let _destPath = _keySplit[0]+"/"+_keySplit[1]+"/<kinesis_firehose_name>/"+_objName;
            // setup the source path
        let _sourcePath = _srcBucket + "/" + _key;
            // build the params for the copyObject request to S3
        let params = { Bucket: destBucket, ACL: "bucket-owner-full-control", CopySource: _sourcePath, Key: _destPath };
            // execute the copyObject request
        s3.copyObject(params, function(err, data) {
            if (err) {
                console.log(err, err.stack);
            } else {
                console.log("SUCCESS!");
            }
        });
        callback(null, 'All done!');
    };

  2. Copy and paste this code into a text file and name it package.json:
    //JSON Document
    {
      "name": "s3toesfunction",
      "version": "1.0.0",
      "description": "",
      "main": "index.js",
      "scripts": {},
      "author": "",
      "dependencies": {
        "byline": "^5.0.0",
        "clf-parser": "0.0.2",
        "path": "^0.12.7",    "stream": "0.0.2"
      }
    }

  3. Execute the following command in the folder containing these files: npm install
  4. After the installation completes, create a .zip file that includes the index.js file and the node_modules folder.
  5. Log in to your logging account.
  6. Upload your .zip file to the Lambda function. For Code entry type, choose Upload a .zip file.
  7. This Lambda function needs an appropriate service role with a trust relationship to S3. Choose Edit trust relationships and add s3.amazonaws.com and lambda.amazonaws.com as trusted entities.
  8. Set up your IAM role with the following permissions: S3 Read Only permissions and Lambda Basic Execution. To grant the role the appropriate access, assign it to the Lambda function from the Lambda Execution Role section in the console.
  9. Set environment variables for your Lambda function so it knows where to send the data: add an endpoint variable with the endpoint URL you created in Step 7, an index variable with your index name, and a region variable with the Region where you deployed your application. A short sketch of reading these values follows this list.
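For example, the function might read those settings like this. The environment variable names shown here are illustrative, not prescribed by the sample code.

    // Illustrative only: how the function might read the endpoint, index, and Region settings.
    const esDomain = {
        endpoint: process.env.ES_ENDPOINT, // endpoint URL from Step 7
        index: process.env.ES_INDEX,       // index name to write to
        region: process.env.ES_REGION      // Region where you deployed your application
    };
    console.log('Streaming to index', esDomain.index, 'at', esDomain.endpoint, 'in', esDomain.region);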

Step 9: Create an S3 trigger

After creating the Lambda function, create the event triggers on your S3 bucket to execute that function. This completes your log delivery pipeline to Amazon ES. This is a common pipeline architecture for streaming data from S3 into Amazon ES.

  1. Log in to your central logging account.
  2. Navigate to the S3 console, select your bucket, then open the Properties pane and scroll down to Events.
  3. Choose Add notification and name your new event s3toLambdaToEs.
  4. Under Events, select the check box for PUT. Leave Prefix and Suffix empty.
  5. Under Send to, select Lambda Function, and enter the name of the Lambda function that you created in the previous step—in this example, myS3toES.
  6. Choose Save.

With this complete, Lambda should start sending data to your Elasticsearch index whenever you access your web application.

Step 10: Configure Amazon ES

Your pipeline now automatically adds data to your Elasticsearch cluster. Next, use Kibana to visualize the AWS WAF logs in the central logging account’s S3 bucket. This is the final step in assembling your forensic investigation architecture.

Kibana provides tools to create visualizations and dashboards that help your security teams view log data. Using the log data, you can filter by IP address to see how many times an IP address has hit your firewall each month. This helps you track usage anomalies and isolate potentially malicious IP addresses. You can use this information to add web ACL rules to your firewall that adds extra protection against those IP addresses.

Kibana produces visualizations like the following screenshot.

In addition to the Number of IPs over Time visualization, you can also correlate the IP address to its country of origin. Correlation provides even more precise filtering for potential web ACL rules to protect against attackers. The visualization for that data looks like the following image.

Elasticsearch setup

To set up and visualize your AWS WAF data, follow this How to analyze AWS WAF logs using Amazon Elasticsearch Service post. With this solution, you can investigate your global dataset instead of isolated Regions.

An alternative to Amazon ES

Amazon ES is an excellent tool for forensic work because it provides high-performance search capability for large datasets. However, Amazon ES requires cluster management and complex capacity planning for future growth. To get top-notch performance from Amazon ES, you must adequately scale it. With the more straightforward data of these investigations, you could instead work with more traditional SQL queries.

Forensic data grows quickly, so using a relational database means you might quickly outgrow your capacity. Instead, take advantage of AWS serverless technologies like AWS Glue, Athena, and Amazon QuickSight. These technologies enable forensic analysis without the operational overhead you would experience with Elasticsearch or a relational database. To learn more about this option, consult posts like How to extract, transform, and load data from analytic processing using AWS Glue and Work with partitioned data in AWS Glue.

Athena query

With your forensic tools now in place, you can use Athena to query your data and analyze the results. This lets you refine the data for your Kibana visualizations, or directly load it into Amazon QuickSight for additional visualization. Use the Athena console to experiment until you have the best query for your visual needs. Having the database in your AWS Glue Catalog means you can make ad hoc queries in Athena to inspect your data.

In the Athena console, create a new Query tab and enter the following query:

# SQL Query
SELECT date_format(from_unixtime("timestamp"/1000), '%Y-%m-%d %h:%i:%s') as event_date, client_ip, country, account_id, waf_name, region FROM "paritionedlogdata"."waf_logs_transformed" where year='2018' and month='12';

Replace the database and table names ("paritionedlogdata"."waf_logs_transformed" in this example) with the appropriate values for your environment. This query converts the numerical timestamp to a readable date format using SQL date functions, as described in the Presto 0.176 documentation. It should return results similar to the following.

You can see which IP addresses hit your environment the most over any period of time. In a production environment, you would run an ETL job to re-partition this data and transform it into a columnar format optimized for queries. If you would like more information about doing that, see the Best Practices When Using Athena with AWS Glue post.
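For example, an Athena CTAS statement along these lines would write a Parquet copy of the table, partitioned by year and month. The target table name and S3 location are hypothetical; adjust the column list to match your schema.

# SQL Query (illustrative only)
CREATE TABLE waf_logs_parquet
WITH (
    format = 'PARQUET',
    external_location = 's3://central-waf-logs-processed/parquet/',
    partitioned_by = ARRAY['year', 'month']
) AS
SELECT client_ip, country, account_id, waf_name, region, "timestamp", year, month
FROM "paritionedlogdata"."waf_logs_transformed";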

Amazon QuickSight visualization

Now that you can query your data in Athena, you can visualize the results using Amazon QuickSight. First, grant Amazon QuickSight access to the S3 bucket where your Athena query results live.

  1. In the Amazon QuickSight console, log in.
  2. Choose Admin/username, Manage QuickSight.
  3. Choose Account settings, Security & permissions.
  4. Under QuickSight access to AWS services, choose Add or remove.
  5. Choose Amazon S3, then choose Select S3 buckets.
  6. Choose the output bucket for your central AWS WAF logs. Also, choose your Athena query results bucket. The query results bucket begins with aws-athena-query-results-*.

Amazon QuickSight can now access the data sources. To set up your visualizations, follow these steps:

  1. In the QuickSight console, choose Manage data, New data set.
  2. For Source, choose Athena.
  3. Give your new dataset a name and choose Validate connection.
  4. After you validate the connection, choose Create data source.
  5. Select Use custom SQL and give your SQL query a name.
  6. Input the same query that you used earlier in Athena, and choose Confirm query.
  7. Choose Import to SPICE for quicker analytics, Visualize.

Allow Amazon QuickSight several minutes to complete the import; it alerts you when it finishes.

Now that you have imported your data into your analysis, you can apply a visualization:

  1. In Amazon QuickSight, choose New analysis.
  2. Select the last dataset that you created earlier and choose Create analysis.
  3. At the bottom left of the screen, choose Line Chart.
  4. Drag and drop event_date to the X-Axis field well.
  5. Drag and drop client_ip to the Value field well. This should create a visualization similar to the following image.
  6. Choose the right arrow at the top left of the visualization and choose Hide “other” categories. This should modify your visualization to look like the following image.

You can also map the countries from which the requests originate, allowing you to track global access anomalies. You can do this in QuickSight by selecting the “Points on map” visualization type and choosing the country as the data point to visualize.

You can also add a count of IP addresses to see if you have any unusual access patterns originating from specific IP addresses.

Conclusion

Although Amazon ES and Amazon QuickSight offer similar final results, there are trade-offs to the technical approaches that I highlighted. If your use case requires the analysis of data in real time, then Amazon ES is more suitable for your needs. If you prefer a serverless approach that doesn’t require capacity planning or cluster management, then the solution with AWS Glue, Athena, and Amazon QuickSight is more suitable.

In this post, I described an easy way to build operational dashboards that track key metrics over time. Doing this with AWS Glue, Athena, and Amazon QuickSight relieves the heavy lifting of managing servers and infrastructure. To monitor metrics in real time instead, the Amazon ES solution provides a way to do this with little operational overhead. The key here is the adaptability of the solution: putting different services together can provide different solutions to your problems to fit your exact needs.

Hopefully, you have found this post informative and the proposed solutions intriguing. As always, AWS welcomes your feedback and comments.

 


About the Authors

Aaron Franco is a solutions architect at Amazon Web Services.

Setting alerts in Amazon Elasticsearch Service

Post Syndicated from Jon Handler original https://aws.amazon.com/blogs/big-data/setting-alerts-in-amazon-elasticsearch-service/

Customers often use Amazon Elasticsearch Service for log analytics. Amazon ES lets you collect logs from your infrastructure, transform each log line into a JSON document, and send those documents to the bulk API.

A transformed log line contains many fields, each containing values. For instance, an Apache web log line includes a source IP address field, a request URL field, and a status code field (among others). Many users build dashboards—using Kibana—to monitor their infrastructure visually, surfacing application usage, bugs, or security problems evident from the data in those fields. For example, you can graph the count of HTTP 5xx status codes, then watch and react to changes. If you see a sudden jump in 5xx codes, you likely have a server issue. But with this system, you must monitor Kibana manually.

On April 8, Amazon ES launched support for event monitoring and alerting. To use this feature, you work with monitors—scheduled jobs—that have triggers, which are specific conditions that you set, telling the monitor when it should send an alert. An alert is a notification that the triggering condition occurred. When a trigger fires, the monitor takes action, sending a message to your destination.

This post uses a simulated IoT device farm to generate and send data to Amazon ES.

Simulator overview

This simulation consists of two main parts: sensors and devices.

Sensors

The core class for the simulator is the sensor. Devices have sensors that simulate different patterns of floating-point values. When called, each sensor’s report method updates and returns the sensor’s value. There are several subclasses of Sensor:

  • SineSensor: Produces a sine wave, based on the current timestamp.
  • ConstSensor: Produces a constant value. The class includes a random “fuzz” factor to drift around a particular value.
  • DriftingSensor: Allows continuous, random drift with a starting value.
  • MonotonicSensor: Increments its value by a constant delta, with random fuzz.

For this post, I used MonotonicSensor, whose value constantly increases, to force a breach in an alert that I set up.

You can identify a sensor by a universally unique identifier (UUID) and a label for the metric that it tracks. The report function for the sensor class returns a timestamp, the UUID of the sensor, the metric label, and the metric’s value at that instant.
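To make the report contract concrete, here is a rough JavaScript sketch of a monotonic sensor. It is an illustration of the behavior described above, not the code from the sample repository.

    // Rough sketch of the sensor idea described above; this is not the code
    // from the sample repository, just an illustration of the report() contract.
    const { randomUUID } = require('crypto'); // requires Node.js 14.17 or later

    class MonotonicSensor {
        constructor(label, start, delta, fuzz) {
            this.id = randomUUID();   // UUID identifying the sensor
            this.label = label;       // metric label, for example 'CPU'
            this.value = start;
            this.delta = delta;       // constant increment per report
            this.fuzz = fuzz;         // amplitude of the random noise
        }

        // Returns a timestamp, the sensor UUID, the metric label, and the current value.
        report() {
            this.value += this.delta + (Math.random() - 0.5) * this.fuzz;
            return {
                timestamp: new Date().toISOString(),
                sensorId: this.id,
                metric: this.label,
                value: this.value
            };
        }
    }

    // Example: a CPU sensor that drifts steadily upward, as in the failing device.
    const cpu = new MonotonicSensor('CPU', 50, 0.5, 0.2);
    console.log(cpu.report());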

Devices

Devices are collections of sensors. For this post, I created a collection of devices that simulate IoT devices in a field, measuring temperature and humidity and reporting the device’s CPU usage. Each device has a report method that calls the report methods of all of its sensors, returning a collection of the sensor reports. I made the code available in the Open Distro for Elasticsearch sample code repository on GitHub.

I set the CPU sensor of one device to drift constantly upward, simulating a problem in the device. You can see the intended “bad behavior” in the following line graph:

In the next sections, I set up an alert at 90% CPU so that I can catch and correct the situation.

Prerequisites

To follow along with this solution, you need an AWS account. Set up your own Amazon ES domain, to form the basis of your monitors and alerts.

Step 1: Set up your destination

When you create alerts in Amazon ES, you assign one or more destinations. A destination is a delivery channel, where your domain sends notifications when your alerts trigger. You can use Amazon SNS, your Slack channel, or Amazon Chime as your destination. Or, you can set up a custom webhook (a URL) to receive messages. You set up headers and the message body, and Amazon ES Alerts posts the message to the destination URL.

I use SNS to receive alerts from my Amazon ES domain in this example, but Amazon ES provides many options for setting up your topic and subscription. I created a topic to receive notifications and subscribed to it for email delivery. Your SNS topic can have many subscriptions, supporting delivery via HTTP/S endpoint, email, Amazon SQS, AWS Lambda, and SMS.

To set up your destination, navigate to the AWS Management Console. Sign in and open the SNS console.

Choose Topics, Create Topic.

In the Create topic page, fill out values for Name and Display name. I chose sensor-alerting for both. Choose Create topic.

Now subscribe to your topic. You can do this from the topic page, as the console automatically returns you there when you complete topic creation. You can also subscribe from the Subscriptions tab in the left navigation pane. From the topic page, choose Create subscription.

On the Create Subscription page, for Protocol, choose Email. Fill in your email address in the Endpoint box and choose Create subscription. Make a note of the Topic ARN here, as you refer to it again later.

Finally, confirm your subscription by clicking the confirmation link in the email that SNS sends to you.

Step 2: Set up a role

To let Amazon ES publish alerts to your topic, create an IAM role with the proper permissions. Before you get started, copy the Topic ARN from the SNS topic page in Step 1.

Your role has two components: trusted entities and permissions for entities that assume the role. The console doesn’t support creating a role with Amazon ES as a trusted entity. Create a role with EC2 as the trusted entity and then edit the JSON trust document to change the entity.

In the AWS Management Console, open the IAM console and choose Roles, Create role.

On the Create role page, choose AWS Service and EC2. Choose Next: Permissions.

On the permissions page, choose Create policy. This brings you to a new window to create the policy. Don’t close the old tab, as you return to it in a moment.

The policy that you create in this step defines the permissions for entities that assume the role. Add a policy document that allows various entities (Amazon ES in this case) to publish to your SNS topic.

On the Create policy page, choose the JSON tab and copy-paste to replace the JSON text with the following code. Replace the sns-topic-arn in the code with the ARN for the topic that you created earlier. After you have done this, choose Review policy.

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": "sns:Publish",
    "Resource": "sns-topic-arn"
  }]
}

On the Review policy page, give your policy a name. I chose SensorAlertingPolicy in this example. Choose Create policy.

Return to the Create role window or tab. Use the refresh button to reload the policies and type the name of your policy in the search box. Select the check box next to your policy. Choose Next: Tags, then choose Next: Review. You can also add tags to make your role easier to search.

On the Review page, give your role a name. I used SensorAlertingRole in this example. Choose Create role.

To change the trusted entity for the role to Amazon ES, in the IAM console, choose Roles. Type SensorAlertingRole in the search box, and choose the link (not the check box) to view that role. Choose Trust relationships, Edit trust relationship.

Edit the Policy Document code to replace ec2.amazonaws.com with es.amazonaws.com. Your completed policy document should look like the following code example:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "es.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Choose Update Trust Policy. Make a note of your role ARN, as you refer to it again.

Step 3: Set up Amazon ES alerting

I pointed my IoT sensor simulator at my Amazon ES domain. This creates data that serves as the basis for the monitors and alerts. To do this yourself, navigate to your Kibana endpoint in your browser and choose Alerting in the left navigation pane. At the top of the window, choose Destinations, Add Destination.

In the Add Destination dialog, give your destination a name. For Type, choose SNS, and set the SNS topic ARN to the topic ARN that you created in Step 1. Set the IAM role ARN to the role ARN that you created in Step 2. Choose Create. You can set as many destinations as you like, allowing you to alert multiple people in the event of a problem.

Step 4: Set up a monitor

Monitors in Open Distro for Amazon ES allow you to specify a value to monitor. You can select the value either graphically or by specifying an Amazon ES query. You define a monitor first and then define triggers for the monitored value.

In Kibana, choose Monitors, Create Monitor.

Give your monitor a name. I named my monitor Device CPUs. You can set the frequency to one of the predefined intervals, or use a cron expression for more granular control. I chose Every 1 minute.

Scroll to the Define Monitor section of the page. Use this set of controls to specify the value to monitor. You can enter a value for Index or Indexes, Time field, and a target value. Choose Define using visual graph from the How do you want to define your monitor? list. You can also enter information for Define using extraction query, allowing you to provide a query that produces a value to monitor. For simple thresholds, the visual interface is fast and easy.
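If you define the monitor with an extraction query instead, the query body might look something like the following sketch. The field names (CPU and the timestamp field) are assumptions based on this example; adjust them to match your index mapping.

    // JSON document - illustrative extraction query
    {
        "size": 0,
        "query": {
            "range": {
                "timestamp": {
                    "gte": "now-5m"
                }
            }
        },
        "aggregations": {
            "max_cpu": {
                "max": {
                    "field": "CPU"
                }
            }
        }
    }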

Select the Index value to monitor from the list. The list contains individual indexes. To use a wildcard, you can also type in the text box. For the value to register, you must press Enter after typing the index name (for example, “logs-*” <enter>).

Choose a value for Time field from the list. This reveals several selectors on top of a graph. Choose Count() and open the menu to see the aggregations for computing the value. Choose max(), then choose CPU for Select a field. Finally, set FOR THE LAST to 5 minute(s). Choose Create.

You can create your monitor visually or provide a query to produce the value to monitor.

I chose the logs-* index to monitor the max value of the CPU field, but this doesn’t create a trigger yet. Choose Create. This brings you to the Define Trigger page.

Step 5: Create a trigger

To create a trigger, specify the threshold value for the field that you’re monitoring. When the value of the field exceeds the threshold, the monitor enters an Active state. I created a trigger called CPU Too High, with a threshold value of 90 and a severity level of 1.

When you set the trigger conditions, set the action or actions that Amazon ES performs.

To add actions, scroll through the page. I added one action to send a message to my SNS topic—including the monitor name, trigger, severity, and the period over which the alarm has been active. You can use Mustache scripting to create a template for the message that you receive.
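For example, a message template along the following lines produces that content. The ctx variables shown are the ones exposed by Open Distro for Elasticsearch alerting; check the alerting documentation for the full list.

    Monitor {{ctx.monitor.name}} just entered an alert state.

    - Trigger: {{ctx.trigger.name}}
    - Severity: {{ctx.trigger.severity}}
    - Period start: {{ctx.periodStart}}
    - Period end: {{ctx.periodEnd}}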

After you finish adding actions, choose Create at the bottom of the page.

Wrap up

When you return to the Alerting Dashboard, your alert appears in the Completed state. Alerts can exist in a variety of states. Completed signals that the monitor successfully queried your target, and that the trigger is not engaged.

To send the alert into the Active state, I sent simulated sensor data with a failing device whose CPU ramped up from 50% to 100%. When it hit 90%, I received the following email:

Conclusion

In this post, I demonstrated how Amazon ES alerting lets you monitor the critical data in your log files so that you can respond quickly when things start to go wrong. By identifying KPIs, setting thresholds, and distributing alerts to your first responders, you can improve your response time for critical issues.

If you have questions or feedback, leave them below, or reach out on Twitter!


About the Author

Jon Handler (@_searchgeek) is a Principal Solutions Architect at Amazon Web Services based in Palo Alto, CA. Jon works closely with the CloudSearch and Elasticsearch teams, providing help and guidance to a broad range of customers who have search workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included four years of coding a large-scale, eCommerce search engine.

Alerting, monitoring, and reporting for PCI-DSS awareness with Amazon Elasticsearch Service and AWS Lambda

Post Syndicated from Michael Coyne original https://aws.amazon.com/blogs/security/alerting-monitoring-and-reporting-for-pci-dss-awareness-with-amazon-elasticsearch-service-and-aws-lambda/

Logging account activity within your AWS infrastructure is paramount to your security posture and could even be required by compliance standards such as PCI-DSS (Payment Card Industry Data Security Standard). Organizations often analyze these logs to adapt to changes and respond quickly to security events. For example, if users are reporting that their resources are unable to communicate with the public internet, it would be beneficial to know if a network access list had been changed just prior to the incident. Many of our customers ship AWS CloudTrail event logs to an Amazon Elasticsearch Service cluster for this type of analysis. However, security best practices and compliance standards could require additional considerations. Common concerns include how to analyze log data without the data leaving the security constraints of your private VPC.

In this post, I’ll show you not only how to store your logs, but how to put them to work to help you meet your compliance goals. This implementation deploys an Amazon Elasticsearch Service domain with Amazon Virtual Private Cloud (Amazon VPC) support by utilizing VPC endpoints. A VPC endpoint enables you to privately connect your VPC to Amazon Elasticsearch without requiring an internet gateway, NAT device, VPN connection, or AWS Direct Connect connection. An AWS Lambda function is used to ship AWS CloudTrail event logs to the Elasticsearch cluster. A separate AWS Lambda function performs scheduled queries on log sets to look for patterns of concern. Amazon Simple Notification Service (SNS) generates automated reports based on a sample set of PCI guidelines discussed further in this post and notifies stakeholders when specific events occur. Kibana serves as the command center, providing visualizations of CloudTrail events that need to be logged based on the provided sample set of PCI-DSS compliance guidelines. The automated report and dashboard that are constructed around the sample PCI-DSS guidelines assist in event awareness regarding your security posture and should not be viewed as a de facto means of achieving certification. This solution serves as an additional tool to provide visibility into the actions and events within your environment. Deployment is made simple with a provided AWS CloudFormation template.
 

Figure 1: Architectural diagram

The figure above depicts the architecture discussed in this post. An Elasticsearch cluster with VPC support is deployed within an AWS Region and Availability Zone. This creates a VPC endpoint in a private subnet within a VPC. Kibana is an Elasticsearch plugin that resides within the Elasticsearch cluster; it is accessed through an endpoint provided in the output section of the CloudFormation template. CloudTrail is enabled in the VPC and ships CloudTrail events to both an S3 bucket and a CloudWatch Log Group. The CloudWatch Log Group triggers a custom Lambda function that ships the CloudTrail event logs to the Elasticsearch domain through the VPC endpoint. An additional Lambda function is created that performs a periodic set of Elasticsearch queries and produces a report that is sent to an SNS topic. A Windows-based EC2 instance is deployed in a public subnet so users have the ability to view and interact with a Kibana dashboard. Access to the EC2 instance can be restricted to an allowed CIDR range through a parameter set in the CloudFormation deployment. Access to the Elasticsearch cluster and Kibana is restricted to a Security Group that is created and is associated with the EC2 instance and custom Lambda functions.

Sample PCI-DSS Guidelines

This solution provides a sample set of 10 PCI-DSS guidelines for events that need to be logged:

  • All Commands, API action taken by AWS root user
  • All failed logins at the AWS platform level
  • Action related to RDS (configuration changes)
  • Action related to enabling/disabling/changing of CloudTrail, CloudWatch logs
  • All access to S3 bucket that stores the AWS logs
  • Action related to VPCs (creation, deletion and changes)
  • Action related to changes to SGs/NACLs (creation, deletion and changes)
  • Action related to IAM users, roles, and groups (creation, deletion and changes)
  • Action related to route tables (creation, deletion and changes)
  • Action related to subnets (creation, deletion and changes)

Solution overview

In this walkthrough, you’ll create an Elasticsearch cluster within an Amazon VPC environment. You’ll ship AWS CloudTrail logs to both an Amazon S3 bucket (to maintain an immutable copy of the logs) and to a custom AWS Lambda function that will stream the logs to the Elasticsearch cluster. You’ll also create an additional Lambda function that will run once a day and build a report of the number of CloudTrail events that occurred based on the example set of 10 PCI-DSS guidelines and then notify stakeholders via SNS.

To make it easier to get started, I’ve included an AWS CloudFormation template that will automatically deploy the solution. The CloudFormation template along with additional files can be downloaded from this link. You’ll need the following resources to set it up:

  • An S3 bucket to upload and store the sample AWS Lambda code and sample Kibana dashboards. This bucket name will be requested during the CloudFormation template deployment.
  • An Amazon Virtual Private Cloud (Amazon VPC).

If you’re unfamiliar with how CloudFormation templates work, you can find more info in the CloudFormation Getting Started guide.

AWS CloudFormation deployment

The following parameters are available in this template.

  • Elasticsearch Domain Name: Name of the Amazon Elasticsearch Service domain.
  • Elasticsearch Version (default: 6.2): Version of Elasticsearch to deploy.
  • Elasticsearch Instance Count (default: 3): The number of data nodes to deploy into the Elasticsearch cluster.
  • Elasticsearch Instance Class: The instance class to deploy for the Elasticsearch data nodes.
  • Elasticsearch Instance Volume Size (default: 10): The size of the volume for each Elasticsearch data node, in GB.
  • VPC to launch into: The VPC to launch the Amazon Elasticsearch Service cluster into.
  • Availability Zone to launch into: The Availability Zone to launch the Amazon Elasticsearch Service cluster into.
  • Private Subnet ID: The subnet to launch the Amazon Elasticsearch Service cluster into.
  • Elasticsearch Security Group: A new Security Group is created and associated with the Amazon Elasticsearch Service cluster.
  • Security Group Description: A description for the Security Group created above.
  • Windows EC2 Instance Class (default: m5.large): Windows instance for interaction with Kibana.
  • EC2 Key Pair: EC2 Key Pair to associate with the Windows EC2 instance.
  • Public Subnet: Public subnet to associate with the Windows EC2 instance for access.
  • Remote Access Allowed CIDR (default: 0.0.0.0/0): The CIDR range allowed remote access (port 3389) to the EC2 instance.
  • S3 Bucket Name—Lambda Functions: S3 bucket that contains the custom AWS Lambda functions.
  • Private Subnet: Private subnet to associate with the AWS Lambda functions that are deployed within a VPC.
  • CloudWatch Log Group Name: Creates a CloudWatch Log Group for the AWS CloudTrail event logs.
  • S3 Bucket Name—CloudTrail logging: Creates a new Amazon S3 bucket for logging CloudTrail events. The name must be globally unique.
  • Date range to perform queries (default: now-1d): Examples: now-1d, now-7d, now-90d.
  • Lambda Subnet CIDR: The subnet CIDR to create for the AWS Lambda Elasticsearch query function.
  • Availability Zone—Lambda: The Availability Zone to associate with the preceding AWS Lambda subnet.
  • Email Address (default: [email protected]): Email address for reporting, used to notify stakeholders via SNS. You must accept the subscription by selecting the link sent to this address before alerts will arrive.

It takes 30-45 minutes for this stack to be created. When it’s complete, the CloudFormation console will display the following resource values in the Outputs tab. These values can be referenced at any time and will be needed in the following sections.

  • ElasticsearchDomainEndpoint: Elasticsearch domain endpoint hostname
  • KibanaEndpoint: Kibana endpoint hostname
  • EC2Instance: Windows EC2 instance name used for Kibana access
  • SNSSubscriber: SNS subscriber email address
  • ElasticsearchDomainArn: ARN of the Elasticsearch domain
  • EC2InstancePublicIp: Public IP address of the Windows EC2 instance

Managing and testing the solution

Now that you’ve set up the environment, it’s time to configure the Kibana dashboard.

Kibana configuration

From the AWS CloudFormation output, gather information related to the Windows-based EC2 instance. Once you have retrieved that information, move on to the next steps.

Initial configuration and index pattern

  1. Log into the Windows EC2 instance via Remote Desktop Protocol (RDP) from a resource that is within the allowed CIDR range for remote access to the instance.
  2. Open a browser window and navigate to the Kibana endpoint hostname URL from the output of the AWS CloudFormation stack. Access to the Elasticsearch cluster and Kibana is restricted to the security group that is associated with the EC2 instance and custom Lambda functions during deployment.
  3. In the Kibana dashboard, select Management from the left panel and choose the link for Index Patterns.
  4. Add one index pattern containing the following: cwl-*
     
    Figure 2: Define the index pattern

  5. Select Next Step.
  6. Select the Time Filter Field named @timestamp.
     
    Figure 3: Select "@timestamp"

  7. Select Create index pattern.

At this point we’ve launched our environment and have accessed the Kibana console. Within the Kibana console, we’ve configured the index pattern for the CloudWatch logs that will contain the CloudTrail events. Next, we’ll configure visualizations and a dashboard.

Importing sample PCI DSS queries and Kibana dashboard

  1. Copy export.json from the location where you extracted the downloaded .zip file to the EC2 Kibana bastion host.
  2. Select Management on the left panel and choose the link for Saved Objects.
  3. Select Import in the upper-right corner and navigate to export.json.
  4. Select Yes, overwrite all saved objects, then select Index Pattern cwl-* and confirm all changes.
  5. Once the import completes, select PCI DSS Dashboard to see the sample dashboard and queries.

Note: You might encounter an error during the import that looks like this:
 

Figure 4: Error message

This simply means that your streamed logs do not have login-type events in the time period since your deployment. To correct this, you can index a placeholder document that contains the missing field.

  1. From the left panel, select Dev Tools and copy the following JSON into the left panel of the console:
    
            POST /cwl-/default/
            {
                "userIdentity": {
                    "userName": "test"
                }
            }              
     

  2. Select the green Play triangle to execute the POST of a document with the missing field.
     
    Figure 5: Select the "Play" button

  3. Now reimport the dashboard using the steps in Importing Sample PCI DSS Queries and Kibana Dashboard. You should be able to complete the import with no errors.

At this point, you should have CloudTrail events that have been streamed to the Elasticsearch cluster, with a configured Kibana dashboard that looks similar to the following graphic:
 

Figure 6: A configured Kibana dashboard

Automated Reports

A custom AWS Lambda function was created during the deployment of the AWS CloudFormation stack. This function uses the sample PCI-DSS guidelines from the Kibana dashboard to build a daily report. The Lambda function is triggered every 24 hours and performs a series of time-based Elasticsearch queries over now-1d (the last 24 hours) on the sample guidelines. The results are compiled into a message that is forwarded to Amazon Simple Notification Service (SNS), which sends a report to stakeholders based on the email address you provided in the CloudFormation deployment.

The Lambda function is named <CloudFormation Stack Name>-ES-Query-LambdaFunction. Environment variables let you adjust settings such as the query time window, and you can extend the code with additional Elasticsearch queries. The sample report below lets you monitor events against the sample PCI-DSS guidelines; these reports can then be analyzed further in the Kibana dashboard.
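As an illustration of the kind of query the function runs against the cwl-* indexes, a count of root-user API calls over the report window might look like the following sketch. The userIdentity.type field comes from the standard CloudTrail event format; this is not the deployed function's code.

    // JSON document - illustrative query for one guideline (root user API activity)
    {
        "size": 0,
        "query": {
            "bool": {
                "filter": [
                    { "match": { "userIdentity.type": "Root" } },
                    { "range": { "@timestamp": { "gte": "now-1d" } } }
                ]
            }
        }
    }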


    Logging Compliance Report - Wednesday, 11. July 2018 01:06PM
    Violations for time period: 'now-1d'
    
    All Failed login attempts
    - No Alerts Found
    All Commands, API action taken by AWS root user
    - No Alerts Found
    Action related to RDS (configuration changes)
    - No Alerts Found
    Action related to enabling/disabling/changing of CloudTrail CloudWatch logs
    - 3 API calls indicating alteration of log sources detected
    All access to S3 bucket that stores the AWS logs
    - No Alerts Found
    Action related to VPCs (creation, deletion and changes)
    - No Alerts Found
    Action related to changes to SGs/NACLs (creation, deletion and changes)
    - No Alerts Found
    Action related to changes to IAM roles, users, and groups (creation, deletion and changes)
    - 2 API calls indicating creation, alteration or deletion of IAM roles, users, and groups
    Action related to changes to Route Tables (creation, deletion and changes)
    - No Alerts Found
    Action related to changes to Subnets (creation, deletion and changes)
    - No Alerts Found         

Summary

At this point, you have now created a private Elasticsearch cluster with Kibana dashboards that monitors AWS CloudTrail events on a sample set of PCI-DSS guidelines and uses Amazon SNS to send a daily report providing awareness into your environment—all isolated securely within a VPC. In addition to CloudTrail events streaming to the Elasticsearch cluster, events are also shipped to an Amazon S3 bucket to maintain an immutable source of your log files. The provided Lambda functions can be further modified to add additional or more complex search queries and to create more customized reports for your organization. With minimal effort, you could begin sending additional log data from your instances or containers to gain even more insight into the security state of your environment. The more data you retain, the more visibility you have into your resources and the closer you are to achieving Compliance-on-Demand.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Michael Coyne

Michael is a consultant for AWS Professional Services. He enjoys the fast-paced environment of ever-changing technology and assisting customers in solving complex issues. Away from AWS, Michael can typically be found with a guitar and spending time with his wife and two young kiddos. He holds a BS in Computer Science from WGU.

How to analyze AWS WAF logs using Amazon Elasticsearch Service

Post Syndicated from Tom Adamski original https://aws.amazon.com/blogs/security/how-to-analyze-aws-waf-logs-using-amazon-elasticsearch-service/

Log analysis is essential for understanding the effectiveness of any security solution. It can be valuable for day-to-day troubleshooting and also for your long-term understanding of how your security environment is performing.

AWS WAF is a web application firewall that helps protect your web applications from common web exploits that could affect application availability, compromise security, or consume excessive resources. AWS WAF gives you control over which traffic to allow or block to your web applications by defining customizable web security rules.

With the release of access to full AWS WAF logs, you now have the ability to view all of the logs generated by AWS WAF while it’s protecting your web applications. In addition, you can use Amazon Kinesis Data Firehose to forward the logs to Amazon Simple Storage Service (Amazon S3) for archival, and to Amazon Elasticsearch Service to analyze access to your web applications.

In this blog post, I’ll show you how you can analyze AWS WAF logs using Amazon Elasticsearch Service (Amazon ES). I’ll walk you through how to find out in near-real time which AWS WAF rules get triggered, why, and by which request. I’ll also show you how to create a historical view of your web applications’ access trends for long-term analysis.

Architecture overview

When enabled, AWS WAF sends logs to Amazon Kinesis Data Firehose. From there, you can decide what to do with them next. In my architecture, I’ll forward the relevant logs to Amazon ES for analysis and to Amazon S3 for long-term storage.
 

Figure 1: Architecture overview

In this blog, I will show you how to enable AWS WAF logging to Amazon ES in the following steps:

  1. Create an Amazon Elasticsearch Service domain.
  2. Configure Kinesis Data Firehose for log delivery.
  3. Enable AWS WAF logs on a web access control list (ACL) to send data to Kinesis Data Firehose.
  4. Analyze AWS WAF logs in Amazon ES.

Note: Kinesis Data Firehose must be enabled in the same region where your AWS WAF web ACL is deployed. If you’re enabling logging for AWS WAF deployed on Amazon CloudFront, then Kinesis Data Firehose must be created in the Northern Virginia AWS Region.

Preparing Amazon Elasticsearch Service

If you don’t have your Amazon ES environment already running on AWS, you’ll need to create one. Start by navigating to the Elasticsearch Service from your AWS Console and choosing Create a new domain. There, you will be able to define the namespace for your Elasticsearch cluster and choose the version you’d like to use. In my setup, I’ll be using awswaf-logs as my namespace and version 6.3 of Elasticsearch.
 

Figure 2: Defining Amazon ES Domain

You can also decide on other parameters, like the number of nodes in your cluster or how to access the dashboards. See the Amazon Elasticsearch Service Documentation to find out more details about setting up your environment, or review this previous blog post, which goes deeper into Amazon ES setup.

When your environment is ready, you’ll be able to click on your namespace and find the link to your Kibana dashboard. Kibana is the data visualization plugin for Elasticsearch that you’ll use to analyze your logs.
 

Figure 3: Identifying the Kibana management URL

While Elasticsearch can automatically classify most of the fields from the AWS WAF logs, you need to inform Elasticsearch how to interpret fields that have specific formatting. Therefore, before you start sending logs to Elasticsearch, you should create an index pattern template so Elasticsearch can distinguish AWS WAF logs and classify the fields correctly.

The pattern template I’m using below will be applied to all indexes beginning with the string awswaf-. That’s because in my setup all AWS WAF log index files created in Elasticsearch will always begin with the character string awswaf- followed by a date.

The pattern template defines two fields from the logs. It indicates to Elasticsearch that the httpRequest.clientIp field uses an IP address format and that the timestamp field is represented in epoch time (milliseconds). All the other log fields will be classified automatically.

In the template, I’m also setting the number of Elasticsearch shards that should be used for the index. Shards help with distributing large amounts of data, and their number depends on the size of the index—the larger the index, the more shards it should be deployed across. I’m going to use only one shard since I expect my index to be less than 30 GB. For recommendations on sizing your shards appropriately, refer to Jon Handler’s blog post on using Amazon Elasticsearch Service.


PUT _template/awswaf-logs
{
  "index_patterns": ["awswaf-*"],
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "waflog": {
      "properties": {
        "httpRequest": {
          "properties": {
            "clientIp": {
              "type": "keyword",
              "fields": {
                "keyword": {
                  "type": "ip"
                }
              }
            }
          }
        },
        "timestamp": {
          "type": "date",
          "format": "epoch_millis"
        }
      }
    }
  }
}

I’m going to apply this template to Elasticsearch through the Kibana developer interface (from the Dev Tools tab in the Kibana dashboard), but it can also be done through an API call to the Amazon ES endpoint.
 

Figure 4: Applying the index pattern template in Kibana
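
If you’d rather apply the template programmatically instead of through the Kibana Dev Tools console, a minimal Python sketch like the one below sends the same request to the domain endpoint. The endpoint URL is a placeholder, and if your domain uses IAM-based access control the request would also need to be signed with SigV4, which this sketch omits.

import requests

ES_ENDPOINT = "https://search-awswaf-logs-xxxxxxxx.us-east-1.es.amazonaws.com"  # placeholder

template = {
    "index_patterns": ["awswaf-*"],
    "settings": {"number_of_shards": 1},
    "mappings": {
        "waflog": {
            "properties": {
                "httpRequest": {
                    "properties": {
                        "clientIp": {
                            "type": "keyword",
                            "fields": {"keyword": {"type": "ip"}},
                        }
                    }
                },
                "timestamp": {"type": "date", "format": "epoch_millis"},
            }
        }
    },
}

# PUT the template so that new awswaf-* indexes pick up these field mappings
resp = requests.put(f"{ES_ENDPOINT}/_template/awswaf-logs", json=template)
resp.raise_for_status()
print(resp.json())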

The template will be applied to all log indexes starting with the awswaf- prefix. This is the prefix name you’ll use at a later stage in the Kinesis Data Firehose configuration.

Setting up Kinesis for log delivery

With Amazon ES now ready to receive AWS WAF logs, I’ll show you how to use Amazon Kinesis Data Firehose to deliver your logs to Amazon S3 and Amazon ES. This is a prerequisite before you enable logging in AWS WAF.

First, go to the Kinesis service in the AWS Console and under the Data Firehose tab choose Create delivery stream. If the Data Firehose stream is going to be used for WAF logs, its name must start with aws-waf-logs-. I’ll enable logging for the already existing web ACL that I’m using for OWASP top 10 protection, therefore I’ll name my delivery stream aws-waf-logs-owasp.

Keep all the settings at the default until you get to the Select destination page. On this page, you’ll configure Amazon Elasticsearch Service as the destination for your logs. Make sure you enter awswaf as the index name and waflog as the type to match the index pattern template you already created in Elasticsearch earlier.
 

Figure 5: Setting up Amazon ES as Data Firehose log destination

On this page, you can also set what to back up to Amazon S3—either all records or just the failed ones. You’ll then need to decide on the bucket to use, specify the prefix, and then select Next.
 

Figure 6: Setting up S3 backup in Data Firehose

On the next page, you’ll have the option to update the buffer conditions—size and interval. These indicate how often Kinesis Data Firehose will send new data to Amazon ES. The lower the values you set, the closer to real time your log delivery will be. In my example, I’ve configured a buffer size of 1 MB and a buffer interval of 60 seconds.
 

Figure 7: Setting up Data Firehose buffer conditions

Next, you’ll need to specify an IAM role that enables Kinesis Data Firehose to send data to Amazon S3 and Amazon ES. You can select an existing IAM role or create a new one. For more information about creating a role, see Controlling Access with Amazon Kinesis Data Firehose.

Finally, review all the settings, and complete the creation of your Kinesis Data Firehose.
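
If you want to script the delivery stream instead of using the console, the following is a rough boto3 sketch of the same configuration. The role, domain, and bucket ARNs are placeholders, and you may want different buffering, rotation, or backup settings for your environment.

import boto3

firehose = boto3.client("firehose", region_name="us-east-1")  # same region as the web ACL

# Placeholder ARNs - replace with your own IAM role, Amazon ES domain, and S3 bucket
ROLE_ARN = "arn:aws:iam::123456789012:role/firehose-waf-logs-role"
DOMAIN_ARN = "arn:aws:es:us-east-1:123456789012:domain/awswaf-logs"
BUCKET_ARN = "arn:aws:s3:::my-waf-log-backup-bucket"

firehose.create_delivery_stream(
    DeliveryStreamName="aws-waf-logs-owasp",    # must start with aws-waf-logs-
    DeliveryStreamType="DirectPut",
    ElasticsearchDestinationConfiguration={
        "RoleARN": ROLE_ARN,
        "DomainARN": DOMAIN_ARN,
        "IndexName": "awswaf",                  # matches the awswaf-* index pattern template
        "TypeName": "waflog",
        "IndexRotationPeriod": "OneDay",
        "BufferingHints": {"IntervalInSeconds": 60, "SizeInMBs": 1},
        "S3BackupMode": "FailedDocumentsOnly",  # or AllDocuments to back up every record
        "S3Configuration": {
            "RoleARN": ROLE_ARN,
            "BucketARN": BUCKET_ARN,
            "Prefix": "awswaf-logs/",
            "CompressionFormat": "GZIP",
        },
    },
)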

Enabling AWS WAF logging

Now that you have Kinesis Data Firehose prepared, you can enable logging for an existing AWS WAF Web ACL. You can do this in the AWS Console under the AWS WAF & AWS Shield service. Go to the Web ACLs tab and select the Web ACL for which you want to start logging. Then, under the Logging tab, choose Enable Logging.
 

Figure 8: Enabling logging for AWS WAF web ACL

On the next page, specify the Kinesis Data Firehose that the logs should be delivered to. If you can’t see any options in the drop-down list, make sure that the Kinesis Data Firehose you configured earlier is deployed in the same region as the web ACL you’re enabling logging for and that its name begins with aws-waf-logs-. Remember that for CloudFront web ACL deployments, the Kinesis Data Firehose should be created in the Northern Virginia Region.
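
This step can also be scripted. Below is a hedged boto3 sketch that associates the delivery stream with a CloudFront (global) web ACL; for a regional web ACL you would use the waf-regional client in the web ACL’s region instead. Both ARNs are placeholders.

import boto3

# Use "waf" for CloudFront (global) web ACLs, "waf-regional" for regional web ACLs
waf = boto3.client("waf", region_name="us-east-1")

WEB_ACL_ARN = "arn:aws:waf::123456789012:webacl/EXAMPLE-WEB-ACL-ID"                        # placeholder
FIREHOSE_ARN = "arn:aws:firehose:us-east-1:123456789012:deliverystream/aws-waf-logs-owasp"  # placeholder

waf.put_logging_configuration(
    LoggingConfiguration={
        "ResourceArn": WEB_ACL_ARN,
        "LogDestinationConfigs": [FIREHOSE_ARN],  # the Kinesis Data Firehose created earlier
    }
)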

Searching through logs

As soon as you log back into your Kibana dashboard for Amazon ES, you should see a new index appear called awswaf-<date>. You’ll need to define an index pattern that will encompass all the potential awswaf log entries. Configure your index pattern description as awswaf-*, and on the next screen select timestamp as the time filter.
 

Figure 9: Configuring the Kibana index pattern

 

Figure 10: Configuring a time filter for the index pattern

After you complete the setup of your index pattern, you’ll be able to run searches through your logs by going into the Discover tab in Kibana. The searches can be based on any fields in the AWS WAF logs. For example, you can look for specific HTTP headers, query strings, or source IP addresses to find out what action has been applied to them. To find out more about what fields are available in the AWS WAF logs, see the AWS WAF Developer Guide.

In the example search that follows, I’m looking for all of the log entries where a web ACL has blocked the traffic and the HTTP arguments were equal to name=badname.
 

Figure 11: An example ad-hoc search

This approach can help you identify specific requests based on their parameters, such as a cookie, host header, or query string, and understand why they are being blocked (or allowed) and by which web ACL. It is especially useful for tuning AWS WAF to match your web application.
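
As a reference point, here’s a sketch of running a similar search directly against the _search API with Python. It assumes the blocked or allowed decision is stored in the action field and the query string in httpRequest.args; the endpoint is a placeholder, and an IAM-protected domain would also require request signing.

import requests

ES_ENDPOINT = "https://search-awswaf-logs-xxxxxxxx.us-east-1.es.amazonaws.com"  # placeholder

query = {
    "size": 10,
    "query": {
        "bool": {
            "must": [
                {"term": {"action.keyword": "BLOCK"}},                   # blocked requests only
                {"match_phrase": {"httpRequest.args": "name=badname"}},  # query string to look for
            ]
        }
    },
    "sort": [{"timestamp": {"order": "desc"}}],
}

resp = requests.post(f"{ES_ENDPOINT}/awswaf-*/_search", json=query)
for hit in resp.json()["hits"]["hits"]:
    src = hit["_source"]
    print(src["timestamp"], src["httpRequest"]["clientIp"], src["action"])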

Building a dashboard

You can also use Amazon ES to create custom graphs to look at long-term trends—and you can combine your graphs into a dashboard. For example, you may want to see the number of blocked and allowed requests by IP address to identify potentially hostile IP addresses.

To understand the breakdown of the requests per IP address, you should first set up a graph showing the number of requests that were blocked and allowed by AWS WAF per source. To do this, go to the Visualize section of your Kibana dashboard and create a new visualization. You can select from a variety of visualization options, but I’m going to use the horizontal bar chart for the example that follows.
 

Figure 12: Configuring primary bucket parameters

  1. Select your index and then configure your graph options. For a horizontal bar chart, leave the Y-axis set to the count of requests, then expand the primary bucket (X-Axis) section and change the fields to show the source IP address and action:
    1. From the Aggregation dropdown, select Terms.
    2. From the Field dropdown, select httpRequest.clientIp.
    3. From the Order By dropdown, select metric: Count.
       
      Figure 13: Configuring bucket series

  2. Next, expand the Split Series section and adjust the following dropdown fields:
    1. From the Sub Aggregation dropdown, select Terms.
    2. From the Field dropdown select action.keyword.
    3. From the Order By dropdown, select metric: Count.
       
      Figure 14: Sample Kibana graphs

    4. Apply your changes by selecting the play button at the top of the page. You should see the results appear on the right side of the screen.
       
      Figure 15: Select the “Apply changes” button

You can combine graphs into dashboards that aggregate relevant information about AWS WAF functionality. In the sample dashboard below, I’m showing multiple visualizations focusing on different dimensions, such as requests per country, requests per IP address, and even which AWS WAF rules get the most hits.

Figure 16: You can combine graphs

I’ve made the sample dashboard and sample visualizations available for download. They can be imported to your Kibana instance in the Management/SavedObjects section of the dashboard.

Conclusion

The ability to access AWS WAF logs for any web ACL allows you to start analyzing the requests coming to your web applications. In this post, I’ve covered an example of how you can use Amazon Elasticsearch Service as a destination for AWS WAF logs and shown you how to run searches on the log data as well as build graphs and dashboards. Amazon ES is just one of the destination options for native log delivery. To learn more, please refer to the documentation on how to send data directly to Amazon S3, Amazon Redshift, and Splunk.

If you have feedback about this blog post, submit comments in the Comments section below. If you have questions about this blog post, start a new thread on the WAF forum or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Tom Adamski

Tom is a Specialist Solutions Engineer in Networking. He helps customers across the EMEA region design their network environments on AWS. He has over 10 years of experience in the networking and security industries. In his spare time, he’s an avid surfer who’s always on the lookout for cold water surf breaks.

Viewing Amazon Elasticsearch Service Error Logs

Post Syndicated from Kevin Fallis original https://aws.amazon.com/blogs/big-data/viewing-amazon-elasticsearch-service-error-logs/

Today, Amazon Elasticsearch Service (Amazon ES) announces support for publishing error logs to Amazon CloudWatch Logs.  This new feature provides you with the ability to capture error logs so you can access information about errors and warnings raised during the operation of the service. These details can be useful for troubleshooting. You can then use this information to work with your users to identify patterns that cause error or warning scenarios on your domain.

Access to the feature is enabled as soon as your domain is created.

You can turn the logs on and off at will, paying only the CloudWatch charges incurred by their usage.

Set up delivery of error logs for your domain

To enable error logs for an active domain, sign in to the AWS Management Console and choose Elasticsearch Service. On the Amazon ES console, choose your domain name in the list to open its dashboard. Then choose the Logs tab.

In this pane, you configure your Amazon ES domain to publish search slow logs, indexing slow logs, and error logs to a CloudWatch Logs log group. You can find more information on setting up slow logs in the blog post Viewing Amazon Elasticsearch Service Slow Logs on the AWS Database Blog.

Under Set up Error Logs, choose Setup.

You can choose to Create new log group or Use existing log group. We recommend naming your log group as a path, such as:

/aws/aes/domains/mydomain/application-logs/

This naming scheme makes it easier to apply a CloudWatch access policy, in which you can grant permissions to all log groups under a specific path, such as:

/aws/aes/domains

To deliver logs to your CloudWatch Logs group, you need to specify a policy for Amazon ES so it can publish to CloudWatch Logs on your behalf.  You can choose to Create a new policy or Select an existing policy. You can accept the policy as is. Or, if your log group names are paths, you can widen the Resource—for example:

arn:aws:logs:us-east-1:123456789012:log-group:/aws/aes/domains/*

You can then reuse this policy for all your domains.

Once you have saved the policy for the domain, choose Enable to complete the setup. Your domain can now send error logs to CloudWatch Logs.
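
If you prefer to script this setup instead of using the console, here is a minimal boto3 sketch that attaches the CloudWatch Logs resource policy and then enables error log publishing on the domain. The account ID, region, domain name, and log group name are placeholders, and the sketch assumes the log group doesn’t already exist.

import json
import boto3

REGION = "us-east-1"
ACCOUNT_ID = "123456789012"              # placeholder account ID
DOMAIN_NAME = "mydomain"                 # placeholder Amazon ES domain name
LOG_GROUP = f"/aws/aes/domains/{DOMAIN_NAME}/application-logs"
LOG_GROUP_ARN = f"arn:aws:logs:{REGION}:{ACCOUNT_ID}:log-group:{LOG_GROUP}"

logs = boto3.client("logs", region_name=REGION)
es = boto3.client("es", region_name=REGION)

# Create the log group (fails if it already exists)
logs.create_log_group(logGroupName=LOG_GROUP)

# Allow Amazon ES to publish to any log group under the /aws/aes/domains path
logs.put_resource_policy(
    policyName="AES-log-publishing",
    policyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "es.amazonaws.com"},
            "Action": ["logs:PutLogEvents", "logs:CreateLogStream"],
            "Resource": f"arn:aws:logs:{REGION}:{ACCOUNT_ID}:log-group:/aws/aes/domains/*",
        }],
    }),
)

# Enable publishing of the error (application) logs for the domain
es.update_elasticsearch_domain_config(
    DomainName=DOMAIN_NAME,
    LogPublishingOptions={
        "ES_APPLICATION_LOGS": {
            "CloudWatchLogsLogGroupArn": LOG_GROUP_ARN,
            "Enabled": True,
        }
    },
)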

Now that you have enabled the publishing of error logs, you can start monitoring them.

Types of events captured

Elasticsearch uses Apache Log4j 2 and its built-in log levels (from least to most severe) of TRACE, DEBUG, INFO, WARN, ERROR, and FATAL. After you enable error logs, Amazon ES publishes log lines of WARN, ERROR, and FATAL to CloudWatch. Less severe levels (INFO, DEBUG and TRACE) are not available.

Based on this, you can expect to find details for events such as the ones highlighted in the following list.

  • Rejects based on exceeding the configured highlight.max_analyzed_offset parameter limit
  • Painless script compilation issues in a request
  • Detailed information about invalid requests and invalid query formats
  • GC cycles
  • Detailed information about write blocks
  • Issues encountered during snapshot exercises

View your log data

To see your log data, sign in to the AWS Management Console, and open the CloudWatch console. In the left navigation pane, choose the Logs tab. Find your log group in the list of groups and open the log group. Your log group name is the Name that you set when you set up logging in the Amazon ES wizard.

Within your log group, you should see a number of log streams.

Amazon ES creates es-test-log-stream during setup of error logs to ensure that it can write to CloudWatch Logs. This stream contains only a single test message.

Your application error logs arrive within 30 minutes and have long hex names, suffixed by es-application-logs to indicate the source of the log data. Choose one of these to view events based on the last event time.

You should see individual entries for each event in timestamp order. To switch between less granular and more granular detail on an event log entry, you can use the toggle at the top right of the CloudWatch Logs console. Each entry consists of a timestamp, the locus, the node generating the error or warning, and the message text, cleansed of any specifics about the cluster itself, such as you would see in a stack trace.
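
To scan these streams programmatically rather than in the console, something like the following sketch can pull recent entries. The log group name is a placeholder, and the filter pattern simply looks for lines containing ERROR or WARN.

import time
import boto3

logs = boto3.client("logs", region_name="us-east-1")
LOG_GROUP = "/aws/aes/domains/mydomain/application-logs"  # placeholder log group name

# Look at the last 24 hours of error and warning entries
now_ms = int(time.time() * 1000)
resp = logs.filter_log_events(
    logGroupName=LOG_GROUP,
    filterPattern="?ERROR ?WARN",   # matches events containing either term
    startTime=now_ms - 24 * 60 * 60 * 1000,
    endTime=now_ms,
)

for event in resp["events"]:
    print(event["timestamp"], event["message"][:200])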

Conclusion

By enabling the error logs feature, you can gain more insight into issues with your Amazon ES domains and identify problems with domain configurations. You can also use the integration of CloudWatch Logs and Amazon ES to send application error logs to a different Amazon ES domain and monitor your domain’s performance.


About the Author

Kevin Fallis is an AWS solutions architect specializing in search technologies.

AWS Online Tech Talks – May and Early June 2018

Post Syndicated from Devin Watson original https://aws.amazon.com/blogs/aws/aws-online-tech-talks-may-and-early-june-2018/


Join us this month to learn about some of the exciting new services and solution best practices at AWS. We also have our first re:Invent 2018 webinar series, “How to re:Invent”. Sign up now to learn more; we look forward to seeing you.

Note – All sessions are free and in Pacific Time.

Tech talks featured this month:

Analytics & Big Data

May 21, 2018 | 11:00 AM – 11:45 AM PT – Integrating Amazon Elasticsearch with your DevOps Tooling – Learn how you can easily integrate Amazon Elasticsearch Service into your DevOps tooling and gain valuable insight from your log data.

May 23, 2018 | 11:00 AM – 11:45 AM PT – Data Warehousing and Data Lake Analytics, Together – Learn how to query data across your data warehouse and data lake without moving data.

May 24, 2018 | 11:00 AM – 11:45 AM PT – Data Transformation Patterns in AWS – Discover how to perform common data transformations on the AWS Data Lake.

Compute

May 29, 2018 | 01:00 PM – 01:45 PM PT – Creating and Managing a WordPress Website with Amazon Lightsail – Learn about Amazon Lightsail and how you can create, run and manage your WordPress websites with Amazon’s simple compute platform.

May 30, 2018 | 01:00 PM – 01:45 PM PT – Accelerating Life Sciences with HPC on AWS – Learn how you can accelerate your Life Sciences research workloads by harnessing the power of high performance computing on AWS.

Containers

May 24, 2018 | 01:00 PM – 01:45 PM PT – Building Microservices with the 12 Factor App Pattern on AWS – Learn best practices for building containerized microservices on AWS, and how traditional software design patterns evolve in the context of containers.

Databases

May 21, 2018 | 01:00 PM – 01:45 PM PT – How to Migrate from Cassandra to Amazon DynamoDB – Get the benefits, best practices and guides on how to migrate your Cassandra databases to Amazon DynamoDB.

May 23, 2018 | 01:00 PM – 01:45 PM PT – 5 Hacks for Optimizing MySQL in the Cloud – Learn how to optimize your MySQL databases for high availability, performance, and disaster resilience using RDS.

DevOps

May 23, 2018 | 09:00 AM – 09:45 AM PT – .NET Serverless Development on AWS – Learn how to build a modern serverless application in .NET Core 2.0.

Enterprise & Hybrid

May 22, 2018 | 11:00 AM – 11:45 AM PT – Hybrid Cloud Customer Use Cases on AWS – Learn how customers are leveraging AWS hybrid cloud capabilities to easily extend their datacenter capacity, deliver new services and applications, and ensure business continuity and disaster recovery.

IoT

May 31, 2018 | 11:00 AM – 11:45 AM PT – Using AWS IoT for Industrial Applications – Discover how you can quickly onboard your fleet of connected devices, keep them secure, and build predictive analytics with AWS IoT.

Machine Learning

May 22, 2018 | 09:00 AM – 09:45 AM PT – Using Apache Spark with Amazon SageMaker – Discover how to use Apache Spark with Amazon SageMaker for training jobs and application integration.

May 24, 2018 | 09:00 AM – 09:45 AM PT – Introducing AWS DeepLens – Learn how AWS DeepLens provides a new way for developers to learn machine learning by pairing the physical device with a broad set of tutorials, examples, source code, and integration with familiar AWS services.

Management Tools

May 21, 2018 | 09:00 AM – 09:45 AM PT – Gaining Better Observability of Your VMs with Amazon CloudWatch – Learn how CloudWatch Agent makes it easy for customers like Rackspace to monitor their VMs.

Mobile

May 29, 2018 | 11:00 AM – 11:45 AM PT – Deep Dive on Amazon Pinpoint Segmentation and Endpoint Management – See how segmentation and endpoint management with Amazon Pinpoint can help you target the right audience.

Networking

May 31, 2018 | 09:00 AM – 09:45 AM PT – Making Private Connectivity the New Norm via AWS PrivateLink – See how PrivateLink enables service owners to offer private endpoints to customers outside their company.

Security, Identity, & Compliance

May 30, 2018 | 09:00 AM – 09:45 AM PT – Introducing AWS Certificate Manager Private Certificate Authority (CA) – Learn how AWS Certificate Manager (ACM) Private Certificate Authority (CA), a managed private CA service, helps you easily and securely manage the lifecycle of your private certificates.

June 1, 2018 | 09:00 AM – 09:45 AM PT – Introducing AWS Firewall Manager – Centrally configure and manage AWS WAF rules across your accounts and applications.

Serverless

May 22, 2018 | 01:00 PM – 01:45 PM PT – Building API-Driven Microservices with Amazon API Gateway – Learn how to build a secure, scalable API for your application in our tech talk about API-driven microservices.

Storage

May 30, 2018 | 11:00 AM – 11:45 AM PT – Accelerate Productivity by Computing at the Edge – Learn how AWS Snowball Edge support for compute instances helps accelerate data transfers, execute custom applications, and reduce overall storage costs.

June 1, 2018 | 11:00 AM – 11:45 AM PT – Learn to Build a Cloud-Scale Website Powered by Amazon EFS – Technical deep dive where you’ll learn tips and tricks for integrating WordPress, Drupal and Magento with Amazon EFS.