Tag Archives: Elasticsearch

Connecting to Kibana Within an AWS VPC

Post Syndicated from Bozho original https://techblog.bozho.net/connecting-to-kibana-within-an-aws-vpc/

When you use the managed Elasticsearch service on AWS, you usually choose an encrypted connection (via KMS-managed keys), which means you can’t use just any tool to connect to your Elasticsearch cluster. In fact, in order to manually execute commands the easiest option is to use the built-in Kibana and its dev tools.

However, connecting to Kibana is also not trivial due to typical security precautions. Elasticsearch can be run outside or inside a VPC. If you run it outside a VPC, you have to modify its access policy to allow connections from a set of IPs (e.g. your office network).

But if you run it inside a VPC (which is recommended), you have to connect to the VPC. And you have a lot of options for that, but all of them are rather complicated and sometimes even introduce additional cost.

A much simpler approach is to connect via SSH to a machine in the VPC (typically your bastion/jump host) and use it as a SOCKS proxy for your browser. The steps are:

  1. Open an SSH tunnel. If you are using Windows, you can do it with PuTTy. On Linux, you can use ssh$ ssh -D 1337 -q -C -N [email protected]
  2. Set the SOCKS proxy in the browser. On Firefox, open Options and type “SOCKS”, you’ll have only one option (in Network options) to choose, and then set localhost, 1337 (or whatever port you’ve chosen). Here are the instruction for Chrome (or you can use a plugin)
  3. Open the Kibana URL in the browser. Note that now all your browser traffic will go through your VPC, so depending on the VPC configuration other websites might not work.

That’s it, a quick tip that might potentially save a lot of time trying to get a VPC connection to work.

The post Connecting to Kibana Within an AWS VPC appeared first on Bozho's tech blog.

Elasticsearch – Scalability and Multitenancy [slides]

Post Syndicated from Bozho original https://techblog.bozho.net/elasticsearch-scalability-and-multitenancy-slides/

Last week I gave a talk in a local tech group about my experience with Elasticsearch at LogSentinel, and how we achieve multitenancy and scalability.

Obviously, the topic of scalability is huge and it can’t be fully covered in 45 minutes, but I tried presenting the main aspects from the application perspective (I entirely skipped the Ops perspective, as it was a developer audience). The list of resources at the end of the slides show some of the sources of my “research” on the topic, which I recommend going through.

Below are the slides (the talk was not in English):

I hope it’s a useful intro to the topic and the main conclusion is – it’s counterintuitive if you are used to relational databases, and some internals (shards, Lucene segments) “leak” through the abstractions to influence the application design (as per the law of leaky abstractions).

The post Elasticsearch – Scalability and Multitenancy [slides] appeared first on Bozho's tech blog.

Masking field values with Amazon Elasticsearch Service

Post Syndicated from Prashant Agrawal original https://aws.amazon.com/blogs/security/masking-field-values-with-amazon-elasticsearch-service/

Amazon Elasticsearch Service (Amazon ES) is a fully managed service that you can use to deploy, secure, and run Elasticsearch cost-effectively at scale. The service provides support for open-source Elasticsearch APIs, managed Kibana, and integration with Logstash and other AWS services. Amazon ES provides a deep security model that spans many layers of interaction and supports fine-grained access control at the cluster, index, document, and field level, on a per-user basis. The service’s security plugin integrates with federated identity providers for Kibana login.

A common use case for Amazon ES is log analytics. Customers configure their applications to store log data to the Elasticsearch cluster, where the data can be queried for insights into the functionality and use of the applications over time. In many cases, users reviewing those insights should not have access to all the details from the log data. The log data for a web application, for example, might include the source IP addresses of incoming requests. Privacy rules in many countries require that those details be masked, wholly or in part. This post explains how to set up field masking within your Amazon ES domain.

Field masking is an alternative to field-level security that lets you anonymize the data in a field rather than remove it altogether. When creating a role, add a list of fields to mask. Field masking affects whether you can see the contents of a field when you search. You can use field masking to either perform a random hash or pattern-based substitution of sensitive information from users, who shouldn’t have access to that information.

When you use field masking, Amazon ES creates a hash of the actual field values before returning the search results. You can apply field masking on a per-role basis, supporting different levels of visibility depending on the identity of the user making the query. Currently, field masking is only available for string-based fields. A search result with a masked field (clientIP) looks like this:

  "_index": "web_logs",
  "_type": "_doc",
  "_id": "1",
  "_score": 1,
  "_source": {
    "agent": "Mozilla/5.0 (X11; Linux x86_64; rv:6.0a1) Gecko/20110421 Firefox/6.0a1",
    "bytes": 0,
    "clientIP": "7e4df8d4df7086ee9c05efe1e21cce8ff017a711ee9addf1155608ca45d38219",
    "host": "www.example.com",
    "extension": "txt",
    "geo": {
      "src": "EG",
      "dest": "CN",
      "coordinates": {
        "lat": 35.98531194,
        "lon": -85.80931806
    "machine": {
      "ram": 17179869184,
      "os": "win 7"

To follow along in this post, make sure you have an Amazon ES domain with Elasticsearch version 6.7 or higher, sample data loaded (this example uses the web logs data supplied by Kibana), and access to Kibana through a role with administrator privileges for the domain.

Configure field masking

Field masking is managed by defining specific access controls within the Kibana visualization system. You’ll need to create a new Kibana role, define the fine-grained access-control privileges for that role, specify which fields to mask, and apply that role to specific users.

You can use either the Kibana console or direct-to-API calls to set up field masking. In our first example, we’ll use the Kibana console.

To configure field masking in the Kibana console

  1. Log in to Kibana, choose the Security pane, and then choose Roles, as shown in Figure 1.

    Figure 1: Choose security roles

    Figure 1: Choose security roles

  2. Choose the plus sign (+) to create a new role, as shown in Figure 2.

    Figure 2: Create role

    Figure 2: Create role

  3. Choose the Index Permissions tab, and then choose Add index permissions, as shown in Figure 3.

    Figure 3: Set index permissions

    Figure 3: Set index permissions

  4. Add index patterns and appropriate permissions for data access. See the Amazon ES documentation for details on configuring fine-grained access control.
  5. Once you’ve set Index Patterns, Permissions: Action Groups, Document Level Security Query, and Include or exclude fields, you can use the Anonymize fields entry to mask the clientIP, as shown in Figure 4.

    Figure 4: Anonymize field

    Figure 4: Anonymize field

  6. Choose Save Role Definition.
  7. Next, you need to create one or more users and apply the role to the new users. Go back to the Security page and choose Internal User Database, as shown in Figure 5.

    Figure 5: Select Internal User Database

    Figure 5: Select Internal User Database

  8. Choose the plus sign (+) to create a new user, as shown in Figure 6.

    Figure 6: Create user

    Figure 6: Create user

  9. Add a username and password, and under Open Distro Security Roles, select the role es-mask-role, as shown in Figure 7.

    Figure 7: Select the username, password, and roles

    Figure 7: Select the username, password, and roles

  10. Choose Submit.

If you prefer, you can perform the same task by using the Amazon ES REST API using Kibana dev tools.

Use the following API to create a role as described in below snippet and shown in Figure 8.

PUT _opendistro/_security/api/roles/es-mask-role
  "cluster_permissions": [],
  "index_permissions": [
      "index_patterns": [
      "dls": "",
      "fls": [],
      "masked_fields": [
      "allowed_actions": [

Sample response:

  "status": "CREATED",
  "message": "'es-mask-role' created."
Figure 8: API to create Role

Figure 8: API to create Role

Use the following API to create a user with the role as described in below snippet and shown in Figure 9.

PUT _opendistro/_security/api/internalusers/es-mask-user
  "password": "xxxxxxxxxxx",
  "opendistro_security_roles": [

Sample response:

  "status": "CREATED",
  "message": "'es-mask-user' created."
Figure 9: API to create User

Figure 9: API to create User

Verify field masking

You can verify field masking by running a simple search query using Kibana dev tools (GET web_logs/_search) and retrieving the data first by using the kibana_user (with no field masking), and then by using the es-mask-user (with field masking) you just created.

Query responses run by the kibana_user (all access) have the original values in all fields, as shown in Figure 10.

Figure 10: Retrieval of the full clientIP data with kibana_user

Figure 10: Retrieval of the full clientIP data with kibana_user

Figure 11, following, shows an example of what you would see if you logged in as the es-mask-user. In this case, the clientIP field is hidden due to the es-mask-role you created.

Figure 11: Retrieval of the masked clientIP data with es-mask-user

Figure 11: Retrieval of the masked clientIP data with es-mask-user

Use pattern-based field masking

Rather than creating a hash, you can use one or more regular expressions and replacement strings to mask a field. The syntax is <field>::/<regular-expression>/::<replacement-string>.

You can use either the Kibana console or direct-to-API calls to set up pattern-based field masking. In the following example, clientIP is masked in such a way that the last three parts of the IP address are masked by xxx using the pattern is clientIP::/[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}$/::xxx.xxx.xxx>. You see only the first part of the IP address, as shown in Figure 12.

Figure 12: Anonymize the field with a pattern

Figure 12: Anonymize the field with a pattern

Run the search query to verify that the last three parts of clientIP are masked by custom characters and only the first part is shown to the requester, as shown in Figure 13.

Figure 13: Retrieval of the masked clientIP (according to the defined pattern) with es-mask-user

Figure 13: Retrieval of the masked clientIP (according to the defined pattern) with es-mask-user


Field level security should be the primary approach for ensuring data access security – however if there are specific business requirements that cannot be met with this approach, then field masking may offer a viable alternative. By using field masking, you can selectively allow or prevent your users from seeing private information such as personally identifying information (PII) or personal healthcare information (PHI). For more information about fine-grained access control, see the Amazon Elasticsearch Service Developer Guide.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the Amazon Elasticsearch Service forum or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.


Prashant Agrawal

Prashant is a Search Specialist Solutions Architect with Amazon Elasticsearch Service. He works closely with team members to help customers migrate their workloads to the cloud. Before joining AWS, he helped various customers use Elasticsearch for their search and analytics use cases.

ElasticSearch Multitenancy With Routing

Post Syndicated from Bozho original https://techblog.bozho.net/elasticsearch-multitenancy-with-routing/

Elasticsearch is great, but optimizing it for high load is always tricky. This won’t be yet another “Tips and tricks for optimizing Elasticsearch” article – there are many great ones out there. I’m going to focus on one narrow use-case – multitenant systems, i.e. those that support multiple customers/users (tenants).

You can build a multitenant search engine in three different ways:

  • Cluster per tenant – this is the hardest to manage and requires a lot of devops automation. Depending on the types of customers it may be worth it to completely isolate them, but that’s rarely the case
  • Index per tenant – this can be fine initially, and requires little additional coding (you just parameterize the “index” parameter in the URL of the queries), but it’s likely to cause problems as the customer base grows. Also, supporting consistent mappings and settings across indexes may be trickier than it sounds (e.g. some may reject an update and others may not depending on what’s indexed). Moving data to colder indexes also becomes more complex.
  • Tenant-based routing – this means you put everything in one cluster but you configure your search routing to be tenant-specific, which allows you to logically isolate data within a single index.

The last one seems to be the preferred option in general. What is routing? The Elasticsearch blog has a good overview and documentation. The idea lies in the way Elasticsearch handles indexing and searching – it splits data into shards (each shard is a separate Lucene index and can be replicated on more than one node). A shard is a logical grouping within a single Elasticsearch node. When no custom routing is used, and an index request comes, the ID is used to determine which shard is going to be used to store the data. However, during search, Elasticsearch doesn’t know which shards have the data, so it has ask multiple shards and gather the results. Related to that, there’s the newly introduced adaptive replica selection, where the proper shard replica is selected intelligently, rather than using round-robin.

Custom routing allows you to specify a routing value when indexing a document and then a search can be directed only to the shard that has the same routing value. For example, at LogSentinel when we index a log entry, we use the data source id (applicationId) for routing. So each application (data source) that generates logs has a separate identifier which allows us to query only that data source’s shard. That way, even though we may have a thousand clients with a hundred data sources each, a query will be precisely targeted to where the data for that particular customer’s data source lies.

This is key for horizontally scaling multitenant applications. When there’s terabytes of data and billions of documents, many shards will be needed (in order to avoid large and heavy shards that cause performance issues). Finding data in this haystack requires the ability to know where to look.

Note that you can (and probably should) make routing required in these cases – each indexed document must be required to have a routing key, otherwise an implementation oversight may lead to a slow index.

Using custom routing you are practically turning one large Elasticsearch cluster into smaller sections, logically separated based on meaningful identifiers. In our case, it is not a userId/customerId, but one level deeper – there are multiple shards per customer, but depending on the use-case, it can be one shard per customer, using the userId/customerId. Using more than one shard per customer may complicate things a little – for example having too many shards per customer may require searches that span too many shards, but that’s not necessarily worse than not using routing.

There are some caveats – the isolation of customer data has to be handled in the application layer (whereas for the first two approaches data is segregated operationally). If there’s an application bug or lack of proper access checks, one user can query data from other users’ shards by specifying their routing key. It’s the role of the application in front of Elasticsearch to only allow queries with routing keys belonging to the currently authenticated user.

There are cases when the first two approaches to multitenancy are viable (e.g. a few very large customers), but in general the routing approach is the most scalable one.

The post ElasticSearch Multitenancy With Routing appeared first on Bozho's tech blog.

Using Random Cut Forests for real-time anomaly detection in Amazon Elasticsearch Service

Post Syndicated from Chris Swierczewski original https://aws.amazon.com/blogs/big-data/using-random-cut-forests-for-real-time-anomaly-detection-in-amazon-elasticsearch-service/

Anomaly detection is a rich field of machine learning. Many mathematical and statistical techniques have been used to discover outliers in data, and as a result, many algorithms have been developed for performing anomaly detection in a computational setting. In this post, we take a close look at the output and accuracy of the anomaly detection feature available in Amazon Elasticsearch Service and Open Distro for Elasticsearch, and provide insight as to why we chose Random Cut Forests (RCF) as the core anomaly detection algorithm. In particular, we:

  • Discuss the goals of anomaly detection
  • Share how to use the RCF algorithm to detect anomalies and why we chose RCF for this tool
  • Interpret the output of anomaly detection for Elasticsearch
  • Compare the results of the anomaly detector to commonly used methods

What is anomaly detection?

Human beings have excellent intuition and can detect when something is out of order. Often, an anomaly or outlier can appear so obvious you just “know it when you see it.” However, you can’t base computational approaches to anomaly detection on such intuition; you must found them on mathematical definitions of anomalies.

The mathematical definition of an anomaly is varied and typically addresses the notion of separation from normal observation. This separation can manifest in several ways via multiple definitions. One common definition is “a data point lying in a low-density region.” As you track a data source, such as total bytes transferred from a particular IP address, number of logins on a given website, or number of sales per minute of a particular product, the raw values describe some probability or density distribution. A high-density region in this value distribution is an area of the domain where a data point is highly likely to exist. A low-density region is where data tends not to appear. For more information, see Anomaly detection: A survey.

For example, the following image shows two-dimensional data with a contour map indicating the density of the data in that region.

The data point in the bottom-right corner of the image occurs in a low-density region and, therefore, is considered anomalous. This doesn’t necessarily mean that an anomaly is something bad. Rather, under this definition, you can describe an anomaly as behavior that rarely occurs or is outside the normal scope of behavior.

Random Cut Forests and anomaly thresholding

The algorithmic core of the anomaly detection feature consists of two main components:

  • A RCF model for estimating the density of an input data stream
  • A thresholding model for determining if a point should be labeled as anomalous

You can use the RCF algorithm to summarize a data stream, including efficiently estimating its data density, and convert the data into anomaly scores. Anomaly scores are positive real numbers such that the larger the number, the more anomalous the data point. For more information, see Real Time Anomaly Detection in Open Distro for Elasticsearch.

We chose RCF for this plugin for several reasons:

  • Streaming context – Elasticsearch feature queries are streaming, in that the anomaly detector only receives each new feature aggregate one at a time.
  • Expensive queries – Especially on a large cluster, each feature query may be costly in CPU and memory resources. This limits the amount of historical data we can obtain for model training and initialization.
  • Customer hardware – Our anomaly detection plugin runs on the same hardware as our customers’ Elasticsearch cluster. Therefore, we must be mindful of our plugin’s CPU and memory impact.
  • Scalable – It is preferred if you can distribute the work required to determine anomalous data across the nodes in the cluster.
  • Unlabeled data – Even for training purposes, we don’t have access to labeled data. Therefore, the algorithm must be unsupervised.

Based on these constraints and performance results from internal and publicly available benchmarks across many data domains, we chose the RCF algorithm for computing anomaly scores in data streams.

But this begs the question: How large of an anomaly score is large enough to declare the corresponding data point as an anomaly? The anomaly detector uses a thresholding model to answer this question. This thresholding model combines information from the anomaly scores observed thus far and certain mathematical properties of RCFs. This hybrid information approach allows the model to make anomaly predictions with a low false positive rate when relatively little data has been observed, and effectively adapts to the data in the long run. The model constructs an efficient sketch of the anomaly score distribution using the KLL Quantile Sketch algorithm. For more information, see Optimal Quantile Approximation in Streams.

Understanding the output

The anomaly detector outputs two values: an anomaly grade and a confidence score. The anomaly grade is a measurement of the severity of an anomaly on a scale from zero to one. A zero anomaly grade indicates that the corresponding data point is normal. Any non-zero grade means that the anomaly score output by RCF exceeds the calculated score threshold, and therefore indicates the presence of an anomaly. Using the mathematical definition introduced at the beginning of this post, the grade is inversely related to the anomaly’s density; that is, the rarer the event, the higher the corresponding anomaly grade.

The confidence score is a measurement of the probability that the anomaly detection model correctly reports an anomaly within the algorithm’s inherent error bounds. We derive the model confidence from three sources:

  • A statistical measurement of whether the RCF model has observed enough data. As the RCF model observes more data, this source of confidence approaches 100%.
  • A confidence upper bound comes from the approximation made by the distribution sketch in the thresholding model. The KLL algorithm can only predict the score threshold within particular error bounds with a certain probability.
  • A confidence measurement attributed to each node of the Elasticsearch cluster. If a node is lost, the corresponding model data is also lost, which leads to a temporary confidence drop.

The NYC Taxi dataset

We demonstrate the effectiveness of the new anomaly detector feature on the New York City taxi passenger dataset. This data contains 6 months of taxi ridership volume from New York City aggregated into 30-minute windows. Thankfully, the dataset comes with labels in the form of anomaly windows, which indicate a period of time when an anomalous event is known to occur. Example known events in this dataset are the New York City marathon, where taxi ridership uncharacteristically spiked shortly after the event ended, and the January 2015 North American blizzard, when at one point the city ordered all non-essential vehicles off the streets, which resulted in a significant drop in taxi ridership.

We compare the results of our anomaly detector to two common approaches for detecting anomalies: a rules-based approach and the Gaussian distribution method.

Rules-based approach

In a rules-based approach, you mark a data point as anomalous if it exceeds a preset, human-specified boundary. This approach requires significant domain knowledge of the incoming data and can break down if the data has any upward or downward trends.

The following graph is a plot of the NYC taxi dataset with known anomalous event periods indicated by shaded regions. The model’s anomaly detection output is shown in red below the taxi ridership values. A set of human labelers received the first month of data (1,500 data points) to define anomaly detection rules. Half of the participants responded by stating they didn’t have sufficient information to confidently define such rules. From the remaining responses, the consolidated rules for anomalies are either that the value is equal to or greater than 30,000, or the value is below 20,000 for 150 points (about three days).

In this use case, the human annotators do a good enough job in this particular range of data. However, this approach to anomaly detection doesn’t scale well and may require a large amount of training data before a human can set reasonable thresholds that don’t suffer from a high false positive or high false negative rate. Additionally, as mentioned earlier, if this data develops an upward or downward trend, our team of annotators needs to revisit these constant-value thresholds.

Gaussian distribution method

A second common approach is to fit a Gaussian distribution to the data and define an anomaly as any value that is three standard deviations away from the mean. To improve the model’s ability to adapt to new information, the distribution is typically fit on a sliding window of the observations. Here, we determine the mean and standard deviation from the 1,500 most recent data points and use these to make predictions on the current value. See the following graph.

The Gaussian model detects the clear ridership spike at the marathon but isn’t robust enough to capture the other anomalies. In general, such a model can’t capture certain kinds of temporal anomalies where, for example, there’s a sudden spike in the data or other change in behavior that is still within the normal range of values.

Anomaly detection tool

Finally, we look at the anomaly detection results from the anomaly detection tool. The taxi data is streamed into an RCF model that estimates the density of the data in real time. The RCF sends these anomaly scores to the thresholding model, which decides whether the corresponding data point is anomalous. If so, the model reports the severity of the anomaly in the anomaly grade. See the following graph.

Five out of seven of the known anomalous events are successfully detected with zero false positives. Furthermore, with our definition of anomaly grade, we can indicate which anomalies are more severe than others. For example, the NYC Marathon spike is much more severe than those of Labor Day and New Year’s Eve. Based on the definition of an anomaly in terms of data density, the behavior observed at the NYC Marathon lives in a very low-density region, whereas, by the time we see the New Year’s Eve spike, this kind of behavior is still rare but not as rare anymore.


In this post, you learned about the goals of anomaly detection and explored the details of the model and output of the anomaly detection feature, now available in Amazon ES and Open Distro for Elasticsearch. We also compared the results of the anomaly detection tool to two common models and observed considerable performance improvement.

About the Authors

Chris Swierczewski is an applied scientist in Amazon AI. He enjoys hiking and backpacking with his family.






Lai Jiang is a software engineer working on machine learning and Elasticsearch at Amazon Web Services. His primary interests are algorithms and math. He is an active contributor to Open Distro for Elasticsearch.




Best practices for configuring your Amazon Elasticsearch Service domain

Post Syndicated from Jon Handler original https://aws.amazon.com/blogs/big-data/best-practices-for-configuring-your-amazon-elasticsearch-service-domain/

Amazon Elasticsearch Service (Amazon ES) is a fully managed service that makes it easy to deploy, secure, scale, and monitor your Elasticsearch cluster in the AWS Cloud. Elasticsearch is a distributed database solution, which can be difficult to plan for and execute. This post discusses some best practices for deploying Amazon ES domains.

The most important practice is to iterate. If you follow these best practices, you can plan for a baseline Amazon ES deployment. Elasticsearch behaves differently for every workload—its latency and throughput are largely determined by the request mix, the requests themselves, and the data or queries that you run. There is no deterministic rule that can 100% predict how your workload will behave. Plan for time to tune and refine your deployment, monitor your domain’s behavior, and adjust accordingly.

Deploying Amazon ES

Whether you deploy on the AWS Management Console, in AWS CloudFormation, or via Amazon ES APIs, you have a wealth of options to configure your domain’s hardware, high availability, and security features. This post covers best practices for choosing your data nodes and your dedicated master nodes configuration.

When you configure your Amazon ES domain, you choose the instance type and count for data and the dedicated master nodes. Elasticsearch is a distributed database that runs on a cluster of instances or nodes. These node types have different functions and require different sizing. Data nodes store the data in your indexes and process indexing and query requests. Dedicated master nodes don’t process these requests; they maintain the cluster state and orchestrate. This post focuses on instance types. For more information about instance sizing for data nodes, see Get started with Amazon Elasticsearch Service: T-shirt-size your domain. For more information about instance sizing for dedicated master nodes, see Get Started with Amazon Elasticsearch Service: Use Dedicated Master Instances to Improve Cluster Stability.

Amazon ES supports five instance classes: M, R, I, C, and T. As a best practice, use the latest generation instance type from each instance class. As of this writing, these are the M5, R5, I3, C5, and T2.

Choosing your instance type for data nodes

When choosing an instance type for your data nodes, bear in mind that these nodes carry all the data in your indexes (storage) and do all the processing for your requests (CPU). As a best practice, for heavy production workloads, choose the R5 or I3 instance type. If your emphasis is primarily on performance, the R5 typically delivers the best performance for log analytics workloads, and often for search workloads. The I3 instances are strong contenders and may suit your workload better, so you should test both. If your emphasis is on cost, the I3 instances have better cost efficiency at scale, especially if you choose to purchase reserved instances.

For an entry-level instance or a smaller workload, choose the M5s. The C5s are a specialized instance, relevant for heavy query use cases, which require more CPU work than disk or network. Use the T2 instances for development or QA workloads, but not for production. For more information about how many instances to choose, and a deeper analysis of the data handling footprint, see Get started with Amazon Elasticsearch Service: T-shirt-size your domain.

Choosing your instance type for dedicated master nodes

When choosing an instance type for your dedicated master nodes, keep in mind that these nodes are primarily CPU-bound, with some RAM and network demand as well. The C5 instances work best as dedicated masters up to about 75 data node clusters. Above that node count, you should choose R5.

Choosing Availability Zones

Amazon ES makes it easy to increase the availability of your cluster by using the Zone Awareness feature. You can choose to deploy your data and master nodes in one, two, or three Availability Zones. As a best practice, choose three Availability Zones for your production deployments.

When you choose more than one Availability Zone, Amazon ES deploys data nodes equally across the zones and makes sure that replicas go into different zones. Additionally, when you choose more than one Availability Zone, Amazon ES always deploys dedicated master nodes in three zones (if the Region supports three zones). Deploying into more than one Availability Zone gives your domain more stability and increases your availability.

Elasticsearch index and shard design

When you use Amazon ES, you send data to indexes in your cluster. An index is like a table in a relational database. Each search document is like a row, and each JSON field is like a column.

Amazon ES partitions your data into shards, with a random hash by default. You must configure the shard count, and you should use the best practices in this section.

Index patterns

For log analytics use cases, you want to control the life cycle of data in your cluster. You can do this with a rolling index pattern. Each day, you create a new index, then archive and delete the oldest index in the cluster. You define a retention period that controls how many days (indexes) of data you keep in the domain based on your analysis needs. For more information, see Index State Management.

Setting your shard counts

There are two types of shards: primary and replica. The primary shard count defines how many partitions of data Elasticsearch creates. The replica count specifies how many additional copies of the primary shards it creates. You set the primary shard count at index creation and you can’t change it (there are ways, but it’s not recommended to use the _shrink or _split API for clusters under load at scale). You also set the replica count at index creation, but you can change the replica count on the fly and Elasticsearch adjusts accordingly by creating or removing replicas.

You can set the primary and replica shard counts if you create the index manually, with a POST command. A better way for log analytics is to set an index template. See the following code:

PUT _template/<template_name>
    "index_patterns": ["logs-*"],
    "settings": {
        "number_of_shards": 10,
        "number_of_replicas": 1
    "mappings": {

When you set a template like this, every index that matches the index_pattern has the settings and the mapping (if you specify one) applied to that index. This gives you a convenient way of managing your shard strategy for rolling indexes. If you change your template, you get your new shard count in the next indexing cycle.

You should set the number_of_shards based on your source data size, using the following guideline: primary shard count = (daily source data in bytes * 1.25) / 50 GB.

For search use cases, where you’re not using rolling indexes, use 30 GB as the divisor, targeting 30 GB shards. However, these are guidelines. Always test with your own data, indexing, and queries to find your optimal shard size.

You should try to align your shard and instance counts so that your shards distribute equally across your nodes. You do this by adjusting shard counts or data node counts so that they are evenly divisible. For example, the default settings for Elasticsearch versions 6 and below are 5 primary shards and 1 replica (a total of 10 shards). You can get even distribution by choosing 2, 5, or 10 data nodes. Although it’s important to distribute your workload evenly on your data nodes, it’s not always possible to get every index deployed equally. Use the shard size as the primary guide for shard count and make small (< 20%) adjustments, generally favoring more instances or smaller shards, based on even distribution.

Determining storage size

So far, you’ve mapped out a shard count, based on the storage needed. Now you need to make sure that you have sufficient storage and CPU resources to process your requests. First, find your overall storage need: storage needed = (daily source data in bytes * 1.25) * (number_of_replicas + 1) * number of days retention.

You multiply your unreplicated index size by the number of replicas and days of retention to determine the total storage needed. Each replica adds an additional storage need equal to the primary storage size. You add this again for every day you want to retain data in the cluster. For search use cases, set the number of days of retention to 1.

The total storage need drives a minimum on the instance type and instance based on the maximum storage that instance provides. If you’re using EBS-backed instances like the M5 or R5, you can deploy EBS volumes up to the supported limit. For more information, see Amazon Elasticsearch Service Limits.

For instances with ephemeral store, storage is limited by the instance type (for example, I3.8xlarge.elasticsearch has 7.8 TB of attached storage). If you choose EBS, you should use the general purpose, GP2, volume type. Although the service does support the io1 volume type and provisioned IOPS, you generally don’t need them. Use provisioned IOPS only in special circumstances, when metrics support it.

Take the total storage needed and divide by the maximum storage per instance of your chosen instance type to get the minimum instance count.

After you have an instance type and count, make sure you have sufficient vCPUs to process your requests. Multiply the instance count by the vCPUs that instance provides. This gives you a total count of vCPUs in the cluster. As an initial scale point, make sure that your vCPU count is 1.5 times your active shard count. An active shard is any shard for an index that is receiving substantial writes. Use the primary shard count to determine active shards for indexes that are receiving substantial writes. For log analytics, only the current index is active. For search use cases, which are read heavy, use the primary shard count.

Although 1.5 is recommended, this is highly workload-dependent. Be sure to test and monitor CPU utilization and scale accordingly.

As you work with shard and instance counts, bear in mind that Amazon ES works best when the total shard count is as small as possible—fewer than 10,000 is a good soft limit. Each instance should also have no more than 25 shards total per GB of JVM heap on that instance. For example, the R5.xlarge has 32 GB of RAM total. The service allocates half the RAM (16 GB) for the heap (the maximum heap size for any instance is 31.5 GB). You should never have more than 400 = 16 * 25 shards on any node in that cluster.

Use case

Assume you have a log analytics workload supporting Apache web logs (500 GB/day) and syslogs (500 GB/day), retained for 7 days. This post focuses on the R5 instance type as the best choice for log analytics. You use a three-Availability Zone deployment, one primary and two replicas per index. With a three-zone deployment, you have to deploy nodes in multiples of three, which drives instance count and, to some extent, shard count.

The primary shard count for each index is (500 * 1.25) / 50 GB = 12.5 shards, which you round to 15. Using 15 primaries allows additional space to grow in each shard and is divisible by three (the number of Availability Zones, and therefore the number of instances, are a multiple of 3). The total storage needed is 1,000 * 1.25 * 3 * 7 = 26.25 TB. You can provide that storage with 18x R5.xlarge.elasticsearch, 9x R5.2xlarge.elasticsearch, or 6x R5.4xlarge.elasticsearch instances (based on EBS limits of 1.5 TB, 3 TB, and 6 TB, respectively). You should pick the 4xlarge instances, on the general guideline that vertical scaling is usually higher performance than horizontal scaling (there are many exceptions to this general rule, so make sure to iterate appropriately).

Having found a minimum deployment, you now need to validate the CPU count. Each index has 15 primary shards and 2 replicas, for a total of 45 shards. The most recent indexes receive substantial write, so each has 45 active shards, giving a total of 90 active shards. You ignore the other 6 days of indexes because they are infrequently accessed. For log analytics, you can assume that your read volume is always low and drops off as the data ages. Each R5.4xlarge.elasticsearch has 16 vCPUs, for a total of 96 in your cluster. The best practice guideline is 135 = 90 * 1.5 vCPUs needed. As a starting scale point, you need to increase to 9x R5.4xlarge.elasticsearch, with 144 vCPUs. Again, testing may reveal that you’re over-provisioned (which is likely), and you may be able to reduce to six. Finally, given your data node and shard counts, provision 3x C5.large.elasticsearch dedicated master nodes.


This post covered some of the core best practices for deploying your Amazon ES domain. These guidelines give you a reasonable estimate of the number and type of data nodes. Stay tuned for subsequent posts that cover best practices for deploying secure domains, monitoring your domain’s performance, and ingesting data into your domain.


About the Author

Jon Handler (@_searchgeek) is a Principal Solutions Architect at Amazon Web Services based in Palo Alto, CA. Jon works closely with the CloudSearch and Elasticsearch teams, providing help and guidance to a broad range of customers who have search workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included four years of coding a large-scale, eCommerce search engine.




An AWS Elasticsearch Post-Mortem

Post Syndicated from Bozho original https://techblog.bozho.net/aws-elasticsearch-post-mortem/

So it happened that we had a production issue on the SaaS version of LogSentinel – our Elasticsearch stopped indexing new data. There was no data loss, as elasticsearch is just a secondary storage, but it caused some issues for our customers (they could not see the real-time data on their dashboards). Below is a post-mortem analysis – what happened, why it happened, how we handled it and how we can prevent it.

Let me start with a background of how the system operates – we accept audit trail entries (logs) through a RESTful API (or syslog), and push them to a Kafka topic. Then the Kafka topic is consumed to store the data in the primary storage (Cassandra) and index it for better visualization and analysis in Elasticsearch. The managed AWS Elasticsearch service was chosen because it saves you all the overhead of cluster management, and as a startup we want to minimize our infrastructure management efforts. That’s a blessing and a curse, as we’ll see below.

We have alerting enabled on many elements, including the Elasticsearch storage space and the number of application errors in the log files. This allows us to respond quickly to issues. So the “high number of application errors” alarm triggered. Indexing was blocked due to FORBIDDEN/8/index write. We have a system call that enables it, so I tried to run it, but after less than a minute it was blocked again. This meant that our Kafka consumers failed to process the messages, which is fine, as we have a sufficient message retention period in Kafka, so no data can be lost.

I investigated the possible reasons for such a block. And there are two, according to Amazon – increased JVM memory pressure and low disk space. I checked the metrics and everything looked okay – JVM memory pressure was barely reaching 70% (and 75% is the threshold), and there was more than 200GiB free storage. There was only one WARN in the elasticsearch application logs (it was “node failure”, but after that there were no issues reported)

There was another strange aspect of the issue – there were twice as many nodes as configured. This usually happens during upgrades, as AWS is using blue/green deployment for Elasticsearch, but we haven’t done any upgrade recently. These additional nodes usually go away after a short period of time (after the redeployment/upgrade is ready), but they wouldn’t go away in this case.

Being unable to SSH into the actual machine, being unable to unblock the indexing through Elasticsearch means, and being unable to shut down or restart the nodes, I raised a ticket with support. And after a few ours and a few exchanged messages, the problem was clear and resolved.

The main reason for the issue is 2-fold. First, we had a configuration that didn’t reflect the cluster status – we had assumed a bit more nodes and our shared and replica configuration meant we have unassigned replicas (more on shards and replicas here and here). The best practice is to have nodes > number of replicas, so that each node gets one replica (plus the main shard). Having unassigned shard replicas is not bad per se, and there are legitimate cases for it. Our can probably be seen as misconfiguration, but not one with immediate negative effects. We chose those settings in part because it’s not possible to change some settings in AWS after a cluster is created. And opening and closing indexes is not supported.

The second issue is AWS Elasticsearch logic for calculating free storage in their circuit breaker that blocks indexing. So even though there were 200+ GiB free space on each of the existing nodes, AWS Elasticsearch thought we were out of space and blocked indexing. There was no way for us to see that, as we only see the available storage, not what AWS thinks is available. So, the calculation gets the total number of shards+replicas and multiplies it by the per-shard storage. Which means unassigned replicas that do not take actual space are calculated as if they take up space. That logic is counterintuitive (if not plain wrong), and there is hardly a way to predict it.

This logic appears to be triggered when blue/green deployment occurs – so in normal operation the actual remaining storage space is checked, but during upgrades, the shard-based check is triggered. That has blocked the entire cluster. But what triggered the blue/green deployment process?

We occasionally need access to Kibana, and because of our strict security rules it is not accessible to anyone by default. So we temporarily change the access policy to allow access from our office IP(s). This change is not expected to trigger a new deployment, and has never lead to that. AWS documentation, however, states:

In most cases, the following operations do not cause blue/green deployments: Changing access policy, Changing the automated snapshot hour, If your domain has dedicated master nodes, changing data instance count.
There are some exceptions. For example, if you haven’t reconfigured your domain since the launch of three Availability Zone support, Amazon ES might perform a one-time blue/green deployment to redistribute your dedicated master nodes across Availability Zones.

There are other exceptions, apparently, and one of them happened to us. That lead to the blue/green deployment, which in turn, because of our flawed configuration, triggered the index block based on the odd logic to assume unassigned replicas as taking up storage space.

How we fixed it – we recreated the index with fewer replicas and started a reindex (it takes data from the primary source and indexes it in batches). That reduced the size taken and AWS manually intervened to “unstuck” the blue/green deployment. Once the problem was known, the fix was easy (and we have to recreate the index anyway due to other index configuration changes). It’s appropriate to (once again) say how good AWS support is, in both fixing the issue and communicating it.

As I said in the beginning, this did not mean there’s data loss because we have Kafka keep the messages for a sufficient amount of time. However, once the index was writable, we expected the consumer to continue from the last successful message – we have specifically written transactional behaviour that committed the offsets only after successful storing in the primary storage and successful indexing. Unfortunately, the kafka client we are using had auto-commit turned on that we have overlooked. So the consumer has skipped past the failed messages. They are still in Kafka and we are processing them with a separate tool, but that showed us that our assumption was wrong and the fact that the code calls “commit” doesn’t actually mean something.

So, the morals of the story:

  • Monitor everything. Bad things happen, it’s good to learn about them quickly.
  • Check your production configuration and make sure it’s adequate to the current needs. Be it replicas, JVM sizes, disk space, number of retries, auto-scaling rules, etc.
  • Be careful with managed cloud services. They save a lot of effort but also take control away from you. And they may have issues for which your only choice is contacting support.
  • If providing managed services, make sure you show enough information about potential edge cases. An error console, an activity console, or something, that would allow the customer to know what is happening.
  • Validate your assumptions about default settings of your libraries. (Ideally, libraries should warn you if you are doing something not expected in the current state of configuration)
  • Make sure your application is fault-tolerant, i.e. that failure in one component doesn’t stop the world and doesn’t lead to data loss.

To sum it up, a rare event unexpectedly triggered a blue/green deployment, where a combination of flawed configuration and flawed free space calculation resulted in an unwritable cluster. Fortunately, no data is lost and at least I learned something.

The post An AWS Elasticsearch Post-Mortem appeared first on Bozho's tech blog.

Near Real-Time Indexing With ElasticSearch

Post Syndicated from Bozho original https://techblog.bozho.net/near-real-time-indexing-with-elasticsearch/

Choosing your indexing strategy is hard. The Elasticsearch documentation does have some general recommendations, and there are some tips from other companies, but it also depends on the particular usecase. In the typical scenario you have a database as the source of truth, and you have an index that makes things searchable. And you can have the following strategies:

  • Index as data comes – you insert in the database and index at the same time. It makes sense if there isn’t too much data; otherwise indexing becomes very inefficient.
  • Store in database, index with scheduled job – this is probably the most common approach and is also easy to implement. However, it can have issues if there’s a lot of data to index, as it has to be precisely fetched with (from, to) criteria from the database, and your index lags behind the actual data with the number of seconds (or minutes) between scheduled job runs
  • Push to a message queue and write an indexing consumer – you can run something like RabbitMQ and have multiple consumers that poll data and index it. This is not straightforward to implement because you have to poll multiple items in order to leverage batch indexing, and then only mark them as consumed upon successful batch execution – somewhat transactional behaviour.
  • Queue items in memory and flush them regularly – this may be good and efficient, but you may lose data if a node dies, so you have to have some sort of healthcheck based on the data in the database
  • Hybrid – do a combination of the above; for example if you need to enrich the raw data and update the index at a later stage, you can queue items in memory and then use “store in database, index with scheduled job” to update the index and fill in any missing item. Or you can index as some parts of the data come, and use another strategy for the more active types of data

We have recently decided to implement the “queue in memory” approach (in combination with another one, as we have to do some scheduled post-processing anyway). And the first attempt was to use a class provided by the Elasticsearch client – the BulkProcessor. The logic is clear – accumulate index requests in memory and flush them to Elasticsearch in batches either if a certain limit is reached, or at a fixed time interval. So at most every X seconds and at most at every Y records there will be a batch index request. That achieves near real-time indexing without putting too much stress on Elasticsearch. It also allows multiple bulk indexing requests at the same time, as per Elasticsearch recommendations.

However, we are using the REST API (via Jest) which is not supported by the BulkProcessor. We tried to plug a REST indexing logic instead of the current native one, and although it almost worked, in the process we noticed something worrying – the internalAdd method, which gets invoked every time an index request is added to the bulk, is synchronized. Which means threads will block, waiting for each other to add stuff to the bulk. This sounded suboptimal and risky for production environments, so we went for a separate implementation. It can be seen here – ESBulkProcessor.

It allows for multiple threads to flush to Elasticsearch simultaneously, but only one thread (using a lock) to consume from the queue in order to form the batches. Since this is a fast operation, it’s fine to have it serialized. And not because the concurrent queue can’t handle multiple threads reading from it – it can; but reaching the condition for forming the bulk by multiple threads at the same time will result in several small batches rather than one big one, hence the need for only one consumer at a time. This is not a huge problem so the lock can be removed. But it’s important to note it’s not blocking.

This has been in production for a while now and doesn’t seem to have any issues. I will report any changes if there are such due to increased load or edge cases.

It’s important to reiterate the issue if this is the only indexing logic – your application node may fail and you may end up with missing data in Elasticsearch. We are not in that scenario, and I’m not sure which is the best approach to remedy it – be it to do a partial reindex of recent data in case of a failed server, or a batch process the checks if there aren’t mismatches between the database and the index. Of course, we should also say that you may not always have a database – sometimes Elasticsearch is all you have for data storage, and in that case some sort of queue persistence is needed.

The ultimate goal is to have a near real-time indexing as users will expect to see their data as soon as possible, while at the same time not overwhelming the Elasticsearch cluster.

The topic of “what’s the best way to index data” is huge and I hope I’ve clarified it at least a little bit and that our contribution makes sense for other scenarios as well.

The post Near Real-Time Indexing With ElasticSearch appeared first on Bozho's tech blog.

GraphQL Search Indexing

Post Syndicated from Netflix Technology Blog original https://medium.com/netflix-techblog/graphql-search-indexing-334c92e0d8d5?source=rss----2615bd06b42e---4

by Artem Shtatnov and Ravi Srinivas Ranganathan

Almost a year ago we described our learnings from adopting GraphQL on the Netflix Marketing Tech team. We have a lot more to share since then! There are plenty of existing resources describing how to express a search query in GraphQL and paginate the results. This post looks at the other side of search: how to index data and make it searchable. Specifically, how our team uses the relationships and schemas defined within GraphQL to automatically build and maintain a search database.

Marketing Tech at Netflix

Our goal is to promote Netflix’s content across the globe. Netflix has thousands of shows on the service, operates in over 190 countries, and supports around 30 languages. For each of these shows, countries, and languages we need to find the right creative that resonates with each potential viewer. Our team builds the tools to produce and distribute these marketing creatives at a global scale, powering 10s of billions of impressions every month!

Various creatives Marketing Tech supports

To enable our marketing stakeholders to manage these creatives, we need to pull together data that is spread across many services — GraphQL makes this aggregation easy.

As an example, our data is centered around a creative service to keep track of the creatives we build. Each creative is enhanced with more information on the show it promotes, and the show is further enhanced with its ranking across the world. Also, our marketing team can comment on the creative when adjustments are needed. There are many more relationships that we maintain, but we will focus on these few for the post.

GraphQL query before indexing

Challenges of Searching Decentralized Data

Displaying the data for one creative is helpful, but we have a lot of creatives to search across. If we produced only a few variations for each of the shows, languages, and countries Netflix supports, that would result in over 50 million total creatives. We needed a proper search solution.

The problem stems from the fact that we are trying to search data across multiple independent services that are loosely coupled. No single service has complete context into how the system works. Each service could potentially implement its own search database, but then we would still need an aggregator. This aggregator would need to perform more complex operations, such as searching for creatives by ranking even though the ranking data is stored two hops away in another service.

If we had a single database with all of the information in it, the search would be easy. We can write a couple join statements and where clauses: problem solved. Nevertheless, a single database has its own drawbacks, mainly, around limited flexibility in allowing teams to work independently and performance limitations at scale.

Another option would be to use a custom aggregation service that builds its own index of the data. This service would understand where each piece of data comes from, know how all of the data is connected, and be able to combine the data in a variety of ways. Apart from the indexing part, these characteristics perfectly describe the entity relationships in GraphQL.

Indexing the Data

Since we already use GraphQL, how can we leverage it to index our data? We can update our GraphQL query slightly to retrieve a single creative and all of its related data, then call that query once for each of the creatives in our database, indexing the results into Elasticsearch. By batching and parallelizing the requests to retrieve many creatives via a single query to the GraphQL server, we can optimize the index building process.

GraphQL query for indexing

Elasticsearch has a lot of customization options when indexing data, but in many cases the default settings give pretty good results. At a minimum, we extract all of the type definitions from the GraphQL query and map them to a schema for Elasticsearch to use.

The nice part about using a GraphQL query to generate the schema is that any existing clients relying on this data will get the same shape of data regardless of whether it comes from the GraphQL server or the search index directly.

Once our data is indexed, we can sort, group, and filter on arbitrary fields; provide typeahead suggestions to our users; display facets for quick filtering; and progressively load data to provide an infinite scroll experience. Best of all, our page can load much faster since everything is cached in Elasticsearch.

Keeping Everything Up To Date

Indexing the data once isn’t enough. We need to make sure that the index is always up to date. Our data changes constantly — marketing users make edits to creatives, our recommendation algorithm refreshes to give the latest title popularity rankings and so on. Luckily, we have Kafka events that are emitted each time a piece of data changes. The first step is to listen to those events and act accordingly.

When our indexer hears a change event it needs to find all the creatives that are affected and reindex them. For example, if a title ranking changes, we need to find the related show, then its corresponding creative, and reindex it. We could hardcode all of these rules, but we would need to keep these rules up to date as our data evolves and for each new index we build.

Fortunately, we can rely on GraphQL’s entity relationships to find exactly what needs to be reindexed. Our search indexer understands these relationships by accessing a shared GraphQL schema or using an introspection query to retrieve the schema.

Our GraphQL query represented as a tree

In our earlier example, the indexer can fan out one level from title ranking to show by automatically generating a query to GraphQL to find shows that are related to the changed title ranking. After that, it queries Elasticsearch using the show and title ranking data to find creatives that reference these values. It can reindex those creatives using the same pipeline used to index them in the first place. What makes this method so great is that after defining GraphQL schemas and resolvers once, there is no additional work to do. The graph has enough data to keep the search index up to date.

Inverted Graph Index

Let’s look a bit deeper into the three steps the search indexer conducts: fan out, search, and index. As an example, if the algorithm starts recommending show 80186799 in Finland, the indexer would generate a GraphQL query to find the immediate parent: the show that the algorithm data is referring to. Once it finds that this recommendation is for Stranger Things, it would use Elasticsearch’s inverted index to find all creatives with show Stranger Things or with the algorithm recommendation data. The creatives are updated via another call to GraphQL and reindexed back to Elasticsearch.

The fan out step is needed in cases where the vertex update causes new edges to be created. If our algorithm previously didn’t have enough data to rank Stranger Things in Finland, the search step alone would never find this data in our index. Also, the fan out step does not need to perform a full graph search. Since GraphQL resolvers are written to only rely on data from the immediate parent, any vertex change can only impact its own edges. The combination of the single graph traversal and searching via an inverted index allows us to greatly increase performance for more complex graphs.

The fanout + search pattern works with more complex graphs

The indexer currently reruns the same GraphQL query that we used to first build our index, but we can optimize this step by only retrieving changes from the parent of the changed vertex and below. We can also optimize by putting a queue in front of both the change listener and the reindexing step. These queues debounce, dedupe, and throttle tasks to better handle spikes in workload.

The overall performance of the search indexer is fairly good as well. Listening to Kafka events adds little latency, our fan out operations are really quick since we store foreign keys to identify the edges, and looking up data in an inverted index is fast as well. Even with minimal performance optimizations, we have seen median delays under 500ms. The great thing is that the search indexer runs in close to constant time after a change, and won’t slow down as the amount of data grows.

Periodic Indexing

We run a full indexing job when we define a new index or make breaking schema changes to an existing index.

In the latter case, we don’t want to entirely wipe out the old index until after verifying that the newly indexed data is correct. For this reason, we use aliases. Whenever we start an indexing job, the indexer always writes the data to a new index that is properly versioned. Additionally, the change events need to be dual written to the new index as it is being built, otherwise, some data will be lost. Once all documents have been indexed with no errors, we swap the alias from the currently active index to the newly built index.

In cases where we can’t fully rely on the change events or some of our data does not have a change event associated with it, we run a periodic job to fully reindex the data. As part of this regular reindexing job, we compare the new data being indexed with the data currently in our index. Keeping track of which fields changed can help alert us of bugs such as a change events not being emitted or hidden edges not modeled within GraphQL.

Initial Setup

We built all of this logic for indexing, communicating with GraphQL, and handling changes into a search indexer service. In order to set up the search indexer there are a few requirements:

  1. Kafka. The indexer needs to know when changes happen. We use Kafka to handle change events, but any system that can notify the indexer of a change in the data would be sufficient.
  2. GraphQL. To act on the change, we need a GraphQL server that supports introspection. The graph has two requirements. First, each vertex must have a unique ID to make it easily identifiable by the search step. Second, for fan out to work, edges in the graph must be bidirectional.
  3. Elasticsearch. The data needs to be stored in a search database for quick retrieval. We use Elasticsearch, but there are many other options as well.
  4. Search Indexer. Our indexer combines the three items above. It is configured with an endpoint to our GraphQL server, a connection to our search database, and mappings from our Kafka events to the vertices in the graph.
How our search indexer is wired up

Building a New Index

After the initial setup, defining a new index and keeping it up to date is easy:

  1. GraphQL Query. We need to define the GraphQL query that retrieves the data we want to index.
  2. That’s it.

Once the initial setup is complete, defining a GraphQL query is the only requirement for building a new index. We can define as many indices as needed, each having its own query. Optionally, since we want to reindex from scratch, we need to give the indexer a way to paginate through all of the data, or tell it to rely on the existing index to bootstrap itself. Also, if we need custom mappings for Elasticsearch, we would need to define the mappings to mirror the GraphQL query.

The GraphQL query defines the fields we want to index and allows the indexer to retrieve data for those fields. The relationships in GraphQL allow keeping the index up to date automatically.

Where the Index Fits

The output of the search indexer feeds into an Elasticsearch database, so we needed a way to utilize it. Before we indexed our data, our browser application would call our GraphQL server, asking it to aggregate all of the data, then we filtered it down on the client side.

Data flow before indexing

After indexing, the browser can now call Elasticsearch directly (or via a thin wrapper to add security and abstract away database complexities). This setup allows the browser to fully utilize the search functionality of Elasticsearch instead of performing searches on the client. Since the data is the same shape as the original GraphQL query, we can rely on the same auto-generated Typescript types and don’t need major code changes.

Data flow after indexing

One additional layer of abstraction we are considering, but haven’t implemented yet, is accessing Elasticsearch via GraphQL. The browser would continue to call the GraphQL server in the same way as before. The resolvers in GraphQL would call Elasticsearch directly if any search criteria are passed in. We can even implement the search indexer as middleware within our GraphQL server. It would enhance the schema for data that is indexed and intercept calls when searches need to be performed. This approach would turn search into a plugin that can be enable on any GraphQL server with minimal configuration.

Using GraphQL to abstract away Elasticsearch


Automatically indexing key queries on our graph has yielded tremendously positive results, but there are a few caveats to consider.

Just like with any graph, supernodes may cause problems. A supernode is a vertex in the graph that has a disproportionately large number of edges. Any changes that affect a supernode will force the indexer to reindex many documents, blocking other changes from being reindexed. The indexer needs to throttle any changes that affect too many documents to keep the queue open for smaller changes that only affect a single document.

The relationships defined in GraphQL are key to determining what to reindex if a change occurred. A hidden edge, an edge not defined fully by one of the two vertices it connects, can prevent some changes from being detected. For example, if we model the relationship between creatives and shows via a third table containing tuples of creative IDs and show IDs, that table would either need to be represented in the graph or its changes attributed to one of the vertices it connects.

By indexing data into a single store, we lose the ability to differentiate user specific aspects of the data. For example, Elasticsearch cannot store unread comment count per user for each of the creatives. As a workaround, we store the total comment count per creative in Elasticsearch, then on page load make an additional call to retrieve the unread counts for the creatives with comments.

Many UI applications practice a pattern of read after write, asking the server to provide the latest version of a document after changes are made. Since the indexing process is asynchronous to avoid bottlenecks, clients would no longer be able to retrieve the latest data from the index immediately after making a modification. On the other hand, since our indexer is constantly aware of all changes, we can expose a websocket connection to the client that notifies it when certain documents change.

The performance savings from indexing come primarily from the fact that this approach shifts the workload of aggregating and searching data from read time to write time. If the application exhibits substantially more writes than reads, indexing the data might create more of a performance hit.

The underlying assumption of indexing data is that you need robust search functionality, such as sorting, grouping, and filtering. If your application doesn’t need to search across data, but merely wants the performance benefits of caching, there are many other options available that can effectively cache GraphQL queries.

Finally, if you don’t already use GraphQL or your data is not distributed across multiple databases, there are plenty of ways to quickly perform searches. A few table joins in a relational database provide pretty good results. For larger scale, we’re building a similar graph-based solution that multiple teams across Netflix can leverage which also keeps the search index up to date in real time.

There are many other ways to search across data, each with its own pros and cons. The best thing about using GraphQL to build and maintain our search index is its flexibility, ease of implementation, and low maintenance. The graphical representation of data in GraphQL makes it extremely powerful, even for use cases we hadn’t originally imagined.

If you’ve made it this far and you’re also interested in joining the Netflix Marketing Technology team to help conquer our unique challenges, check out the open positions listed on our page. We’re hiring!

GraphQL Search Indexing was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

How to enable secure access to Kibana using AWS Single Sign-On

Post Syndicated from Remek Hetman original https://aws.amazon.com/blogs/security/how-to-enable-secure-access-to-kibana-using-aws-single-sign-on/

Amazon Elasticsearch Service (Amazon ES) is a fully managed service to search, analyze, and visualize data in real-time. The service offers integration with Kibana, an open-source data visualization and exploration tool that lets you perform log and time-series analytics and application monitoring.

Many enterprise customers who want to use these capabilities find it challenging to secure access to Kibana. Kibana users have direct access to data stored in Amazon ES—so it’s important that only authorized users have access to Kibana. Data stored in Amazon ES can also have different classifications. For example, you might have one domain that stores confidential data and another that stores public data. In this case, securing access requires you not only to prevent unauthorized users from accessing the data but also to grant different groups of users access to different data classifications.

In this post, I’ll show you how to secure access to Kibana through Amazon Single Sign-On (AWS SSO) so that only users authenticated to Microsoft Active Directory can access and visualize data stored in Amazon ES. AWS SSO uses standard identity federation via SAML similar to Microsoft ADFS or Ping Federation. AWS SSO integrates with AWS Managed Microsoft Active Directory or Active Directory hosted on-premises or EC2 Instance through AWS Active Directory Connector, which means that your employees can sign into the AWS SSO user portal using their existing corporate Active Directory credentials. In addition, I’ll show you how to map users between an Amazon ES domain and a specific Active Directory security group so that you can limit who has access to a given Amazon ES domain.

Prerequisites and assumptions

You need the following for this walkthrough:

Solution overview

The architecture diagram below illustrates how the solution will authenticate users into Kibana:

Figure 1: Architectural diagram

Figure 1: Architectural diagram

  1. The user requests accesses to Kibana
  2. Kibana sends an HTML form back to the browser with a SAML request for authentication from Cognito. The HTML form is automatically posted to Cognito. User is prompted to then select SSO and authentication request is passed to SSO.
  3. AWS SSO sends a challenge to the browser for credentials
  4. User logs in to AWS SSO. AWS SSO authenticates the user against AWS Directory Service. AWS Directory Service may in turn authenticate the user against an on premise Active Directory.
  5. AWS SSO sends a SAML response to the browser
  6. Browser POSTs the response to Cognito. Amazon Cognito validates the SAML response to verify that the user has been successfully authenticated and then passes the information back to Kibana.
  7. Access to Kibana and Elasticsearch is granted

Deployment and configuration

In this section, I’ll show you how to deploy and configure the security aspects described in the solution overview.

Amazon Cognito authentication for Kibana

First, I’m going to highlight some initial configuration settings for Amazon Cognito and Amazon ES. I’ll show you how to create a Cognito user pool, a user pool domain, and an identity pool, and then how to configure Kibana authentication under Elasticsearch. For each of the commands, remember to replace the placeholders with your own values.

If you need more details on how to set up Amazon Cognito authentication for Kibana, please refer to the service documentation.

  1. Create an Amazon Cognito user pool with the following command:

    aws cognito-idp create-user-pool –pool-name <pool name, for example “Kibana”>

    From the output, copy down the user pool id. You’ll need to provide it in a couple of places later in the process.

                    "CreationDate": 1541690691.411,
                    "EstimatedNumberOfUsers": 0,
                    "Id": "us-east-1_0azgJMX31",
                    "LambdaConfig": {}

  2. Create a user pool domain:

    aws cognito-idp create-user-pool-domain –domain <domain name>–user-pool-id <pool id created in step 1>

    The user pool domain name MUST be the same as your Amazon Elasticsearch domain name. If you receive an error that “domain already exists,” it means the name is already in use and you must choose a different name.

  3. Create your Amazon Cognito federated identities:

    aws cognito-identity create-identity-pool –identity-pool-name <identity pool name e.g. Kibana> –allow-unauthenticated-identities

    To make this command work, you have to temporally allow unauthenticated access by adding –allow-unauthenticated-identities. Unauthenticated access will be removed by Amazon Elasticsearch upon enabling Kibana authentication in the next step.

  4. Create an Amazon Elasticsearch domain. To do so, from the AWS Management Console, navigate to Amazon Elasticsearch and select Create a new domain.
    1. Make sure that value enter under “Elasticsearch domain name” match with the domain created under Cognito User Pool.
    2. Under Kibana authentication, complete the form with the following values, as shown in the screenshot:
      • For Cognito User Pool, enter the name of the pool you created in step one.
      • For Cognito Identity Pool, enter the identity you created in step three.
        Figure 2: Enter the identity you created in step three

        Figure 2: Enter the identity you created in step three

  5. Now you’re ready to assign IAM roles to your identity pool. Those roles will be saved with your identity pool and whenever Cognito receive a request to authorize a user, it will automatically utilize these roles
    1. From the AWS Management Console, go to Amazon Cognito and select Manage Identity Pools.
    2. Select the identity pool you created in step three.
    3. You should receive the following message: You have not specified roles for this identity pool. Click here to fix it. Follow the link.
      Figure 3: Follow the "Click here to fix it" link

      Figure 3: Follow the “Click here to fix it” link

    4. Under Edit identity pool, next to Unauthenticated role, select Create new role.
    5. Select Allow and save your changes.
    6. Next to Unauthenticated role, select Create new role.
    7. Select Allow and save your changes.
  6. Finally, modify the Amazon Elasticsearch access policy:
    1. From the AWS Management Console, go to AWS Identity and Access Management (IAM).
    2. Search for the authenticated role you created in step five and copy the role ARN.
    3. From the mangement console, go to Amazon Elasticsearch Service, and then select the domain you created in step four.
    4. Select Modify access policy and add the following policy (replace the ARN of the authenticated role and the domain ARN with your own values):
                          "Effect": "Allow",
                          "Principal": {
                              "AWS": "<ARN of Authenticated role>"
                          "Action": "es:ESHttp*",
                          "Resource": "<Domain ARN/*>"

      Note: For more information about the Amazon Elasticsearch Service access policy visit: https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/es-ac.html

Configuring AWS Single Sign-On

In this section, I’ll show you how to configure AWS Single Sign-On. In this solution, AWS SSO is used not only to integrate with Microsoft AD but also as a SAML 2.0 identity federation provider. SAML 2.0 is an industry standard used for securely exchanging SAML assertions that pass information about a user between a SAML authority (in this case, Microsoft AD), and a SAML consumer (in this case, Amazon Cognito).

Add Active Directory

  1. From the AWS Management Console, go to AWS Single Sign-On.
  2. If this is the first time you’re configuring AWS SSO, you’ll be asked to enable AWS SSO. Follow the prompt to do so.
  3. From the AWS SSO Dashboard, select Manage your directory.
    Figure 4: Select Manage your directory

    Figure 4: Select Manage your directory

  4. Under Directory, select Change directory.
    Figure 5: Select "Change directory"

    Figure 5: Select “Change directory”

  5. On the next screen, select Microsoft AD Directory, select the directory you created under AWS Directory Service as a part of prerequisites, and then select Next: Review.
    Figure 6: Select "Microsoft AD Directory" and then select the directory you created as a part of the prerequisites

    Figure 6: Select “Microsoft AD Directory” and then select the directory you created as a part of the prerequisites

  6. On the Review page, confirm that you want to switch from an AWS SSO directory to an AWS Directory Service directory, and then select Finish.
    1. Once setup is complete, select Proceed to the directory.

Add application

  1. From AWS SSO Dashboard, select Applications and then Add a new application. Select Add a custom SAML 2.0 application.
    Figure 7: Select "Application" and then "Add a new application"

    Figure 7: Select “Application” and then “Add a new application”

  2. Enter a display name for your application (for example, “Kibana”) and scroll down to Application metadata. Select the link that reads If you don’t have a metadata file, you can manually type your metadata values.
  3. Enter the following values, being sure to replace the placeholders with your own values:
    1. Application ACS URL: https://<Elasticsearch domain name>.auth..amazoncognito.com/saml2/idpresponse
    2. Application SAML audience: urn:amazon:cognito:sp:<user pool id>
  4. Select Save changes.
    Figure 8: Select "Save changes"

    Figure 8: Select “Save changes”

Add attribute mappings

Switch to the Attribute mappings tab and next to Subject, enter ${user:name} and select unspecified under Format as shown in the following screenshot. Click Save Changes.

Figure 9: Enter "${user:name}" and select "Unspecified"

Figure 9: Enter “${user:name}” and select “Unspecified”

For more information about attribute mappings visit: https://docs.aws.amazon.com/singlesignon/latest/userguide/attributemappingsconcept.html

Grant access to Kibana

To manage who has access to Kibana, switch to the Assigned users tab and select Assign users. Add individual users or groups.

Download SAML metadata

Next, you’ll need to download the Amazon SSO SAML metadata. The SAML metadata contains information such as SSO entity ID, public certificate, attributes schema, and other information that’s necessary for Cognito to federate with a SAML identity provider. To download the metadata .xml file, switch to the Configuration tab and select Download metadata file.

Figure 10: Select "Download metadata file"

Figure 10: Select “Download metadata file”

Adding an Amazon Cognito identity provider

The last step is to add the identity provider to the user pool.

  1. From the AWS Management Console, go to Amazon Cognito.
    1. Select Manage User Pools, and then select the user pool you created in the previous section.
    2. From the left side menu, under Federation, select Identity providers, and then select SAML.
    3. Select Select file, and then select the Amazon SSO metadata .xml file you downloaded in previous step.

      Figure 11: Select "Select file" and select the Amazon SSO metadata .xml file you downloaded in previous step

      Figure 11: Select “Select file” and then select the Amazon SSO metadata .xml file you downloaded in previous step

    5. Enter the provider name (for example, “AWS SSO”), and then select Create provider.
  2. From the left side menu, under App integration, select App client settings.
  3. Uncheck Cognito User Pool, check the name of provider you created in step one, and select Save Changes.
    Figure 12: Uncheck "Cognito User Pool"

    Figure 12: Uncheck “Cognito User Pool”

At this point, the configuration is finished. When you open the Kibana URL, you should be redirected to AWS SSO and asked to authenticate using your Active Directory credentials. Keep in mind that if the AWS Elasticsearch domain was created inside VPC, it won’t be accessible from the Internet but only within VPC.

Managing multiple Amazon ES domains

In scenarios where different users need access to different Amazon ES domains, the solution would be as follows for each Amazon ES domain:

  1. Create one Active Directory Security Group per Amazon ES domain
  2. Create an Amazon Cognito user pool for each domain
  3. Add new applications to AWS SSO and grant permission to corresponding security groups
  4. Assign users to the appropriate security group

Deleting domains that use Amazon Cognito Authentication for Kibana

To prevent domains that use Amazon Cognito authentication for Kibana from becoming stuck in a configuration state of “Processing,” it’s important that you delete Amazon ES domains before deleting their associated Amazon Cognito user pools and identity pools.


I’ve outlined an approach to securing access to Kibana by integrating Amazon Cognito with Amazon SSO and AWS Directory Services. This allows you to narrow the scope of users who haves access to each Amazon Elasticsearch domain by configuring separate applications in AWS SSO for each of the domains.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.


Remek Hetman

Remek is a Senior Cloud Infrastructure Architect with Amazon Web Services Professional Services. He works with AWS financial enterprise customers providing technical guidance and assistance for Infrastructure, Security, DevOps, and Big Data to help them make the best use of AWS services. Outside of work, he enjoys spending time actively, and pursuing his passion – astronomy.