Index as data comes – you insert in the database and index at the same time. It makes sense if there isn’t too much data; otherwise indexing becomes very inefficient.
Store in database, index with scheduled job – this is probably the most common approach and is also easy to implement. However, it can have issues if there’s a lot of data to index, as it has to be precisely fetched with (from, to) criteria from the database, and your index lags behind the actual data with the number of seconds (or minutes) between scheduled job runs
Push to a message queue and write an indexing consumer – you can run something like RabbitMQ and have multiple consumers that poll data and index it. This is not straightforward to implement because you have to poll multiple items in order to leverage batch indexing, and then only mark them as consumed upon successful batch execution – somewhat transactional behaviour.
Queue items in memory and flush them regularly – this may be good and efficient, but you may lose data if a node dies, so you have to have some sort of healthcheck based on the data in the database
Hybrid – do a combination of the above; for example if you need to enrich the raw data and update the index at a later stage, you can queue items in memory and then use “store in database, index with scheduled job” to update the index and fill in any missing item. Or you can index as some parts of the data come, and use another strategy for the more active types of data
We have recently decided to implement the “queue in memory” approach (in combination with another one, as we have to do some scheduled post-processing anyway). And the first attempt was to use a class provided by the Elasticsearch client – the BulkProcessor. The logic is clear – accumulate index requests in memory and flush them to Elasticsearch in batches either if a certain limit is reached, or at a fixed time interval. So at most every X seconds and at most at every Y records there will be a batch index request. That achieves near real-time indexing without putting too much stress on Elasticsearch. It also allows multiple bulk indexing requests at the same time, as per Elasticsearch recommendations.
However, we are using the REST API (via Jest) which is not supported by the BulkProcessor. We tried to plug a REST indexing logic instead of the current native one, and although it almost worked, in the process we noticed something worrying – the internalAdd method, which gets invoked every time an index request is added to the bulk, is synchronized. Which means threads will block, waiting for each other to add stuff to the bulk. This sounded suboptimal and risky for production environments, so we went for a separate implementation. It can be seen here – ESBulkProcessor.
It allows for multiple threads to flush to Elasticsearch simultaneously, but only one thread (using a lock) to consume from the queue in order to form the batches. Since this is a fast operation, it’s fine to have it serialized. And not because the concurrent queue can’t handle multiple threads reading from it – it can; but reaching the condition for forming the bulk by multiple threads at the same time will result in several small batches rather than one big one, hence the need for only one consumer at a time. This is not a huge problem so the lock can be removed. But it’s important to note it’s not blocking.
This has been in production for a while now and doesn’t seem to have any issues. I will report any changes if there are such due to increased load or edge cases.
It’s important to reiterate the issue if this is the only indexing logic – your application node may fail and you may end up with missing data in Elasticsearch. We are not in that scenario, and I’m not sure which is the best approach to remedy it – be it to do a partial reindex of recent data in case of a failed server, or a batch process the checks if there aren’t mismatches between the database and the index. Of course, we should also say that you may not always have a database – sometimes Elasticsearch is all you have for data storage, and in that case some sort of queue persistence is needed.
The ultimate goal is to have a near real-time indexing as users will expect to see their data as soon as possible, while at the same time not overwhelming the Elasticsearch cluster.
The topic of “what’s the best way to index data” is huge and I hope I’ve clarified it at least a little bit and that our contribution makes sense for other scenarios as well.
Almost a year ago we described our learnings from adopting GraphQL on the Netflix Marketing Tech team. We have a lot more to share since then! There are plenty of existing resources describing how to express a search query in GraphQL and paginate the results. This post looks at the other side of search: how to index data and make it searchable. Specifically, how our team uses the relationships and schemas defined within GraphQL to automatically build and maintain a search database.
Marketing Tech at Netflix
Our goal is to promote Netflix’s content across the globe. Netflix has thousands of shows on the service, operates in over 190 countries, and supports around 30 languages. For each of these shows, countries, and languages we need to find the right creative that resonates with each potential viewer. Our team builds the tools to produce and distribute these marketing creatives at a global scale, powering 10s of billions of impressions every month!
To enable our marketing stakeholders to manage these creatives, we need to pull together data that is spread across many services — GraphQL makes this aggregation easy.
As an example, our data is centered around a creative service to keep track of the creatives we build. Each creative is enhanced with more information on the show it promotes, and the show is further enhanced with its ranking across the world. Also, our marketing team can comment on the creative when adjustments are needed. There are many more relationships that we maintain, but we will focus on these few for the post.
Challenges of Searching Decentralized Data
Displaying the data for one creative is helpful, but we have a lot of creatives to search across. If we produced only a few variations for each of the shows, languages, and countries Netflix supports, that would result in over50 million total creatives. We needed a proper search solution.
The problem stems from the fact that we are trying to search data across multiple independent services that are loosely coupled. No single service has complete context into how the system works. Each service could potentially implement its own search database, but then we would still need an aggregator. This aggregator would need to perform more complex operations, such as searching for creatives by ranking even though the ranking data is stored two hops away in another service.
If we had a single database with all of the information in it, the search would be easy. We can write a couple join statements and where clauses: problem solved. Nevertheless, a single database has its own drawbacks, mainly, around limited flexibility in allowing teams to work independently and performance limitations at scale.
Another option would be to use a custom aggregation service that builds its own index of the data. This service would understand where each piece of data comes from, know how all of the data is connected, and be able to combine the data in a variety of ways. Apart from the indexing part, these characteristics perfectly describe the entity relationships in GraphQL.
Indexing the Data
Since we already use GraphQL, how can we leverage it to index our data? We can update our GraphQL query slightly to retrieve a single creative and all of its related data, then call that query once for each of the creatives in our database, indexing the results into Elasticsearch. By batching and parallelizing the requests to retrieve many creatives via a single query to the GraphQL server, we can optimize the index building process.
Elasticsearch has a lot of customization options when indexing data, but in many cases the default settings give pretty good results. At a minimum, we extract all of the type definitions from the GraphQL query and map them to a schema for Elasticsearch to use.
The nice part about using a GraphQL query to generate the schema is that any existing clients relying on this data will get the same shape of data regardless of whether it comes from the GraphQL server or the search index directly.
Once our data is indexed, we can sort, group, and filter on arbitrary fields; provide typeahead suggestions to our users; display facets for quick filtering; and progressively load data to provide an infinite scroll experience. Best of all, our page can load much faster since everything is cached in Elasticsearch.
Keeping Everything Up To Date
Indexing the data once isn’t enough. We need to make sure that the index is always up to date. Our data changes constantly — marketing users make edits to creatives, our recommendation algorithm refreshes to give the latest title popularity rankings and so on. Luckily, we have Kafka events that are emitted each time a piece of data changes. The first step is to listen to those events and act accordingly.
When our indexer hears a change event it needs to find all the creatives that are affected and reindex them. For example, if a title ranking changes, we need to find the related show, then its corresponding creative, and reindex it. We could hardcode all of these rules, but we would need to keep these rules up to date as our data evolves and for each new index we build.
Fortunately, we can rely on GraphQL’s entity relationships to find exactly what needs to be reindexed. Our search indexer understands these relationships by accessing a shared GraphQL schema or using an introspection query to retrieve the schema.
In our earlier example, the indexer can fan out one level from title ranking to show by automatically generating a query to GraphQL to find shows that are related to the changed title ranking. After that, it queries Elasticsearch using the show and title ranking data to find creatives that reference these values. It can reindex those creatives using the same pipeline used to index them in the first place. What makes this method so great is that after defining GraphQL schemas and resolvers once, there is no additional work to do. The graph has enough data to keep the search index up to date.
Inverted Graph Index
Let’s look a bit deeper into the three steps the search indexer conducts: fan out, search, and index. As an example, if the algorithm starts recommending show 80186799 in Finland, the indexer would generate a GraphQL query to find the immediate parent: the show that the algorithm data is referring to. Once it finds that this recommendation is for Stranger Things, it would use Elasticsearch’s inverted index to find all creatives with show Stranger Things or with the algorithm recommendation data. The creatives are updated via another call to GraphQL and reindexed back to Elasticsearch.
The fan out step is needed in cases where the vertex update causes new edges to be created. If our algorithm previously didn’t have enough data to rank Stranger Things in Finland, the search step alone would never find this data in our index. Also, the fan out step does not need to perform a full graph search. Since GraphQL resolvers are written to only rely on data from the immediate parent, any vertex change can only impact its own edges. The combination of the single graph traversal and searching via an inverted index allows us to greatly increase performance for more complex graphs.
The indexer currently reruns the same GraphQL query that we used to first build our index, but we can optimize this step by only retrieving changes from the parent of the changed vertex and below. We can also optimize by putting a queue in front of both the change listener and the reindexing step. These queues debounce, dedupe, and throttle tasks to better handle spikes in workload.
The overall performance of the search indexer is fairly good as well. Listening to Kafka events adds little latency, our fan out operations are really quick since we store foreign keys to identify the edges, and looking up data in an inverted index is fast as well. Even with minimal performance optimizations, we have seen median delays under 500ms. The great thing is that the search indexer runs in close to constant time after a change, and won’t slow down as the amount of data grows.
We run a full indexing job when we define a new index or make breaking schema changes to an existing index.
In the latter case, we don’t want to entirely wipe out the old index until after verifying that the newly indexed data is correct. For this reason, we use aliases. Whenever we start an indexing job, the indexer always writes the data to a new index that is properly versioned. Additionally, the change events need to be dual written to the new index as it is being built, otherwise, some data will be lost. Once all documents have been indexed with no errors, we swap the alias from the currently active index to the newly built index.
In cases where we can’t fully rely on the change events or some of our data does not have a change event associated with it, we run a periodic job to fully reindex the data. As part of this regular reindexing job, we compare the new data being indexed with the data currently in our index. Keeping track of which fields changed can help alert us of bugs such as a change events not being emitted or hidden edges not modeled within GraphQL.
We built all of this logic for indexing, communicating with GraphQL, and handling changes into a search indexer service. In order to set up the search indexer there are a few requirements:
Kafka. The indexer needs to know when changes happen. We use Kafka to handle change events, but any system that can notify the indexer of a change in the data would be sufficient.
GraphQL. To act on the change, we need a GraphQL server that supports introspection. The graph has two requirements. First, each vertex must have a unique ID to make it easily identifiable by the search step. Second, for fan out to work, edges in the graph must be bidirectional.
Elasticsearch. The data needs to be stored in a search database for quick retrieval. We use Elasticsearch, but there are many other options as well.
Search Indexer. Our indexer combines the three items above. It is configured with an endpoint to our GraphQL server, a connection to our search database, and mappings from our Kafka events to the vertices in the graph.
Building a New Index
After the initial setup, defining a new index and keeping it up to date is easy:
GraphQL Query. We need to define the GraphQL query that retrieves the data we want to index.
Once the initial setup is complete, defining a GraphQL query is the only requirement for building a new index. We can define as many indices as needed, each having its own query. Optionally, since we want to reindex from scratch, we need to give the indexer a way to paginate through all of the data, or tell it to rely on the existing index to bootstrap itself. Also, if we need custom mappings for Elasticsearch, we would need to define the mappings to mirror the GraphQL query.
The GraphQL query defines the fields we want to index and allows the indexer to retrieve data for those fields. The relationships in GraphQL allow keeping the index up to date automatically.
Where the Index Fits
The output of the search indexer feeds into an Elasticsearch database, so we needed a way to utilize it. Before we indexed our data, our browser application would call our GraphQL server, asking it to aggregate all of the data, then we filtered it down on the client side.
After indexing, the browser can now call Elasticsearch directly (or via a thin wrapper to add security and abstract away database complexities). This setup allows the browser to fully utilize the search functionality of Elasticsearch instead of performing searches on the client. Since the data is the same shape as the original GraphQL query, we can rely on the same auto-generated Typescript types and don’t need major code changes.
One additional layer of abstraction we are considering, but haven’t implemented yet, is accessing Elasticsearch via GraphQL. The browser would continue to call the GraphQL server in the same way as before. The resolvers in GraphQL would call Elasticsearch directly if any search criteria are passed in. We can even implement the search indexer as middleware within our GraphQL server. It would enhance the schema for data that is indexed and intercept calls when searches need to be performed. This approach would turn search into a plugin that can be enable on any GraphQL server with minimal configuration.
Automatically indexing key queries on our graph has yielded tremendously positive results, but there are a few caveats to consider.
Just like with any graph, supernodes may cause problems. A supernode is a vertex in the graph that has a disproportionately large number of edges. Any changes that affect a supernode will force the indexer to reindex many documents, blocking other changes from being reindexed. The indexer needs to throttle any changes that affect too many documents to keep the queue open for smaller changes that only affect a single document.
The relationships defined in GraphQL are key to determining what to reindex if a change occurred. A hidden edge, an edge not defined fully by one of the two vertices it connects, can prevent some changes from being detected. For example, if we model the relationship between creatives and shows via a third table containing tuples of creative IDs and show IDs, that table would either need to be represented in the graph or its changes attributed to one of the vertices it connects.
By indexing data into a single store, we lose the ability to differentiate user specific aspects of the data. For example, Elasticsearch cannot store unread comment count per user for each of the creatives. As a workaround, we store the total comment count per creative in Elasticsearch, then on page load make an additional call to retrieve the unread counts for the creatives with comments.
Many UI applications practice a pattern of read after write, asking the server to provide the latest version of a document after changes are made. Since the indexing process is asynchronous to avoid bottlenecks, clients would no longer be able to retrieve the latest data from the index immediately after making a modification. On the other hand, since our indexer is constantly aware of all changes, we can expose a websocket connection to the client that notifies it when certain documents change.
The performance savings from indexing come primarily from the fact that this approach shifts the workload of aggregating and searching data from read time to write time. If the application exhibits substantially more writes than reads, indexing the data might create more of a performance hit.
The underlying assumption of indexing data is that you need robust search functionality, such as sorting, grouping, and filtering. If your application doesn’t need to search across data, but merely wants the performance benefits of caching, there are many other options available that can effectively cache GraphQL queries.
Finally, if you don’t already use GraphQL or your data is not distributed across multiple databases, there are plenty of ways to quickly perform searches. A few table joins in a relational database provide pretty good results. For larger scale, we’re building a similar graph-based solution that multiple teams across Netflix can leverage which also keeps the search index up to date in real time.
There are many other ways to search across data, each with its own pros and cons. The best thing about using GraphQL to build and maintain our search index is its flexibility, ease of implementation, and low maintenance. The graphical representation of data in GraphQL makes it extremely powerful, even for use cases we hadn’t originally imagined.
If you’ve made it this far and you’re also interested in joining the Netflix Marketing Technology team to help conquer our unique challenges, check out the open positions listed on our page. We’re hiring!
Amazon Elasticsearch Service (Amazon ES) is a fully managed service to search, analyze, and visualize data in real-time. The service offers integration with Kibana, an open-source data visualization and exploration tool that lets you perform log and time-series analytics and application monitoring.
Many enterprise customers who want to use these capabilities find it challenging to secure access to Kibana. Kibana users have direct access to data stored in Amazon ES—so it’s important that only authorized users have access to Kibana. Data stored in Amazon ES can also have different classifications. For example, you might have one domain that stores confidential data and another that stores public data. In this case, securing access requires you not only to prevent unauthorized users from accessing the data but also to grant different groups of users access to different data classifications.
In this post, I’ll show you how to secure access to Kibana through Amazon Single Sign-On (AWS SSO) so that only users authenticated to Microsoft Active Directory can access and visualize data stored in Amazon ES. AWS SSO uses standard identity federation via SAML similar to Microsoft ADFS or Ping Federation. AWS SSO integrates with AWS Managed Microsoft Active Directory or Active Directory hosted on-premises or EC2 Instance through AWS Active Directory Connector, which means that your employees can sign into the AWS SSO user portal using their existing corporate Active Directory credentials. In addition, I’ll show you how to map users between an Amazon ES domain and a specific Active Directory security group so that you can limit who has access to a given Amazon ES domain.
Basic familiarity with Amazon Elasticsearch Service and Kibana.
The architecture diagram below illustrates how the solution will authenticate users into Kibana:
Figure 1: Architectural diagram
The user requests accesses to Kibana
Kibana sends an HTML form back to the browser with a SAML request for authentication from Cognito. The HTML form is automatically posted to Cognito. User is prompted to then select SSO and authentication request is passed to SSO.
AWS SSO sends a challenge to the browser for credentials
User logs in to AWS SSO. AWS SSO authenticates the user against AWS Directory Service. AWS Directory Service may in turn authenticate the user against an on premise Active Directory.
AWS SSO sends a SAML response to the browser
Browser POSTs the response to Cognito. Amazon Cognito validates the SAML response to verify that the user has been successfully authenticated and then passes the information back to Kibana.
Access to Kibana and Elasticsearch is granted
Deployment and configuration
In this section, I’ll show you how to deploy and configure the security aspects described in the solution overview.
Amazon Cognito authentication for Kibana
First, I’m going to highlight some initial configuration settings for Amazon Cognito and Amazon ES. I’ll show you how to create a Cognito user pool, a user pool domain, and an identity pool, and then how to configure Kibana authentication under Elasticsearch. For each of the commands, remember to replace the placeholders with your own values.
If you need more details on how to set up Amazon Cognito authentication for Kibana, please refer to the service documentation.
Create an Amazon Cognito user pool with the following command:
aws cognito-idp create-user-pool –pool-name <pool name, for example “Kibana”>
From the output, copy down the user pool id. You’ll need to provide it in a couple of places later in the process.
aws cognito-idp create-user-pool-domain –domain <domain name>–user-pool-id <pool id created in step 1>
The user pool domain name MUST be the same as your Amazon Elasticsearch domain name. If you receive an error that “domain already exists,” it means the name is already in use and you must choose a different name.
Create your Amazon Cognito federated identities:
aws cognito-identity create-identity-pool –identity-pool-name <identity pool name e.g. Kibana> –allow-unauthenticated-identities
To make this command work, you have to temporally allow unauthenticated access by adding –allow-unauthenticated-identities. Unauthenticated access will be removed by Amazon Elasticsearch upon enabling Kibana authentication in the next step.
Create an Amazon Elasticsearch domain. To do so, from the AWS Management Console, navigate to Amazon Elasticsearch and select Create a new domain.
Make sure that value enter under “Elasticsearch domain name” match with the domain created under Cognito User Pool.
Under Kibana authentication, complete the form with the following values, as shown in the screenshot:
For Cognito User Pool, enter the name of the pool you created in step one.
For Cognito Identity Pool, enter the identity you created in step three.
Figure 2: Enter the identity you created in step three
Now you’re ready to assign IAM roles to your identity pool. Those roles will be saved with your identity pool and whenever Cognito receive a request to authorize a user, it will automatically utilize these roles
From the AWS Management Console, go to Amazon Cognito and select Manage Identity Pools.
Select the identity pool you created in step three.
You should receive the following message: You have not specified roles for this identity pool. Click here to fix it. Follow the link.
Figure 3: Follow the “Click here to fix it” link
Under Edit identity pool, next to Unauthenticated role, select Create new role.
Select Allow and save your changes.
Next to Unauthenticated role, select Create new role.
Select Allow and save your changes.
Finally, modify the Amazon Elasticsearch access policy:
From the AWS Management Console, go to AWS Identity and Access Management (IAM).
Search for the authenticated role you created in step five and copy the role ARN.
From the mangement console, go to Amazon Elasticsearch Service, and then select the domain you created in step four.
Select Modify access policy and add the following policy (replace the ARN of the authenticated role and the domain ARN with your own values):
In this section, I’ll show you how to configure AWS Single Sign-On. In this solution, AWS SSO is used not only to integrate with Microsoft AD but also as a SAML 2.0 identity federation provider. SAML 2.0 is an industry standard used for securely exchanging SAML assertions that pass information about a user between a SAML authority (in this case, Microsoft AD), and a SAML consumer (in this case, Amazon Cognito).
Add Active Directory
From the AWS Management Console, go to AWS Single Sign-On.
If this is the first time you’re configuring AWS SSO, you’ll be asked to enable AWS SSO. Follow the prompt to do so.
From the AWS SSO Dashboard, select Manage your directory.
Figure 4: Select Manage your directory
Under Directory, select Change directory.
Figure 5: Select “Change directory”
On the next screen, select Microsoft AD Directory, select the directory you created under AWS Directory Service as a part of prerequisites, and then select Next: Review.
Figure 6: Select “Microsoft AD Directory” and then select the directory you created as a part of the prerequisites
On the Review page, confirm that you want to switch from an AWS SSO directory to an AWS Directory Service directory, and then select Finish.
Once setup is complete, select Proceed to the directory.
From AWS SSO Dashboard, select Applications and then Add a new application. Select Add a custom SAML 2.0 application.
Figure 7: Select “Application” and then “Add a new application”
Enter a display name for your application (for example, “Kibana”) and scroll down to Application metadata. Select the link that reads If you don’t have a metadata file, you can manually type your metadata values.
Enter the following values, being sure to replace the placeholders with your own values:
To manage who has access to Kibana, switch to the Assigned users tab and select Assign users. Add individual users or groups.
Download SAML metadata
Next, you’ll need to download the Amazon SSO SAML metadata. The SAML metadata contains information such as SSO entity ID, public certificate, attributes schema, and other information that’s necessary for Cognito to federate with a SAML identity provider. To download the metadata .xml file, switch to the Configuration tab and select Download metadata file.
Figure 10: Select “Download metadata file”
Adding an Amazon Cognito identity provider
The last step is to add the identity provider to the user pool.
From the AWS Management Console, go to Amazon Cognito.
Select Manage User Pools, and then select the user pool you created in the previous section.
From the left side menu, under Federation, select Identity providers, and then select SAML.
Select Select file, and then select the Amazon SSO metadata .xml file you downloaded in previous step.
Figure 11: Select “Select file” and then select the Amazon SSO metadata .xml file you downloaded in previous step
Enter the provider name (for example, “AWS SSO”), and then select Create provider.
From the left side menu, under App integration, select App client settings.
Uncheck Cognito User Pool, check the name of provider you created in step one, and select Save Changes.
Figure 12: Uncheck “Cognito User Pool”
At this point, the configuration is finished. When you open the Kibana URL, you should be redirected to AWS SSO and asked to authenticate using your Active Directory credentials. Keep in mind that if the AWS Elasticsearch domain was created inside VPC, it won’t be accessible from the Internet but only within VPC.
Managing multiple Amazon ES domains
In scenarios where different users need access to different Amazon ES domains, the solution would be as follows for each Amazon ES domain:
Create one Active Directory Security Group per Amazon ES domain
Create an Amazon Cognito user pool for each domain
Add new applications to AWS SSO and grant permission to corresponding security groups
Assign users to the appropriate security group
Deleting domains that use Amazon Cognito Authentication for Kibana
To prevent domains that use Amazon Cognito authentication for Kibana from becoming stuck in a configuration state of “Processing,” it’s important that you delete Amazon ES domains before deleting their associated Amazon Cognito user pools and identity pools.
I’ve outlined an approach to securing access to Kibana by integrating Amazon Cognito with Amazon SSO and AWS Directory Services. This allows you to narrow the scope of users who haves access to each Amazon Elasticsearch domain by configuring separate applications in AWS SSO for each of the domains.
Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.
The collective thoughts of the interwebz
The cookie settings on this website are set to "allow cookies" to give you the best browsing experience possible. If you continue to use this website without changing your cookie settings or you click "Accept" below then you are consenting to this.