Tag Archives: indexing

Near Real-Time Indexing With ElasticSearch

Post Syndicated from Bozho original https://techblog.bozho.net/near-real-time-indexing-with-elasticsearch/

Choosing your indexing strategy is hard. The Elasticsearch documentation does have some general recommendations, and there are some tips from other companies, but it also depends on the particular usecase. In the typical scenario you have a database as the source of truth, and you have an index that makes things searchable. And you can have the following strategies:

  • Index as data comes – you insert in the database and index at the same time. It makes sense if there isn’t too much data; otherwise indexing becomes very inefficient.
  • Store in database, index with scheduled job – this is probably the most common approach and is also easy to implement. However, it can have issues if there’s a lot of data to index, as it has to be precisely fetched with (from, to) criteria from the database, and your index lags behind the actual data with the number of seconds (or minutes) between scheduled job runs
  • Push to a message queue and write an indexing consumer – you can run something like RabbitMQ and have multiple consumers that poll data and index it. This is not straightforward to implement because you have to poll multiple items in order to leverage batch indexing, and then only mark them as consumed upon successful batch execution – somewhat transactional behaviour.
  • Queue items in memory and flush them regularly – this may be good and efficient, but you may lose data if a node dies, so you have to have some sort of healthcheck based on the data in the database
  • Hybrid – do a combination of the above; for example if you need to enrich the raw data and update the index at a later stage, you can queue items in memory and then use “store in database, index with scheduled job” to update the index and fill in any missing item. Or you can index as some parts of the data come, and use another strategy for the more active types of data

We have recently decided to implement the “queue in memory” approach (in combination with another one, as we have to do some scheduled post-processing anyway). And the first attempt was to use a class provided by the Elasticsearch client – the BulkProcessor. The logic is clear – accumulate index requests in memory and flush them to Elasticsearch in batches either if a certain limit is reached, or at a fixed time interval. So at most every X seconds and at most at every Y records there will be a batch index request. That achieves near real-time indexing without putting too much stress on Elasticsearch. It also allows multiple bulk indexing requests at the same time, as per Elasticsearch recommendations.

However, we are using the REST API (via Jest) which is not supported by the BulkProcessor. We tried to plug a REST indexing logic instead of the current native one, and although it almost worked, in the process we noticed something worrying – the internalAdd method, which gets invoked every time an index request is added to the bulk, is synchronized. Which means threads will block, waiting for each other to add stuff to the bulk. This sounded suboptimal and risky for production environments, so we went for a separate implementation. It can be seen here – ESBulkProcessor.

It allows for multiple threads to flush to Elasticsearch simultaneously, but only one thread (using a lock) to consume from the queue in order to form the batches. Since this is a fast operation, it’s fine to have it serialized. And not because the concurrent queue can’t handle multiple threads reading from it – it can; but reaching the condition for forming the bulk by multiple threads at the same time will result in several small batches rather than one big one, hence the need for only one consumer at a time. This is not a huge problem so the lock can be removed. But it’s important to note it’s not blocking.

This has been in production for a while now and doesn’t seem to have any issues. I will report any changes if there are such due to increased load or edge cases.

It’s important to reiterate the issue if this is the only indexing logic – your application node may fail and you may end up with missing data in Elasticsearch. We are not in that scenario, and I’m not sure which is the best approach to remedy it – be it to do a partial reindex of recent data in case of a failed server, or a batch process the checks if there aren’t mismatches between the database and the index. Of course, we should also say that you may not always have a database – sometimes Elasticsearch is all you have for data storage, and in that case some sort of queue persistence is needed.

The ultimate goal is to have a near real-time indexing as users will expect to see their data as soon as possible, while at the same time not overwhelming the Elasticsearch cluster.

The topic of “what’s the best way to index data” is huge and I hope I’ve clarified it at least a little bit and that our contribution makes sense for other scenarios as well.

The post Near Real-Time Indexing With ElasticSearch appeared first on Bozho's tech blog.

Amazon SageMaker Now Supports Additional Instance Types, Local Mode, Open Sourced Containers, MXNet and Tensorflow Updates

Post Syndicated from Randall Hunt original https://aws.amazon.com/blogs/aws/amazon-sagemaker-roundup-sf/

Amazon SageMaker continues to iterate quickly and release new features on behalf of customers. Starting today, SageMaker adds support for many new instance types, local testing with the SDK, and Apache MXNet 1.1.0 and Tensorflow 1.6.0. Let’s take a quick look at each of these updates.

New Instance Types

Amazon SageMaker customers now have additional options for right-sizing their workloads for notebooks, training, and hosting. Notebook instances now support almost all T2, M4, P2, and P3 instance types with the exception of t2.micro, t2.small, and m4.large instances. Model training now supports nearly all M4, M5, C4, C5, P2, and P3 instances with the exception of m4.large, c4.large, and c5.large instances. Finally, model hosting now supports nearly all T2, M4, M5, C4, C5, P2, and P3 instances with the exception of m4.large instances. Many customers can take advantage of the newest P3, C5, and M5 instances to get the best price/performance for their workloads. Customers also take advantage of the burstable compute model on T2 instances for endpoints or notebooks that are used less frequently.

Open Sourced Containers, Local Mode, and TensorFlow 1.6.0 and MXNet 1.1.0

Today Amazon SageMaker has open sourced the MXNet and Tensorflow deep learning containers that power the MXNet and Tensorflow estimators in the SageMaker SDK. The ability to write Python scripts that conform to simple interface is still one of my favorite SageMaker features and now those containers can be additionally customized to include any additional libraries. You can download these containers locally to iterate and experiment which can accelerate your debugging cycle. When you’re ready go from local testing to production training and hosting you just change one line of code.

These containers launch with support for Tensorflow 1.6.0 and MXNet 1.1.0 as well. Tensorflow has a number of new 1.6.0 features including support for CUDA 9.0, cuDNN 7, and AVX instructions which allows for significant speedups in many training applications. MXNet 1.1.0 adds a number of new features including a Text API mxnet.text with support for text processing, indexing, glossaries, and more. Two of the really cool pre-trained embeddings included are GloVe and fastText.
<

Available Now
All of the features mentioned above are available today. As always please let us know on Twitter or in the comments below if you have any questions or if you’re building something interesting. Now, if you’ll excuse me I’m going to go experiment with some of those new MXNet APIs!

Randall