Tag Archives: indexing

Near Real-Time Indexing With ElasticSearch

Post Syndicated from Bozho original https://techblog.bozho.net/near-real-time-indexing-with-elasticsearch/

Choosing an indexing strategy is hard. The Elasticsearch documentation offers some general recommendations, and other companies have published tips, but the right choice also depends on the particular use case. In the typical scenario you have a database as the source of truth and an index that makes things searchable. You can then choose among the following strategies:

  • Index as data comes – you insert in the database and index at the same time. It makes sense if there isn’t too much data; otherwise indexing becomes very inefficient.
  • Store in database, index with scheduled job – this is probably the most common approach and is also easy to implement. However, it can have issues if there’s a lot of data to index, as each run has to fetch exactly the right rows from the database using (from, to) criteria, and your index lags behind the actual data by the number of seconds (or minutes) between scheduled job runs.
  • Push to a message queue and write an indexing consumer – you can run something like RabbitMQ and have multiple consumers that poll data and index it. This is not straightforward to implement because you have to poll multiple items in order to leverage batch indexing, and then only mark them as consumed upon successful batch execution – somewhat transactional behaviour.
  • Queue items in memory and flush them regularly – this may be good and efficient, but you may lose data if a node dies, so you need some sort of health check based on the data in the database.
  • Hybrid – do a combination of the above; for example, if you need to enrich the raw data and update the index at a later stage, you can queue items in memory and then use “store in database, index with scheduled job” to update the index and fill in any missing items. Or you can index some parts of the data as they come, and use another strategy for the more active types of data.

We have recently decided to implement the “queue in memory” approach (in combination with another one, as we have to do some scheduled post-processing anyway). The first attempt was to use a class provided by the Elasticsearch client, the BulkProcessor. The logic is clear: accumulate index requests in memory and flush them to Elasticsearch in batches, either when a certain limit is reached or at a fixed time interval. So a batch index request goes out whenever Y records have accumulated, and otherwise no later than X seconds after the previous one. That achieves near real-time indexing without putting too much stress on Elasticsearch. It also allows multiple bulk indexing requests at the same time, as per Elasticsearch recommendations.
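To make the “flush on size or on interval” rule concrete, here is a minimal single-threaded sketch. It is not the BulkProcessor itself: the class name BatchBuffer, its parameters, and the injected clock are all hypothetical names chosen for illustration, and the flush callback stands in for whatever actually sends the bulk request.

```python
import time
from typing import Callable, List


class BatchBuffer:
    """Hypothetical helper: collects items, flushes on a size or age limit."""

    def __init__(self, flush: Callable[[List[dict]], None],
                 max_items: int, max_age_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._flush = flush              # stand-in for "send one bulk request"
        self._max_items = max_items      # the "Y records" limit
        self._max_age = max_age_seconds  # the "X seconds" limit
        self._clock = clock              # injectable so the demo is deterministic
        self._items: List[dict] = []
        self._last_flush = clock()

    def add(self, item: dict) -> None:
        self._items.append(item)
        if len(self._items) >= self._max_items:
            self.flush()

    def tick(self) -> None:
        """Meant to be called periodically, e.g. from a timer thread."""
        if self._items and self._clock() - self._last_flush >= self._max_age:
            self.flush()

    def flush(self) -> None:
        batch, self._items = self._items, []
        self._last_flush = self._clock()
        if batch:
            self._flush(batch)


# Demo with a fake clock: four items, size limit of three, age limit of 5 s.
sent: List[List[dict]] = []
now = [0.0]
buffer = BatchBuffer(sent.append, max_items=3, max_age_seconds=5.0,
                     clock=lambda: now[0])
for i in range(4):
    buffer.add({"id": i})   # the third add triggers a size-based flush
now[0] = 6.0                # pretend six seconds have passed
buffer.tick()               # the leftover item is flushed by age
```

The sketch is deliberately not thread-safe; it only illustrates the two flush triggers.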

However, we are using the REST API (via Jest), which the BulkProcessor does not support. We tried to plug in REST indexing logic in place of the current native one, and although it almost worked, in the process we noticed something worrying: the internalAdd method, which gets invoked every time an index request is added to the bulk, is synchronized. That means threads will block, waiting for each other to add items to the bulk. This sounded suboptimal and risky for production environments, so we went for a separate implementation. It can be seen here – ESBulkProcessor.

It allows multiple threads to flush to Elasticsearch simultaneously, but only one thread at a time (guarded by a lock) to consume from the queue in order to form a batch. Since forming a batch is a fast operation, it’s fine to have it serialized. The reason is not that the concurrent queue can’t handle multiple threads reading from it; it can. But if several threads reach the condition for forming a bulk at the same time, the result is several small batches rather than one big one, hence the need for only one consumer at a time. Small batches are not a huge problem, so the lock could be removed. It’s important to note, though, that the lock is not blocking.
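The idea can be sketched as follows. This is an illustration under assumptions, not the ESBulkProcessor code: the class QueueingBulkIndexer and the send_bulk callback are hypothetical, the timed flush is left out, and Python's threading.Lock with acquire(blocking=False) plays the role of a non-blocking lock. Only the draining of the queue happens under the lock; sending the batch happens outside it, so several threads can send at once.

```python
import queue
import threading
from typing import Callable, List


class QueueingBulkIndexer:
    """Hypothetical sketch: one drainer at a time, concurrent senders."""

    def __init__(self, send_bulk: Callable[[List[dict]], None], batch_size: int):
        self._queue: "queue.Queue[dict]" = queue.Queue()  # thread-safe queue
        self._drain_lock = threading.Lock()
        self._send_bulk = send_bulk   # stand-in for the real bulk request
        self._batch_size = batch_size

    def add(self, doc: dict) -> None:
        self._queue.put(doc)
        if self._queue.qsize() >= self._batch_size:
            self.flush()

    def flush(self) -> bool:
        """Returns True if this call sent a batch."""
        # Non-blocking attempt: if another thread is forming a batch, give up
        # immediately instead of waiting for it.
        if not self._drain_lock.acquire(blocking=False):
            return False
        try:
            batch: List[dict] = []
            while len(batch) < self._batch_size:
                try:
                    batch.append(self._queue.get_nowait())
                except queue.Empty:
                    break
        finally:
            self._drain_lock.release()
        if batch:
            self._send_bulk(batch)  # outside the lock: sends may overlap
        return bool(batch)


# Demo: four producer threads, a fake sender that just records batches.
sent_batches: List[List[dict]] = []
sent_lock = threading.Lock()


def fake_send(batch: List[dict]) -> None:
    with sent_lock:
        sent_batches.append(batch)


indexer = QueueingBulkIndexer(fake_send, batch_size=50)


def produce(start: int) -> None:
    for i in range(start, start + 250):
        indexer.add({"id": i})


threads = [threading.Thread(target=produce, args=(n * 250,)) for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
while indexer.flush():  # drain whatever is left after the producers finish
    pass
```

A thread that fails to get the lock simply returns; its document stays in the queue and is picked up by a later flush, which is why a periodic flush is still needed in a real setup.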

This has been in production for a while now and doesn’t seem to have any issues. I will report back if increased load or edge cases change that.

It’s important to reiterate the risk if this is the only indexing logic: your application node may fail and you may end up with missing data in Elasticsearch. We are not in that scenario, and I’m not sure which is the best approach to remedy it, be it a partial reindex of recent data in case of a failed server, or a batch process that checks for mismatches between the database and the index. Of course, we should also say that you may not always have a database; sometimes Elasticsearch is all you have for data storage, and in that case some sort of queue persistence is needed.
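As a rough illustration of the “check for mismatches” option, the comparison itself can be as simple as a set difference over document IDs. The function name find_mismatches is hypothetical, and how the two ID lists are fetched (for example, for a recent time window) is left to the application.

```python
from typing import Iterable, Set, Tuple


def find_mismatches(db_ids: Iterable[str],
                    index_ids: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Returns (IDs missing from the index, IDs in the index but not in the DB)."""
    db_set, index_set = set(db_ids), set(index_ids)
    return db_set - index_set, index_set - db_set


# Demo with made-up IDs: "b" was never indexed, "d" no longer exists in the DB.
missing_from_index, stale_in_index = find_mismatches(["a", "b", "c"],
                                                     ["a", "c", "d"])
```

The IDs in the first set would then be re-queued for indexing.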

The ultimate goal is near real-time indexing, since users expect to see their data as soon as possible, without overwhelming the Elasticsearch cluster.

The topic of “what’s the best way to index data” is huge and I hope I’ve clarified it at least a little bit and that our contribution makes sense for other scenarios as well.

The post Near Real-Time Indexing With ElasticSearch appeared first on Bozho's tech blog.

Reddit Repeat Infringer Policy Shuts Down Megalinks Piracy Sub

Post Syndicated from Andy original https://torrentfreak.com/reddit-repeat-infringer-policy-shuts-down-megalinks-piracy-sub-180430/

Without doubt, Reddit is one of the most popular sites on the entire Internet. At the time of writing it’s the fourth most visited site in the US with 330 million users per month generating 14 billion screenviews.

The core of the site’s success is its communities. Known as ‘sub-Reddits’ or just ‘subs’, there are currently 138,000 of them dedicated to every single subject you can think of and tens of thousands you’d never considered.

Even though they’re technically forbidden, a small but significant number are dedicated to piracy, offering links to copyright-infringing content hosted elsewhere. One of the most popular is /r/megalinks, which is dedicated to listing infringing content (mainly movies and TV shows) uploaded to file-hosting site Mega.

Considering its activities, Megalinks has managed to stay online longer than most people imagined but following an intervention from Reddit, the content indexing sub has stopped accepting new submissions, which will effectively shut it down.

In an announcement Sunday, the sub’s moderators explained that following a direct warning from Reddit’s administrators, the decision had been taken to move on.

“As most of you know by now, we’ve had to deal with a lot of DMCA takedowns over the last 6 months. Everyone knew this day would come, eventually, and its finally here,” they wrote.

“We received a formal warning from Reddit’s administration 2 days ago, and have decided to restrict new submissions for the safety of the subreddit.”

The message from Reddit’s operators makes it absolutely clear that Reddit isn’t the platform to host what amounts to a piracy links forum.

“This is an official warning from Reddit that we are receiving too many copyright infringement notices about material posted to your community. We will be required to ban this community if you can’t adequately address the problem,” the warning reads.

Noting that Redditors aren’t allowed to post content that infringes copyrights, the administrators say they are required by law to handle DMCA notices and that in cases where infringement happens on multiple occasions, that needs to be handled in a more aggressive manner.

“The law also requires us to issue bans in cases of repeat infringement. Sometimes a repeat infringement problem is limited to just one user and we ban just that person. Other times the problem pervades a whole community and we ban the community,” the admins continue.

“This is our formal warning about repeat infringement in this community. Over the past three months we’ve had to remove material from the community in response to copyright notices 60 times. That’s an unusually high number taking into account the community’s size.”

The warning suggests ways to keep infringing content down but in a sub dedicated to piracy, they’re all completely irrelevant. It also suggests removing old posts to ensure that Reddit doesn’t keep getting notices, but that would mean deleting pretty much everything. Backups exist but a simple file is a poor substitute for a community.

So, with Reddit warning that without change the sub will be banned, the moderators of /r/megalinks have decided to move on to a new home. Reportedly hosted ‘offshore’, their new forum already has more than 9,800 members and is likely to grow quickly as the word spreads.

A month ago, the /r/megaporn sub-Reddit suffered a similar fate following a warning from Reddit’s admins. It successfully launched a new external forum which is why the Megalinks crew decided on the same model.

“[A]fter seeing how /r/megaporn approached the same situation, we had started working on an offshore forum a week ago in anticipation of the ban. This allows us to work however we want, without having to deal with Reddit’s policies and administration,” the team explain.

Ever since the BMG v Cox case went badly for the ISP in 2015, repeat infringer policies have become a very hot topic in the US. That Reddit is now drawing a line in the sand over a relatively small number of complaints (at least compared to other similar platforms) is clear notice that Reddit and blatant piracy won’t be allowed to walk hand in hand.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and more.

Amazon SageMaker Now Supports Additional Instance Types, Local Mode, Open Sourced Containers, MXNet and Tensorflow Updates

Post Syndicated from Randall Hunt original https://aws.amazon.com/blogs/aws/amazon-sagemaker-roundup-sf/

Amazon SageMaker continues to iterate quickly and release new features on behalf of customers. Starting today, SageMaker adds support for many new instance types, local testing with the SDK, and Apache MXNet 1.1.0 and TensorFlow 1.6.0. Let’s take a quick look at each of these updates.

New Instance Types

Amazon SageMaker customers now have additional options for right-sizing their workloads for notebooks, training, and hosting. Notebook instances now support almost all T2, M4, P2, and P3 instance types with the exception of t2.micro, t2.small, and m4.large instances. Model training now supports nearly all M4, M5, C4, C5, P2, and P3 instances with the exception of m4.large, c4.large, and c5.large instances. Finally, model hosting now supports nearly all T2, M4, M5, C4, C5, P2, and P3 instances with the exception of m4.large instances. Many customers can take advantage of the newest P3, C5, and M5 instances to get the best price/performance for their workloads. Customers can also take advantage of the burstable compute model on T2 instances for endpoints or notebooks that are used less frequently.

Open Sourced Containers, Local Mode, and TensorFlow 1.6.0 and MXNet 1.1.0

Today Amazon SageMaker has open sourced the MXNet and TensorFlow deep learning containers that power the MXNet and TensorFlow estimators in the SageMaker SDK. The ability to write Python scripts that conform to a simple interface is still one of my favorite SageMaker features, and now those containers can be customized to include any additional libraries. You can download these containers locally to iterate and experiment, which can accelerate your debugging cycle. When you’re ready to go from local testing to production training and hosting, you just change one line of code.

These containers launch with support for TensorFlow 1.6.0 and MXNet 1.1.0 as well. TensorFlow 1.6.0 has a number of new features, including support for CUDA 9.0, cuDNN 7, and AVX instructions, which allow for significant speedups in many training applications. MXNet 1.1.0 adds a number of new features, including a Text API, mxnet.text, with support for text processing, indexing, glossaries, and more. Two of the really cool pre-trained embeddings included are GloVe and fastText.

Available Now

All of the features mentioned above are available today. As always, please let us know on Twitter or in the comments below if you have any questions or if you’re building something interesting. Now, if you’ll excuse me, I’m going to go experiment with some of those new MXNet APIs!

Randall