Tag Archives: Amazon Kendra

Scaling up a Serverless Web Crawler and Search Engine

Post Syndicated from Jack Stevenson original https://aws.amazon.com/blogs/architecture/scaling-up-a-serverless-web-crawler-and-search-engine/

Introduction

Building a search engine can be a daunting undertaking. You must continually scrape the web and index its content so it can be retrieved quickly in response to a user’s query. The goal is to implement this in a way that avoids infrastructure complexity while remaining elastic. However, the architecture that achieves this is not necessarily obvious. In this blog post, we will describe a serverless search engine that can scale to crawl and index large web pages.

A simple search engine is composed of two main components:

  • A web crawler (or web scraper) to extract and store content from the web
  • An index to answer search queries

Web Crawler

You may have already read “Serverless Architecture for a Web Scraping Solution.” In this post, Dzidas reviews two different serverless architectures for a web scraper on AWS. Using AWS Lambda provides a simple and cost-effective option for crawling a website. However, it comes with a caveat: the Lambda timeout capped crawling time at 15 minutes. You can tackle this limitation and build a serverless web crawler that can scale to crawl larger portions of the web.

A typical web crawler algorithm uses a queue of URLs to visit. It performs the following:

  • It takes a URL off the queue
  • It visits the page at that URL
  • It scrapes any URLs it can find on the page
  • It pushes the ones that it hasn’t visited yet onto the queue
  • It repeats the preceding steps until the URL queue is empty

Even if we parallelize visiting URLs, we may still exceed the 15-minute limit for larger websites.

Breaking Down the Web Crawler Algorithm

AWS Step Functions is a serverless function orchestrator. It enables you to sequence one or more AWS Lambda functions to create a longer running workflow. It’s possible to break down this web crawler algorithm into steps that can be run in individual Lambda functions. The individual steps can then be composed into a state machine, orchestrated by AWS Step Functions.

Here is a possible state machine you can use to implement this web crawler algorithm:

Figure 1: Basic State Machine

Figure 1: Basic State Machine

1. ReadQueuedUrls – reads any non-visited URLs from our queue
2. QueueContainsUrls? – checks whether there are non-visited URLs remaining
3. CrawlPageAndQueueUrls – takes one URL off the queue, visits it, and writes any newly discovered URLs to the queue
4. CompleteCrawl – when there are no URLs in the queue, we’re done!

Each part of the algorithm can now be implemented as a separate Lambda function. Instead of the entire process being bound by the 15-minute timeout, this limit will now only apply to each individual step.

Where you might have previously used an in-memory queue, you now need a URL queue that will persist between steps. One option is to pass the queue around as an input and output of each step. However, you may be bound by the maximum I/O sizes for Step Functions. Instead, you can represent the queue as an Amazon DynamoDB table, which each Lambda function may read from or write to. The queue is only required for the duration of the crawl. So you can create the DynamoDB table at the start of the execution, and delete it once the crawler has finished.

Scaling up

Crawling one page at a time is going to be a bit slow. You can use the Step Functions “Map state” to run the CrawlPageAndQueueUrls to scrape multiple URLs at once. You should be careful not to bombard a website with thousands of parallel requests. Instead, you can take a fixed-size batch of URLs from the queue in the ReadQueuedUrls step.

An important limit to consider when working with Step Functions is the maximum execution history size. You can protect against hitting this limit by following the recommended approach of splitting work across multiple workflow executions. You can do this by checking the total number of URLs visited on each iteration. If this exceeds a threshold, you can spawn a new Step Functions execution to continue crawling.

Step Functions has native support for error handling and retries. You can take advantage of this to make the web crawler more robust to failures.

With these scaling improvements, here’s our final state machine:

Figure 2: Final State Machine

Figure 2: Final State Machine

This includes the same steps as before (1-4), but also two additional steps (5 and 6) responsible for breaking the workflow into multiple state machine executions.

Search Index

Deploying a scalable, efficient, and full-text search engine that provides relevant results can be complex and involve operational overheads. Amazon Kendra is a fully managed service, so there are no servers to provision. This makes it an ideal choice for our use case. Amazon Kendra supports HTML documents. This means you can store the raw HTML from the crawled web pages in Amazon Simple Storage Service (S3). Amazon Kendra will provide a machine learning powered search capability on top, which gives users fast and relevant results for their search queries.

Amazon Kendra does have limits on the number of documents stored and daily queries. However, additional capacity can be added to meet demand through query or document storage bundles.

The CrawlPageAndQueueUrls step writes the content of the web page it visits to S3. It also writes some metadata to help Amazon Kendra rank or present results. After crawling is complete, it can then trigger a data source sync job to ensure that the index stays up to date.

One aspect to be mindful of while employing Amazon Kendra in your solution is its cost model. It is priced per index/hour, which is more favorable for large-scale enterprise usage, than for smaller personal projects. We recommend you take note of the free tier of Amazon Kendra’s Developer Edition before getting started.

Overall Architecture

You can add in one more DynamoDB table to monitor your web crawl history. Here is the architecture for our solution:

Figure 3: Overall Architecture

Figure 3: Overall Architecture

A sample Node.js implementation of this architecture can be found on GitHub.

In this sample, a Lambda layer provides a Chromium binary (via chrome-aws-lambda). It uses Puppeteer to extract content and URLs from visited web pages. Infrastructure is defined using the AWS Cloud Development Kit (CDK), which automates the provisioning of cloud applications through AWS CloudFormation.

The Amazon Kendra component of the example is optional. You can deploy just the serverless web crawler if preferred.

Conclusion

If you use fully managed AWS services, then building a serverless web crawler and search engine isn’t as daunting as it might first seem. We’ve explored ways to run crawler jobs in parallel and scale a web crawler using AWS Step Functions. We’ve utilized Amazon Kendra to return meaningful results for queries of our unstructured crawled content. We achieve all this without the operational overheads of building a search index from scratch. Review the sample code for a deeper dive into how to implement this architecture.

Reinventing Enterprise Search – Amazon Kendra is Now Generally Available

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/reinventing-enterprise-search-amazon-kendra-is-now-generally-available/

At the end of 2019, we launched a preview version of Amazon Kendra, a highly accurate and easy to use enterprise search service powered by machine learning. Today, I’m very happy to announce that Amazon Kendra is now generally available.

For all its amazing achievements in past decades, Information Technology had yet to solve a problem that all of us face every day: quickly and easily finding the information we need. Whether we’re looking for the latest version of the company travel policy, or asking a more technical question like “what’s the tensile strength of epoxy adhesives?”, we never seem to be able to get the correct answer right away. Sometimes, we never get it at all!

Not only are these issues frustrating for users, they’re also responsible for major productivity losses. According to an IDC study, the cost of inefficient search is $5,700 per employee per year: for a 1,000-employee company, $5.7 million evaporate every year, not counting the liability and compliance risks imposed by low accuracy search.

This problem has several causes. First, most enterprise data is unstructured, making it difficult to pinpoint the information you need. Second, data is often spread across organizational silos, and stored in heterogeneous backends: network shares, relational databases, 3rd party applications, and more. Lastly, keyword-based search systems require figuring out the right combination of keywords, and usually return a large number of hits, most of them irrelevant to our query.

Taking note of these pain points, we decided to help our customers build the search capabilities that they deserve. The result of this effort is Amazon Kendra.

Introducing Amazon Kendra
With just a few clicks, Amazon Kendra enables organizations to index structured and unstructured data stored in different backends, such as file systems, applications, Intranet, and relational databases. As you would expect, all data is encrypted in flight using HTTPS, and can be encrypted at rest with AWS Key Management Service (KMS).

Amazon Kendra is optimized to understand complex language from domains like IT (e.g. “How do I set up my VPN?”), healthcare and life sciences (e.g. “What is the genetic marker for ALS?”), and many other domains. This multi-domain expertise allows Kendra to find more accurate answers. In addition, developers can explicitly tune the relevance of results, using criteria such as authoritative data sources or document freshness.

Kendra search can be quickly deployed to any application (search page, chat apps, chatbots, etc.) via the code samples available in the AWS console, or via APIs. Customers can be up and running with state the art semantic search from Kendra in minutes.

Many organizations already use Amazon Kendra today. For example, the Allen Institute is committed to solving some of the biggest mysteries of bioscience, researching the unknown of human biology, in the brain, the human cell and the immune system. Says Dr. Oren Etzioni, Chief Executive Officer of the Allen Institute for AI: “One of the most impactful things AI like Amazon Kendra can do right now is help scientists, academics, and technologists quickly find the right information in a sea of scientific literature and move important research faster. The Semantic Scholar team at Allen Institute for AI, along with our partners, is proud to provide CORD-19 and to support the AI resources the community is building to leverage this resource to tackle this crucial problem”.

Introducing New Features in Amazon Kendra
Based on customer feedback collected during the preview phase, we added the following features to Amazon Kendra.

  • New scaling options for the Enterprise Edition, as well as a newly-introduced Developer Edition (see details below).
  • 3 new Cloud Connectors: OneDrive, Salesforce, and ServiceNow (in addition to S3, RDS, and SharePoint Online).
  • Expertise on 8 new domains: Automotive, Health, HR, Legal, Media and Entertainment, News, Telecom, Travel and Leisure (in addition to Chemical, Energy, Finance, Insurance, IT, and Pharmaceuticals).
  • Faster indexing, and improved accuracy.

Indexing Data with Amazon Kendra
For the purpose of this demo, I downloaded a small subset of Wikipedia (about 50,000 web pages). I uploaded the individual files in HTML format to an Amazon Simple Storage Service (S3) bucket.

Heading out to the Kendra console, I start by creating a new index, giving it a name and a description. One click is all it takes to enable encryption with AWS Key Management Service (KMS).

After 30 minutes or so, the index is in service. I can now add data sources to it.

Adding my S3 bucket is extremely easy. I first enter a name for the data source.

Then, I define the name of the S3 bucket. I also need to specify the name of the IAM role used by Kendra, either selecting an existing role or creating a new one.

I’m given the choice to schedule synchronization at periodic intervals, in order to refresh the index with new data added to the data source. I go for a daily refresh running at midnight.

The next screen lets me review all parameters, and create the data source. Once it’s active, I launch the initial synchronization by clicking on the “Sync now” button.

After a little while, synchronization is complete. Moving to the test window, I can now start running queries on the index.

Querying Data with Amazon Kendra
While working on one of my posts the other day, I listened to a Jazz song that I really liked, played by a musician named Thad Jones. Knowing absolutely nothing about Jazz players, I’m curious whether Kendra can help me learn more.

Unsurprisingly, this query matches a large number of documents. However, Kendra comes up with a suggested answer, a high confidence answer to my query. It points at a specific paragraph in one of indexed pages. Relevant content is highlighted for more convenience, and I can immediately see that this is the right answer to my query. No need to look any further! Accordingly, I give it a thumbs up so that Amazon Kendra knows that this is indeed a good answer.

Looking to learn more about Thad Jones, I ask a second question.

Once again, I get a suggested answer. This time, Kendra went one step further by returning the exact answer from the document, instead of just returning the document itself. This shows how Kendra is able to understand context and extract relationships, in this case the link between an individual and their city of birth.

Still curious, I ask a third question.

I get another suggested answer, and it’s once again right on target. The information I’m looking for is in the first sentence: Thad Jones has played with Count Basie. As you can see, the paragraph above doesn’t even include the word “play”. Yet, Amazon Kendra interpreted my question correctly. Thad Jones is a musician: if I’m asking about him playing with someone else, it’s very likely that I’m looking for other musicians, not for sport partners! This ability to understand natural language queries and to extract deep domain knowledge is what makes Amazon Kendra so accurate.

Getting Started
Amazon Kendra is available today in US East (N. Virginia), US West (Oregon), and Europe (Ireland).

You can pick one of two editions.

The Enterprise Edition lets you search up to 500,000 documents, and run up to 40,000 queries per day for $7 per hour. You will also be charged $0.000001 per document scanned, and $0.35 per hour per connector when syncing. If you need more indexing or querying capacity, you can now scale each independently: $3.5 per hour for additional 40,000 queries, and $3.5 per hour for additional 500,000 searchable documents.

The Developer Edition has the same features as the Enterprise Edition. However, it’s limited to 4,000 queries per day, on up to 10,000 searchable documents across 5 data sources. No scaling options are available. Please note that the Developer Edition runs on a single availability zone, which is why it shouldn’t be used for production purposes.

Please give Amazon Kendra a try! We’d love to get your feedback, either through your usual AWS Support contacts, or on the AWS Forum for Kendra.

– Julien