Tag Archives: Amazon Textract

Unstructured data management and governance using AWS AI/ML and analytics services

2023-10-25 Sakti Mishra

Post Syndicated from Sakti Mishra original https://aws.amazon.com/blogs/big-data/unstructured-data-management-and-governance-using-aws-ai-ml-and-analytics-services/

Unstructured data is information that doesn’t conform to a predefined schema or isn’t organized according to a preset data model. Unstructured information may have a little or a lot of structure but in ways that are unexpected or inconsistent. Text, images, audio, and videos are common examples of unstructured data. Most companies produce and consume unstructured data such as documents, emails, web pages, engagement center phone calls, and social media. By some estimates, unstructured data can make up to 80–90% of all new enterprise data and is growing many times faster than structured data. After decades of digitizing everything in your enterprise, you may have an enormous amount of data, but with dormant value. However, with the help of AI and machine learning (ML), new software tools are now available to unearth the value of unstructured data.

In this post, we discuss how AWS can help you successfully address the challenges of extracting insights from unstructured data. We discuss various design patterns and architectures for extracting and cataloging valuable insights from unstructured data using AWS. Additionally, we show how to use AWS AI/ML services for analyzing unstructured data.

Why it’s challenging to process and manage unstructured data

Unstructured data makes up a large proportion of the data in the enterprise that can’t be stored in a traditional relational database management systems (RDBMS). Understanding the data, categorizing it, storing it, and extracting insights from it can be challenging. In addition, identifying incremental changes requires specialized patterns and detecting sensitive data and meeting compliance requirements calls for sophisticated functions. It can be difficult to integrate unstructured data with structured data from existing information systems. Some view structured and unstructured data as apples and oranges, instead of being complementary. But most important of all, the assumed dormant value in the unstructured data is a question mark, which can only be answered after these sophisticated techniques have been applied. Therefore, there is a need to being able to analyze and extract value from the data economically and flexibly.

Solution overview

Data and metadata discovery is one of the primary requirements in data analytics, where data consumers explore what data is available and in what format, and then consume or query it for analysis. If you can apply a schema on top of the dataset, then it’s straightforward to query because you can load the data into a database or impose a virtual table schema for querying. But in the case of unstructured data, metadata discovery is challenging because the raw data isn’t easily readable.

You can integrate different technologies or tools to build a solution. In this post, we explain how to integrate different AWS services to provide an end-to-end solution that includes data extraction, management, and governance.

The solution integrates data in three tiers. The first is the raw input data that gets ingested by source systems, the second is the output data that gets extracted from input data using AI, and the third is the metadata layer that maintains a relationship between them for data discovery.

The following is a high-level architecture of the solution we can build to process the unstructured data, assuming the input data is being ingested to the raw input object store.

Unstructured Data Management - Block Level Architecture Diagram

The steps of the workflow are as follows:

Integrated AI services extract data from the unstructured data.
These services write the output to a data lake.
A metadata layer helps build the relationship between the raw data and AI extracted output. When the data and metadata are available for end-users, we can break the user access pattern into additional steps.
In the metadata catalog discovery step, we can use query engines to access the metadata for discovery and apply filters as per our analytics needs. Then we move to the next stage of accessing the actual data extracted from the raw unstructured data.
The end-user accesses the output of the AI services and uses the query engines to query the structured data available in the data lake. We can optionally integrate additional tools that help control access and provide governance.
There might be scenarios where, after accessing the AI extracted output, the end-user wants to access the original raw object (such as media files) for further analysis. Additionally, we need to make sure we have access control policies so the end-user has access only to the respective raw data they want to access.

Now that we understand the high-level architecture, let’s discuss what AWS services we can integrate in each step of the architecture to provide an end-to-end solution.

The following diagram is the enhanced version of our solution architecture, where we have integrated AWS services.

Unstructured Data Management - AWS Native Architecture

Let’s understand how these AWS services are integrated in detail. We have divided the steps into two broad user flows: data processing and metadata enrichment (Steps 1–3) and end-users accessing the data and metadata with fine-grained access control (Steps 4–6).

Various AI services (which we discuss in the next section) extract data from the unstructured datasets.
The output is written to an Amazon Simple Storage Service (Amazon S3) bucket (labeled Extracted JSON in the preceding diagram). Optionally, we can restructure the input raw objects for better partitioning, which can help while implementing fine-grained access control on the raw input data (labeled as the Partitioned bucket in the diagram).
After the initial data extraction phase, we can apply additional transformations to enrich the datasets using AWS Glue. We also build an additional metadata layer, which maintains a relationship between the raw S3 object path, the AI extracted output path, the optional enriched version S3 path, and any other metadata that will help the end-user discover the data.
In the metadata catalog discovery step, we use the AWS Glue Data Catalog as the technical catalog, Amazon Athena and Amazon Redshift Spectrum as query engines, AWS Lake Formation for fine-grained access control, and Amazon DataZone for additional governance.
The AI extracted output is expected to be available as a delimited file or in JSON format. We can create an AWS Glue Data Catalog table for querying using Athena or Redshift Spectrum. Like the previous step, we can use Lake Formation policies for fine-grained access control.
Lastly, the end-user accesses the raw unstructured data available in Amazon S3 for further analysis. We have proposed integrating Amazon S3 Access Points for access control at this layer. We explain this in detail later in this post.

Now let’s expand the following parts of the architecture to understand the implementation better:

Using AWS AI services to process unstructured data
Using S3 Access Points to integrate access control on raw S3 unstructured data

Process unstructured data with AWS AI services

As we discussed earlier, unstructured data can come in a variety of formats, such as text, audio, video, and images, and each type of data requires a different approach for extracting metadata. AWS AI services are designed to extract metadata from different types of unstructured data. The following are the most commonly used services for unstructured data processing:

Amazon Comprehend – This natural language processing (NLP) service uses ML to extract metadata from text data. It can analyze text in multiple languages, detect entities, extract key phrases, determine sentiment, and more. With Amazon Comprehend, you can easily gain insights from large volumes of text data such as extracting product entity, customer name, and sentiment from social media posts.
Amazon Transcribe – This speech-to-text service uses ML to convert speech to text and extract metadata from audio data. It can recognize multiple speakers, transcribe conversations, identify keywords, and more. With Amazon Transcribe, you can convert unstructured data such as customer support recordings into text and further derive insights from it.
Amazon Rekognition – This image and video analysis service uses ML to extract metadata from visual data. It can recognize objects, people, faces, and text, detect inappropriate content, and more. With Amazon Rekognition, you can easily analyze images and videos to gain insights such as identifying entity type (human or other) and identifying if the person is a known celebrity in an image.
Amazon Textract – You can use this ML service to extract metadata from scanned documents and images. It can extract text, tables, and forms from images, PDFs, and scanned documents. With Amazon Textract, you can digitize documents and extract data such as customer name, product name, product price, and date from an invoice.
Amazon SageMaker – This service enables you to build and deploy custom ML models for a wide range of use cases, including extracting metadata from unstructured data. With SageMaker, you can build custom models that are tailored to your specific needs, which can be particularly useful for extracting metadata from unstructured data that requires a high degree of accuracy or domain-specific knowledge.
Amazon Bedrock – This fully managed service offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon with a single API. It also offers a broad set of capabilities to build generative AI applications, simplifying development while maintaining privacy and security.

With these specialized AI services, you can efficiently extract metadata from unstructured data and use it for further analysis and insights. It’s important to note that each service has its own strengths and limitations, and choosing the right service for your specific use case is critical for achieving accurate and reliable results.

AWS AI services are available via various APIs, which enables you to integrate AI capabilities into your applications and workflows. AWS Step Functions is a serverless workflow service that allows you to coordinate and orchestrate multiple AWS services, including AI services, into a single workflow. This can be particularly useful when you need to process large amounts of unstructured data and perform multiple AI-related tasks, such as text analysis, image recognition, and NLP.

With Step Functions and AWS Lambda functions, you can create sophisticated workflows that include AI services and other AWS services. For instance, you can use Amazon S3 to store input data, invoke a Lambda function to trigger an Amazon Transcribe job to transcribe an audio file, and use the output to trigger an Amazon Comprehend analysis job to generate sentiment metadata for the transcribed text. This enables you to create complex, multi-step workflows that are straightforward to manage, scalable, and cost-effective.

The following is an example architecture that shows how Step Functions can help invoke AWS AI services using Lambda functions.

AWS AI Services - Lambda Event Workflow -Unstructured Data

The workflow steps are as follows:

Unstructured data, such as text files, audio files, and video files, are ingested into the S3 raw bucket.
A Lambda function is triggered to read the data from the S3 bucket and call Step Functions to orchestrate the workflow required to extract the metadata.
The Step Functions workflow checks the type of file, calls the corresponding AWS AI service APIs, checks the job status, and performs any postprocessing required on the output.
AWS AI services can be accessed via APIs and invoked as batch jobs. To extract metadata from different types of unstructured data, you can use multiple AI services in sequence, with each service processing the corresponding file type.
After the Step Functions workflow completes the metadata extraction process and performs any required postprocessing, the resulting output is stored in an S3 bucket for cataloging.

Next, let’s understand how can we implement security or access control on both the extracted output as well as the raw input objects.

Implement access control on raw and processed data in Amazon S3

We just consider access controls for three types of data when managing unstructured data: the AI-extracted semi-structured output, the metadata, and the raw unstructured original files. When it comes to AI extracted output, it’s in JSON format and can be restricted via Lake Formation and Amazon DataZone. We recommend keeping the metadata (information that captures which unstructured datasets are already processed by the pipeline and available for analysis) open to your organization, which will enable metadata discovery across the organization.

To control access of raw unstructured data, you can integrate S3 Access Points and explore additional support in the future as AWS services evolve. S3 Access Points simplify data access for any AWS service or customer application that stores data in Amazon S3. Access points are named network endpoints that are attached to buckets that you can use to perform S3 object operations. Each access point has distinct permissions and network controls that Amazon S3 applies for any request that is made through that access point. Each access point enforces a customized access point policy that works in conjunction with the bucket policy that is attached to the underlying bucket. With S3 Access Points, you can create unique access control policies for each access point to easily control access to specific datasets within an S3 bucket. This works well in multi-tenant or shared bucket scenarios where users or teams are assigned to unique prefixes within one S3 bucket.

An access point can support a single user or application, or groups of users or applications within and across accounts, allowing separate management of each access point. Every access point is associated with a single bucket and contains a network origin control and a Block Public Access control. For example, you can create an access point with a network origin control that only permits storage access from your virtual private cloud (VPC), a logically isolated section of the AWS Cloud. You can also create an access point with the access point policy configured to only allow access to objects with a defined prefix or to objects with specific tags. You can also configure custom Block Public Access settings for each access point.

The following architecture provides an overview of how an end-user can get access to specific S3 objects by assuming a specific AWS Identity and Access Management (IAM) role. If you have a large number of S3 objects to control access, consider grouping the S3 objects, assigning them tags, and then defining access control by tags.

S3 Access Points - Unstructured Data Management - Access Control

If you are implementing a solution that integrates S3 data available in multiple AWS accounts, you can take advantage of cross-account support for S3 Access Points.

Conclusion

This post explained how you can use AWS AI services to extract readable data from unstructured datasets, build a metadata layer on top of them to allow data discovery, and build an access control mechanism on top of the raw S3 objects and extracted data using Lake Formation, Amazon DataZone, and S3 Access Points.

In addition to AWS AI services, you can also integrate large language models with vector databases to enable semantic or similarity search on top of unstructured datasets. To learn more about how to enable semantic search on unstructured data by integrating Amazon OpenSearch Service as a vector database, refer to Try semantic search with the Amazon OpenSearch Service vector engine.

As of writing this post, S3 Access Points is one of the best solutions to implement access control on raw S3 objects using tagging, but as AWS service features evolve in the future, you can explore alternative options as well.

About the Authors

Sakti Mishra is a Principal Solutions Architect at AWS, where he helps customers modernize their data architecture and define their end-to-end data strategy, including data security, accessibility, governance, and more. He is also the author of the book Simplify Big Data Analytics with Amazon EMR. Outside of work, Sakti enjoys learning new technologies, watching movies, and visiting places with family.

Bhavana Chirumamilla is a Senior Resident Architect at AWS with a strong passion for data and machine learning operations. She brings a wealth of experience and enthusiasm to help enterprises build effective data and ML strategies. In her spare time, Bhavana enjoys spending time with her family and engaging in various activities such as traveling, hiking, gardening, and watching documentaries.

Sheela Sonone is a Senior Resident Architect at AWS. She helps AWS customers make informed choices and trade-offs about accelerating their data, analytics, and AI/ML workloads and implementations. In her spare time, she enjoys spending time with her family—usually on tennis courts.

Daniel Bruno is a Principal Resident Architect at AWS. He had been building analytics and machine learning solutions for over 20 years and splits his time helping customers build data science programs and designing impactful ML products.

Optimizing data with automated intelligent document processing solutions

2023-04-28 Deependra Shekhawat

Post Syndicated from Deependra Shekhawat original https://aws.amazon.com/blogs/architecture/optimizing-data-with-automated-intelligent-document-processing-solutions/

Many organizations struggle to effectively manage and derive insights from the large amount of unstructured data locked in emails, PDFs, images, scanned documents, and more. The variety of formats, document layouts, and text makes it difficult for any standard Optical Character Recognition (OCR) to extract key insights from these data sources.

To help organizations overcome these document management and information extraction challenges, AWS offers connected, pre-trained artificial intelligence (AI) service APIs that help drive business outcomes from these document-based rich data sources.

This blog post describes a cost-effective, scalable automated intelligent document processing solution that leverages a Natural Processing Language (NLP) engine using Amazon Textract and Amazon Comprehend. This solution helps customers take advantage of industry leading machine learning (ML) technology in their document workflows without the need for in-house ML expertise.

Customer document management challenges

Customers across industry verticals experience the following document management challenges:

Extraction process accuracy varies significantly when applied to diverse sources; specifically handwritten text, images, and scanned documents.
Existing scripting and rule-based solutions cannot provide customer domain or problem-specific classifiers.
Traditional document management systems cannot consider feedback from domain experts to improve the learning process.
The Personally Identifiable Information (PII) data-handling is not robust or customizable, causing data privacy leakage concern.
Many manual interventions are required to complete the entire process.

Automated intelligent document processing solution

We introduced an automated intelligent document processing implementation to address key document management challenges. At the heart of the solution is a NLP engine that combines:

Amazon Textract
Amazon Comprehend
Amazon SageMaker
Custom regular expression-based Python parser

The full solution also leverages other AWS services as described in the following diagram (Figure 1) and steps to develop and operate a cost-effective and scalable architecture for document processing. It effectively extracts text from document types including PDFs, images, scanned documents, Microsoft Excel workbooks, and more.

Figure 1: AI-based intelligent document processing engine

Solution overview

Let’s explore the automated intelligent document processing solution step by step.

The document upload engine or business users upload the respective files or documents through a custom web application to the designated Amazon Simple Storage Service (Amazon S3) bucket.
The event-based architecture signals an Amazon S3 push event to invoke the respective AWS Lambda function to start document pre-processing.
The Lambda function evaluates the document payload, leverages Amazon Simple Queue Service (Amazon SQS) for async processing, prepares document metadata, stores it in Amazon DynamoDB, and calls the NLP engine to perform the information extraction process.
The NLP engine leverages Amazon Textract for text extraction from a variety of sources and leverages document metadata to optimize the appropriate API calls (for example, form, tabular, or PDF).
- Amazon Textract output is fed into Amazon Comprehend which consumes the extracted text and performs entity parsing, line/paragraph-based sentiment analysis, and document/paragraph classification. For better accuracy, we leverage a custom classifier within Amazon Comprehend.
- Amazon Comprehend also provides key APIs to mask PII data before it is used for any further consumption. The solution offers the ability to configure masking rules for each PII entity per masking requirements.
- To ensure the solution has capability to handle data from Microsoft Excel workbooks, we developed a custom parser using Python running inside an AWS Lambda function. Depending on the document metadata, this function can be invoked.
Output of Amazon Comprehend is then fed to ML models deployed using Amazon SageMaker depending on additional use cases configured by the customer to complement the overall process with ML-based recommendations, predictions, and personalization.
Once the NLP engine completes its processing, the job completion notification event signals another AWS Lambda function and updates the status in the respective Amazon SQS queue.
The Lambda post-processing function parses the resultant content generated by the NLP engine and stores it in the Amazon DynamoDB and Amazon S3 bucket. This step is responsible for the required data augmentation, key entities validation, and default value assignment to create a data structure that could be consumed by the presentation/visualization layer.
Users get the flexibility to see the extracted information and compare it with the original document extract in the custom user interface (UI). They can provide their feedback on extraction and entity parsing accuracy. From a user access management perspective, Amazon Cognito provides authorization and authentication.

Customer benefits

The automated intelligent document processing solution helps customers:

Increase overall document management efficiency by 50-60%, leveraging automation and nullifying manual interventions
Reduce in-house team involvement in administrative activities by up to 70% using integrated and connected processing workflows
Gain better visibility into key contractual obligations with features such as Document Classification (helps properly route documents to the respective process/team) and Obligation Extraction
Utilize a UI-based feedback mechanism for in-house domain experts/reviewers to see and validate the extracted information and offer feedback to inform further model training

From a cost-optimization perspective, depending on document type and required information, only the respective Amazon Textract APIs calls are submitted. (For example, it is not worth using form/table-based Textract API calls for a Know Your Customer (KYC) document such as a driver’s license or passport when the AnalyzeID API is the most efficient solution.)

To maximize solution benefits, customers should invest time in building well-defined taxonomies ahead of using the document processing solution to accommodate their own use cases or industry domain-specific requirements. Their taxonomy input highlights only relevant keys and takes respective actions in case the requires keys are not extracted.

Vertical industry use cases

As mentioned, this document processing solution can be used across industry segments. Let’s explore some practical use cases. For example, it can help insurance industry professionals to accelerate claim processing and customer KYC-related processes. By extracting the key entities from the claim documents, mapping them against the customer defined taxonomy, and integrating with Amazon SageMaker models for anomaly detection (anomalous claims), insurance providers can improve claim management and customer satisfaction.

In the healthcare industry, the solution can help with medical records and report processing, key medical entity extraction, and customer data masking.

The document processing solution can help the banking industry by automating check processing and delivering the ability to extract key entities like payer, payee, date, and amount from the checks.

Conclusion

Manual document processing is resource-intensive, time consuming, and costly. Customers need to allocate resources to process large volume documents, lowering business agility. Their employees are performing manual “stare and compare” tasks, potentially reducing worker morale and preventing them from focusing where their efforts are better placed.

Intelligent document processing helps businesses overcome these challenges by automating the classification, extraction, and analysis of data. This expedites decision cycles, allocates resources to high-value tasks, and reduces costs.

Pre-trained APIs of AWS AI services allow for quick classification, extraction, and data analyzation from scores of documents. This solution also has industry specific features that can quickly process specialized industry specific documents. This blog discussed the foundational architecture to helps to accelerate implementation of any specific document processing use case.

Classifying and Extracting Mortgage Loan Data with Amazon Textract

2022-11-28 Steve Roberts

Post Syndicated from Steve Roberts original https://aws.amazon.com/blogs/aws/classifying-and-extracting-mortgage-loan-data-with-amazon-textract/

Mortgage loan applications, at least in the United States, comprise around 500 or more pages of diverse documents. In order for applications to be reviewed, all these documents need to be classified, and the data on each form extracted. This isn’t as easy as it might sound! Besides different data structures in each document, the same data element may have different names on different documents—for example, SSN, or Social Security Number, or Tax ID. These three all refer to the same data.

Today, a new Analyze Lending API, for analyzing and classifying the documents contained in mortgage loan application packages, and extracting the data they contain, is available for Amazon Textract. The new API was created in response to requests from major lenders in the industry to help them process applications faster and reduce errors, which improves the end-customer experience and lower operating costs.

Until now, classification and extraction of data from mortgage loan application packages have been human-intensive tasks, although some lenders have used a hybrid approach, using technology such as Amazon Textract. However, customers told us that they needed even greater workflow automation to speed up automation efforts and reduce human error so that their staff could focus on higher-value tasks.

The new API also provides additional value-add services. It’s able to perform signature detection in terms of which documents have signatures and which don’t. It also provides a summary output of the documents in a mortgage application package and identifies select important documents such as bank statements and 1003 forms that would normally be present. The new workflow is powered by a collection of machine learning (ML) models. When a mortgage application package is uploaded, the workflow classifies the documents in the package before routing them to the right ML model, based on their classification, for data extraction.

Test-Driving the New Analyze Lending API
Although the new API is intended for lenders to incorporate into their business process workflows and applications, anyone can actually try it using the Amazon Textract console. This enables you to see how the API classifies documents and extracts the data elements they contain. If you’re interested in the application of machine learning and artificial intelligence, this may be of interest to you even if you’re not processing a mortgage application package.

I start by opening the Amazon Textract console, expanding Analyze Lending in the navigation panel, and then selecting Demo. The demo console immediately analyzes a set of synthetic test files, and outputs the results shown below (you can always restart the demo by clicking the Reset demo button). I get a summary of the analysis results and a document carousel for each of the documents in the package. The demo console also has a handy help panel containing (among other things) a summary of terminology related to the documents.

In the carousel I can see that one document has a signature badge, indicating a signature was detected, but, before taking a look, if I scroll the carousel, I can see that one document was labeled Unclassified:

Returning in the carousel to the document marked with a signature badge, I can see that it’s a check. Signature detection is usually a highly manual process so having the document analysis automatically mark when one is detected is a significant time saver.

Payslips are another document type that customers have told us can be difficult and time-consuming to handle. Selecting the detected payslip in the carousel shows the data extracted from it.

The synthetic data in the demo console provides an overview of how the API is able to analyze, classify, and extract data from the documents in a mortgage application package. However, I can also use my own documents. To do this in the demo console, I click the Upload package button and provide a single file, up to 5 MB, and 10 pages maximum for testing in the demo console, containing documents to analyze. Outside use in the demo console, the API supports documents with up to 3000 pages.

The results, for both the synthetic and your own data, can be downloaded by clicking the Download results button. This provides a .zip file containing four files—two are the raw JSON responses from the API. The other two are CSV-format files containing the summary (summary.csv) and the extracted data (extractions.csv). Both files are in key-value format.

The contents of the summary data file, for the synthetic test data, are below.

'DocumentName,'FirstPage,'LastPage
"'Payslips","'1","'1"
"'Checks","'2","'2"
"'Identity document","'3","'3"
"'1099 DIV","'4","'4"
"'Bank statement","'5","'5"
"'W2","'6","'6"
"'Unclassified","'7","'7"

Below is an example of the data contained in the extractions file.

'key,'value
"'PAY PERIOD END DATE","'7/18/2008"
"'PAY DATE","'7/25/2008"
"'BORROWER NAME","'JOHN STILES"
"'BORROWER ADDRESS","'101 MAIN STREET ANYTOWN, USA 12345"
"'COMPANY NAME","'ANY COMPANY CORP."
"'COMPANY ADDRESS","'475 ANY AVENUE ANYTOWN, USA 10101"
"'FEDERAL FILING STATUS","'Married"
"'STATE FILING STATUS","'2"
"'CURRENT GROSS PAY","'$ 452.43"
"'YTD GROSS PAY","'23,526.80"
"'CURRENT NET PAY","'$ 291.90"
"'REGULAR HOURLY RATE","'10.00"
"'HOLIDAY HOURLY RATE","'10.00"
"'WARNINGS MESSAGES NOTES","'EFFECTIVE THIS PAY PERIOD YOUR REGULAR HOURLY RATE HAS BEEN CHANGED FROM $8.00 TO $10.00 PER HOUR."
"'CURRENT REGULAR PAY","'320"
...

Try the Analyze Lending API Yourself
The new API is available in all Regions where Amazon Textract is offered but do be aware that the workflow and processing are focused on US-centric documents. Pricing for the new API is the same as for the existing table, form, and queries. You can find more details on the service pricing page. Finally, you can read more on the API in the Developer Guide.

Explore the new Analyze Lending API for yourself today in the Amazon Textract console!

— Steve

AWS Week in Review – November 7, 2022

2022-11-07 Jeff Barr

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/aws-week-in-review-november-7-2022/

With three weeks to go until AWS re:Invent opens in Las Vegas, the AWS News Blog Team is hard at work creating blog posts to share the latest launches and previews with you. As usual, we have a strong mix of new services, new features, and a surprise or two.

Last Week’s Launches
Here are some launches that caught my eye last week:

Amazon SNS Data Protection and Masking – After a quick public preview, this cool feature is now generally available. It uses pattern matching, machine learning models, and content policies to help protect data at scale. You can find many different kinds of personally identifiable information (PII) and protected health information (PHI) in message bodies and either block message delivery or mask (de-identify) the sensitive data, all in real-time and on a per-topic basis. To learn more, read the blog post or the message data protection documentation.

Amazon Textract Updates – This service extracts text, handwriting, and data from any document or image. This past week we updated the AnalyzeID function so that it can now extract the machine readable zone (MRZ) on passports issued by the United States, and we added the entire OCR output to the API response. We also updated the machine learning models that power the AnalyzeDocument function, with a focus on single-character boxed forms commonly found on tax and immigration documents. Finally, we updated the AnalyzeExpense function with support for new fields and higher accuracy for existing fields, bringing the total field count to more than 40.

Another Amazon Braket Processor – Our quantum computing service now supports Aquila, a new 256-qubit quantum computer from QuEra that is based on a programmable array of neutral Rubidium atoms. According to the What’s New, Aquila supports the Analog Hamiltonian Simulation (AHS) paradigm, allowing it to solve for the static and dynamic properties of quantum systems composed of many interacting particles.

Amazon S3 on Outposts – This service now lets you use additional S3 Lifecycle rules to optimize capacity management. You can expire objects as they age or are replaced with newer versions, with control at the bucket level, or for subsets defined by prefixes, object tags, or object sizes. There’s more info in the What’s New and in the S3 documentation.

AWS CloudFormation – There were two big updates last week: support for Amazon RDS Multi-AZ deployments with two readable standbys, and better access to detailed information on failed stack instances for operations on CloudFormation StackSets.

Amazon MemoryDB for Redis – You can now use data tiering as a lower cost way to to scale your clusters up to hundreds of terabytes of capacity. This new option uses a combination of instance memory and SSD storage in each cluster node, with all data stored durably in a multi-AZ transaction log. There’s more information in the What’s New and the blog post.

Amazon EC2 – You can now remove launch permissions for Amazon Machine Images (AMIs) that are directly shared with your AWS account.

X in Y – We launched existing AWS services and instance types in additional Regions:

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
Here are some additional news items that you may find interesting:

AWS Open Source News and Updates – My colleague Ricardo Sueiras highlights new open source projects, tools, and demos from the AWS Community. Read Installment 134 to see what’s going on!

New Case Study – A new AWS case study describes how Taggle (a company focused on smart water solutions in Australia) created an IoT platform that runs on AWS and uses Amazon Kinesis Data Streams to store & ingest data in real time. Using AWS allowed them to scale to accommodate 80,000 additional sensors that will roll out in 2022.

Upcoming AWS Events
re:Invent 2022 – AWS re:Invent is just three weeks away! Join us live from November 28th to December 2nd for keynotes, training and certification opportunities, and over 1,500 technical sessions. If you cannot make it to Las Vegas you can also join us online to watch the keynotes and leadership sessions live. Be sure to check out the re:Invent 2022 Attendee Guides, each curated by an AWS Hero, AWS industry team, or AWS partner.

PeerTalk – If you will be attending re:Invent in person and are interested in meeting with me or any of our featured experts, be sure to check out PeerTalk, our new onsite networking program.

That’s all for this week!

— Jeff;

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS.

Automate your Data Extraction for Oil Well Data with Amazon Textract

2022-02-24 Ashutosh Pateriya

Post Syndicated from Ashutosh Pateriya original https://aws.amazon.com/blogs/architecture/automate-your-data-extraction-for-oil-well-data-with-amazon-textract/

Traditionally, many businesses archive physical formats of their business documents. These can be invoices, sales memos, purchase orders, vendor-related documents, and inventory documents. As more and more businesses are moving towards digitizing their business processes, it is becoming challenging to effectively manage these documents and perform business analytics on them. For example, in the Oil and Gas (O&G) industry, companies have numerous documents that are generated through the exploration and production lifecycle of an oil well. These documents can provide many insights that can help inform business decisions.

As documents are usually stored in a paper format, information retrieval can be time consuming and cumbersome. Even those available in a digital format may not have adequate metadata associated to efficiently perform search and build insights.

In this post, you will learn how to build a text extraction solution using Amazon Textract service. This will automatically extract text and data from scanned documents and upload into Amazon Simple Storage Service (S3). We will show you how to find insights and relationships in the extracted text using Amazon Comprehend. This data is indexed and populated into Amazon OpenSearch Service to search and visualize it in a Kibana dashboard.

Figure 1 illustrates a solution built with AWS, which extracts O&G well data information from PDF documents. This solution is serverless and built using AWS Managed Services. This will help you to decrease system maintenance overhead while making your solution scalable and reliable.

Figure 1. Automated form data extraction architecture

Following are the high-level steps:

Upload an image file or PDF document to Amazon S3 for analysis. Amazon S3 is a durable document storage used for central document management.
Amazon S3 event initiates the AWS Lambda function Fn-A. AWS Lambda has functional logic to call the Amazon Textract and Comprehend services and processing.
AWS Lambda function Fn-A invokes Amazon Textract to extract text as key-value pairs from image or PDF. Amazon Textract automatically extracts data from the scanned documents.
Amazon Textract sends the extracted keys from image/PDF to Amazon SNS.
Amazon SNS notifies Amazon SQS when text extraction is complete by sending the extracted keys to Amazon SQS.
Amazon SQS initiates AWS Lambda function Fn-B with the extracted keys.
AWS Lambda function Fn-B invokes Amazon Comprehend for the custom entity recognition. Comprehend uses custom-trained machine learning (ML) to find discrepancies in key names from Amazon Textract.
The data is indexed and loaded into Amazon OpenSearch, which indexes and visualizes the data.
Kibana processes the indexed data.
User accesses Kibana to search documents.

Steps illustrated with more detail:

1. User uploads the document for analysis to Amazon S3. Uploaded document can be an image file or a PDF. Here we are using the S3 console for document upload. Figure 2 shows the sample file used for this demo.

Figure 2. Sample input form

2. Amazon S3 upload event initiates AWS Lambda function Fn-A. Refer to the AWS tutorial to learn about S3 Lambda configuration. View Sample code for Lambda FunctionA.

3. AWS Lambda function Fn-A invokes Amazon Textract. Amazon Textract uses artificial intelligence (AI) to read as a human would, by extracting text, layouts, tables, forms, and structured data with context and without configuration, training, or custom code.

4. Amazon Textract starts processing the file as it is uploaded. This process takes few minutes since the file is a multipage document.

5. Amazon SNS notifies Amazon Textract of completion. Amazon Textract processing works asynchronously, as we decouple our architecture using Amazon SQS. To configure Amazon SNS to send data to Amazon SQS:

Create an SNS topic. ‘AmazonTextract-SNS’ is the SNS topic that we created for this demo.
Then create an SQS queue. ‘AmazonTextract-SQS’ is the queue that we created for this demo.
To receive messages published to a topic, you must subscribe an endpoint to the topic. When you subscribe an endpoint to a topic, the endpoint begins to receive messages published to the associated topic. Figure 3 shows the SNS topic ‘AmazonTextract-SNS’ subscribed to Amazon SQS queue.

Figure 3. Amazon SNS configuration

Figure 4. Amazon SQS configuration

6. Configure SQS queue to initiate the AWS Lambda function Fn-B. This should happen upon receiving extracted data via SNS topic. Refer to this SQS tutorial to learn about SQS Lambda configuration. See Sample code for Lambda FunctionB.

7. AWS Lambda function Fn-B invokes Amazon Comprehend for the custom entity recognition.

Figure 5. Lambda FunctionB configuration in Amazon Comprehend

Configure Amazon Comprehend to create a custom entity recognition (text-job2) for the entities. These can be API Number, Lease_Number, Water_Depth, Well_Number, and can use the model created in previous step (well_no, well#, well num). For instructions on labeling your data, see Developing NER models with Amazon SageMaker Ground Truth and Amazon Comprehend.

Figure 6. Comprehend job

Now create an endpoint for the custom entity recognition for the Lambda function, to send the data to Amazon Comprehend service, as shown in Figure 7 and 8.

Figure 7. Comprehend endpoint creation

Copy the Amazon Comprehend endpoint ARN to include it in the Lambda function as an environment variable (see Figure 5).

Figure 8. Comprehend endpoint created successfully

8. Launch an Amazon OpenSearch domain. See Creating and managing Amazon OpenSearch Service domains. The data is indexed and populated into Amazon OpenSearch. The Amazon OpenSearch domain name is configured at Lambda FnB as an environment variable to push the extracted data to OpenSearch.

9. Kibana processes the indexed data from Amazon OpenSearch. Amazon OpenSearch data is populated on Kibana, shown in Figure 9.

Figure 9. Kibana dashboard showing Amazon OpenSearch data

10. Access Kibana for document search. The selected fields can be viewed as a table using filters, see Figure 10.

Figure 10. Kibana dashboard table view for selected fields

You can search the LEASE_NUMBER = OCS-031, as shown in Figure 11.

Figure 11. Kibana dashboard search on Lease Number

OR you can search all the information for the WATER_DEPTH = 60, see Figure 12.

Figure 12. Kibana dashboard search on Water Depth

Cleanup

Shut down OpenSearch domain
Delete the Comprehend endpoint
Clear objects from S3 bucket

Conclusion

Data is growing at an enormous pace in all industries. As we have shown, you can build an ML-based text extraction solution to uncover the unstructured data from PDFs or images. You can derive intelligence from diverse data sources by incorporating a data extraction and optimization function. You can gain insights into the undiscovered data, by leveraging managed ML services, Amazon Textract, and Amazon Comprehend.

The extracted data from PDFs or images is indexed and populated into Amazon OpenSearch. You can use Kibana to search and visualize the data. By implementing this solution, customers can reduce the costs of physical document storage, in addition to labor costs for manually identifying relevant information.

This solution will drive decision-making efficiency. We discussed the oil and gas industry vertical as an example for this blog. But this solution can be applied to any industry that has physical/scanned documents such as legal documents, purchase receipts, inventory reports, invoices, and purchase orders.

For further reading:

17 additional AWS services authorized for DoD workloads in the AWS GovCloud Regions

2021-09-09 Tyler Harding

Post Syndicated from Tyler Harding original https://aws.amazon.com/blogs/security/17-additional-aws-services-authorized-for-dod-workloads-in-the-aws-govcloud-regions/

I’m pleased to announce that the Defense Information Systems Agency (DISA) has authorized 17 additional Amazon Web Services (AWS) services and features in the AWS GovCloud (US) Regions, bringing the total to 105 services and major features that are authorized for use by the U.S. Department of Defense (DoD). AWS now offers additional services to DoD mission owners in these categories: business applications; computing; containers; cost management; developer tools; management and governance; media services; security, identity, and compliance; and storage.

Why does authorization matter?

DISA authorization of 17 new cloud services enables mission owners to build secure innovative solutions to include systems that process unclassified national security data (for example, Impact Level 5). DISA’s authorization demonstrates that AWS effectively implemented more than 421 security controls by using applicable criteria from NIST SP 800-53 Revision 4, the US General Services Administration’s FedRAMP High baseline, and the DoD Cloud Computing Security Requirements Guide.

Recently authorized AWS services at DoD Impact Levels (IL) 4 and 5 include the following:

Business Applications

Amazon Simple Email Service (Amazon SES) – Inbound and outbound cloud email
Amazon Pinpoint – Multichannel marketing communication
AWS Marketplace – A digital catalog with thousands of software listings from independent software vendors that you can use to find, test, buy, and deploy software that runs on AWS

Compute

AWS Fargate (a feature of Amazon Elastic Container Service (Amazon ECS) and Amazon Elastic Kubernetes Service (Amazon EKS)) – A serverless compute engine for containers

Containers

Amazon Elastic Kubernetes Service (Amazon EKS) – A trusted way to run Kubernetes

Cost Management

AWS Budgets – Set custom budgets to track your cost and usage, from the simplest to the most complex use cases
AWS Cost Explorer – An interface that lets you visualize, understand, and manage your AWS costs and usage over time
AWS Cost & Usage Report – Itemize usage at the account or organization level by product code, usage type, and operation

Developer Tools

AWS CodePipeline – Automate continuous delivery pipelines for fast and reliable updates
AWS X-Ray – Analyze and debug production and distributed applications, such as those built using a microservices architecture

Management & Governance

AWS License Manager – Manage your software licenses from vendors
AWS Personal Health Dashboard – Provide alerts and guidance for AWS events that might affect your environment
AWS Systems Manager – An operations hub for AWS

Media Services

Amazon Textract – Extract printed text, handwriting, and data from virtually any document

Security, Identity & Compliance

Amazon Cognito – Secure user sign-up, sign-in, and access control
AWS Security Hub – Centrally view and manage security alerts and automate security checks

Storage

AWS Backup – Centrally manage and automate backups across AWS services

Figure 1 shows the IL 4 and IL 5 AWS services that are now authorized for DoD workloads, broken out into functional categories.

Figure 1: The AWS services newly authorized by DISA

To learn more about AWS solutions for the DoD, see our AWS solution offerings. Follow the AWS Security Blog for updates on our Services in Scope by Compliance Program. If you have feedback about this blog post, let us know in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Amazon Textract Updates: Up to 32% Price Reduction in 8 AWS Regions and Up to 50% Reduction in Asynchronous Job Processing Times

2021-08-26 Channy Yun

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/amazon-textract-updates-up-to-32-price-reduction-in-8-aws-regions-and-up-to-50-reduction-in-asynchronous-job-processing-times/

Introduced at AWS re:Invent 2018, Amazon Textract is a machine learning service that automatically extracts text, handwriting and data from scanned documents that goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables.

In the past few months, we introduced specialized support for processing invoices and receipts and enhanced the quality of the underlying computer vision models that power extraction of handwritten text, forms, and tables with printed text support for English, Spanish, German, Italian, Portuguese, and French.

Third-party auditors assess the security and compliance of Amazon Textract as part of multiple AWS compliance programs. We also added IRAP compliance support and achieved US FedRAMP authorization to add to the existing list such as HIPAA, PCI DSS, ISO SCO, and MTCS.

Customers use Amazon Textract to automate critical business process workflows (for example, in claims and tax form processing, loan applications, and accounts payable). It can reduce human review time, improve accuracy, lower costs, and accelerate the pace of innovation on a global scale. At the same time, Textract customers told us that we could be doing even more to reduce costs and improve latency.

Today we are excited to announce two major updates to Amazon Textract:

Up to 32 percent price reduction in 8 AWS Regions to help global customers save even more with Textract.
Up to 50 percent reduction in end-to-end job processing times for Textract’s asynchronous operations worldwide.

Up to 32% price reduction in 8 AWS Regions
We are pleased to announce an up to 32 percent price reduction in eight AWS Regions: Asia Pacific (Mumbai), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Canada (Central), Europe (Frankfurt), Europe (London), and Europe (Paris).

The API pricing for DetectDocumentText (OCR) and AnalyzeDocument (both forms and tables) in these AWS Regions is now the same as the US East (N. Virginia) Region pricing. Customers in those identified Regions will see a 9-32 percent reduction in API pricing.

Before the price reduction, a customer’s usage of the DetectDocumentText and AnalyzeDocument APIs would have been billed at different rates, by Region, for their usage tier. That customer will now be billed at the same rate, no matter from which AWS commercial Region Textract is being called.

AWS Regions	DetectDocumentText API			AnalyzeDocument API (forms + tables)
AWS Regions	Old	New	Reduction	Old	New	Reduction
Asia Pacific (Mumbai)	$1.830	$1.50	18%	$79.30	$65.0	18%
Asia Pacific (Seoul)	$1.845		19%	$79.95		19%
Asia Pacific (Singapore)	$2.200		32%	$95.00		32%
Asia Pacific (Sydney)	$1.950		23%	$84.50		23%
Canada (Central)	$1.655		9%	$72.15		10%
Europe (Frankfurt)	$1.875		20%	$81.25		20%
Europe (London)	$1.750		14%	$75.00		13%
Europe (Paris)	$1.755		15%	$76.05		15%

This table shows two examples of effective price per 1,000 pages for processing the first 1 million monthly pages before and after this price reduction. Customers with usage above the 1 million monthly pages tier will also see a similar reduction in prices, the details of which can be found on the Amazon Textract pricing page.

The new pricing goes into effect on September 1, 2021. It will be applied to your bill automatically. This pricing change does not apply to the Europe (Ireland), US-based commercial Regions, and US GovCloud Regions. There is no change to the pricing for the recently launched AnalyzeExpense API for invoices and receipts.

As part of the AWS Free Tier, you can get started with Amazon Textract for free. The Free Tier lasts 3 months and new AWS customers can analyze up to 1,000 pages per month using the Detect Document Text API and up to 100 pages per month using the Analyze Document API or Analyze Expense API.

Up to 50% reduction in end-to-end job processing times
Customers can invoke Textract synchronously (on single-page documents) and asynchronously (on multi-page documents) for detecting printed and handwritten lines and words (via the DetectDocumentText API) as well as for forms and tables extraction (via the AnalyzeDocument API). We see that the vast majority of customers invoke Textract asynchronously today for at-scale processing of their document pipeline.

Based on customer feedback, we have made a number of enhancements to Textract’s asynchronous API operations that reduce the end-to-end latency by as much as 50 percent. Specifically, these updates reduce the end-to-end job processing times experienced by Textract customers on worldwide asynchronous operations by as much as 50 percent. The lower the processing time, the faster customers are able to process their documents, achieve scale and improve their overall productivity.

To learn more about Amazon Textract, see this tutorial for extracting text and structured data from a document, this code sample on GitHub, Amazon Textract documentation, and blog posts about Amazon Textract on the AWS Machine Learning Blog.

— Channy

Automate Document Processing in Logistics using AI

2021-08-19 Manikanth Pasumarti

Post Syndicated from Manikanth Pasumarti original https://aws.amazon.com/blogs/architecture/automate-document-processing-in-logistics-using-ai/

Multi-modal transportation is one of the biggest developments in the logistics industry. There has been a successful collaboration across different transportation partners in supply chain freight forwarding for many decades. But there’s still a considerable overhead of paperwork processing for each leg of the trip. Tens of billions of documents are processed in ocean freight forwarding alone. Using manual labor to process these documents (purchase orders, invoices, bills of lading, delivery receipts, and more) is both expensive and error-prone.

In this blog post, we’ll address how to automate the document processing in the logistics industry. We’ll also show you how to integrate it with a centralized workflow management.

Automated document processing architecture

Figure 1. Architecture of document processing workflow

The solution workflow shown in Figure 1 is as follows:

Documents that belong to the same transaction are collected in an S3 bucket
The document processing workflow is initiated
The workflow orchestration is as follows:
- Document is processed via automation
- Relevant entities are extracted
- Extracted data is reviewed
- Order data is consolidated

This architecture uses Amazon Simple Storage Service (S3) for document storage, and Amazon Simple Queue Service (SQS) for workflow initiation. Amazon Textract is used for text extraction, Amazon Comprehend for entity extraction, and Amazon Augmented AI (A2I) for human review. This will ensure correct results in cases of low confidence predictions.

We use AWS Step Functions for the orchestration of document processing workflow. Step functions also help to improve the application resiliency with less code.

AWS Lambda functions are used to:

Detect if all required documents for a given transaction are available in Amazon S3
Kick off the process by creating an Amazon SQS message
Detect a new processing job from a generated SQS message
Extract text from PDFs using a Step Function
Extract entities from generated text using a Step Function
Control data completeness and accuracy
Initiate a human loop when needed using a Step Function
Consolidate the data collected from documents
Store the data into the database

Document ingestion and classification

There are several data ingestion options available such as AWS Transfer Family, AWS DataSync, and Amazon Kinesis Data Firehose. Choose the appropriate ingestion blueprints based on the type of data sources. Typical real-time ingestion blueprints include AWS Lambda processing and an Amazon CloudWatch event. The batch pipeline can leverage AWS Step Functions. This can be used to orchestrate the Lambda function that initiates the document processing workflow.

Here are some things to consider when building your document ingestion and storage solution:

Choose your bucket strategy. Amazon S3 is an object store. Analyze your data pipeline ingestion carefully and choose the correct S3 bucket strategy for each document type (bills, supplier invoices, and others.)
Organize your data. The data is organized in S3 buckets by layers: Raw, Staging, and Processed. Each has their own respective bucket policy and access control.
Build a creation tool. This is an automated data lake bucket/folder structure tool, based on your data ingestion requirements. You can use this same structure for user-created data.
Define data security requirements. Do this before you begin the ingestion process. Before ingesting new or current data sources into AWS, secure access to the data.
Review security credentials needed for access. After copying these credentials into AWS Systems Manager (SSM), apply an AWS Key Management Service (KMS) key to encrypt the file. This encrypted key string is stored in SSM to use for authentication.

Document processing workflow

Overview

The workflow checks the input buckets until it detects all the documents types necessary for a complete dataset. In our case, it is the invoice document and customs authorization form. Once both are detected, it generates a job request as a message in Amazon SQS. A Lambda function then processes the message and kicks off the Step Function flow (see Figure 2). The state machine then initiates the document processing, text extraction, and optional human review steps. AWS Step Functions are well suited for our use case due to its ability to manage long-running workflows.

Figure 2. Visual workflow of document processing in AWS Step Functions

Entity extraction

For each document, entities are extracted using Amazon Textract and Amazon Comprehend. These entities can include date, company, address, bill of materials, total cost, and invoice number.

Following is a sample invoice document that is fed to Amazon Textract, which extracts the form data and creates key-value pairs.

Figure 3. Highlighted different entities in the sample invoice document

See Figure 4 for an example of the key-value pairs extracted for the sample invoice. The keys here represent the form labels (“SHIP TO”) and the values represent form values (shipping address).

Figure 4. Key-value pairs of the invoice data, extracted by Amazon Textract

Amazon Textract also generates a raw text output that contains the entire text, as shown in Figure 5 following.

Figure 5. Raw text output of the invoice data extracted by Amazon Textract

To achieve a higher degree of confidence, Amazon Comprehend is used to identify and extract the custom entities. Amazon Comprehend is a natural language processing (NLP) service that uses machine learning (ML) to identify and extracts insights and entities from text data. You can train Amazon Comprehend to identify entities relevant to your organization. These can be product names, part numbers, department names, or other entities. You can also train Amazon Comprehend to categorize documents or assign relevant labels to text.

An Amazon Comprehend entity recognizer comes with a set of pre-built entity types. Amazon Comprehend can introduce custom entities to match our specific business needs. Some of the entities we want to identify are address and company name. We trained a custom recognizer to detect company names and addresses, see Figure 6.

Figure 6. Training details of custom entity recognizer

Figure 7 shows the resulting output from Amazon Comprehend:

Figure 7. Amazon Comprehend entity recognition output

The document is processed top-down, from left to right, from the sample invoice in Figure 3. We know that the first company and first address belongs to the Billing Company. And the second set belongs to the Shipment recipient. Along with detecting custom entities, Amazon Comprehend also outputs the confidence score of the extracted result.

Confidence scores can vary depending on how close training data is to actual data. In the example preceding, the first company entity came back with a score of 0.941. Let’s assume that we have set a minimum confidence score of 0.95. Anything below that threshold should be reviewed by a human. The following section describes the last step of our workflow.

Human review

Amazon Augmented AI (A2I) allows you to create and manage human loops. A human loop is a manual review task that gets assigned to a workforce. The workforce can be public, such as Mechanical Turk, or private, such as internal team or a paid contractor. In our example, we created a private workforce to review the entities we were not confident about. Figure 8 shows an example of the user interface that the reviewers use to assign entities to the proper text sections.

Figure 8. Manual review interface of Amazon A2I

Review tasks can be automatically submitted to the workforce based on dynamic criteria, after both AI-related steps are completed. It can be used to review the text detected by Amazon Textract when key data elements are missing (such as order amount or quantity). It can also review entities after invoking Amazon Comprehend.

Figure 9. Consolidated dataset of processed invoice and customs authorization data

After the manual review step, data can be consolidated (as shown in Figure 9) and stored into a relational database. It can also be shared with other business units such as Accounting or Customer Services. You can apply the same process to other document types such as custom forms, which are linked to the same transaction. This allows us to process and combine information that comes from disparate paper sources more efficiently.

Conclusion

This post demonstrates how document processing can be automated to process business documentation by using Amazon Textract, Amazon Comprehend and Amazon Augmented AI.

Deploying an automated solution in the logistics industry takes away the undifferentiated heavy lifting involved in manual document processing. This helps to cut down the delivery delays and track any missed deliveries. By providing a comprehensive view of the shipment, it increases the efficiency of back-office processing. It can also further simplify the data collection for audit purposes.

To learn more:

CohnReznick Automates Claim Validation Workflow Using AWS AI Services

2021-07-23 Rajeswari Malladi

Post Syndicated from Rajeswari Malladi original https://aws.amazon.com/blogs/architecture/cohnreznick-automates-claim-validation-workflow-using-aws-ai-services/

This post was co-written by Winn Oo and Brendan Byam of CohnReznick and Rajeswari Malladi and Shanthan Kesharaju

CohnReznick is a leading advisory, assurance, and tax firm serving clients around the world. CohnReznick’s government and public sector practice provides claims audit and verification services for state agencies. This process begins with recipients submitting documentation as proof of their claim expenses. The supporting documentation often contains hundreds of filled-out or scanned (sometimes handwritten) PDFs, MS Word files, Excel spreadsheets, and/or pictures, along with a summary form outlining each of the claimed expenses.

Prior to automation with AWS artificial intelligence (AI) services, CohnReznick’s data extraction and validation process was performed manually. Audit professionals had to extract each data point from the submitted documentation, select a population sample for testing, and manually search the documentation for any pages or page sections that validated the information submitted. Validated data points and proof of evidence pages were then packaged into a single document and submitted for claim expense reimbursement.

In this blog post, we’ll show you how CohnReznick implemented Amazon Textract, Amazon Comprehend (with a custom machine learning classification model), and Amazon Augmented AI (Amazon A2I). With this solution, CohnReznick automated nearly 40% of the total claim verification process with focus on data extraction and package creation. This resulted in an estimated cost savings of $500k per year for each project and process.

Automating document processing workflow

Figure 1 shows the newly automated process. Submitted documentation is processed by Amazon Textract, which extracts text from the documents. This text is then submitted to Amazon Comprehend, which employs a custom classification model to classify the documents as claim summaries or evidence documents. All data points are collected from the Amazon Textract output of the claim summary documents. These data points are then validated against the evidence documents.

Finally, a population sample of the extracted data points is selected for testing. Rather than auditors manually searching for specific information in the documentation, the automated process conducts the data search, extracts the validated evidence pages from submitted documentation, and generates the audited package, which can then be submitted for reimbursement.

Figure 1. Architecture diagram

Components in the solution

At a high level, the end-to-end process starts with auditors using a proprietary web application to submit the documentation received for each case to the document processing workflow. The workflow includes three stages, as described in the following sections.

Text extraction

First, the process extracts the text from the submitted documents using the following steps:

For each case, the CohnReznick proprietary web application uploads the documents to the Amazon Simple Storage Service (Amazon S3) upload bucket. Each file has a unique name, and the files have metadata that associates them with the parent case.
The uploaded documents Amazon Simple Queue Service (Amazon SQS) queue is configured to receive notifications for all new objects added to the upload bucket. For every new document added to the upload bucket, Amazon S3 sends a notification to the uploaded documents queue.
The text extraction AWS Lambda function runs every 5 minutes to poll the uploaded documents queue for new messages.
For each message in the uploaded documents queue, the text extraction function submits an Amazon Textract job to process the document asynchronously. This continues until it reaches a predefined maximum allowed limit of concurrent jobs for that AWS account. Concurrency control is implemented by handling LimitExceededException on StartDocumentAnalysis API call.
After Amazon Textract finishes processing a document, it sends a completion notification to a completed jobs Amazon Simple Notification Service (Amazon SNS) topic.
A process job results Lambda function is subscribed to the completed jobs topic and receives a notification for every completed message sent to the completed jobs topic.
The process job results function then fetches document extraction results from Amazon Textract.
The process job results function stores the document extraction results in the Amazon Textract output bucket.

Documents classification

Next, the process classifies the documents. The submitted claim documents can consist of up to seven supporting document types. The documents need to be classified into the respective categories. They are primarily classified using automation. Any documents classified with a low confidence score are sent to a human review workflow.

Classification model creation

The custom classification feature of Amazon Comprehend is used to build a custom model to classify documents into the seven different document types as required by the business process. The model is trained by providing sample data in CSV format. Amazon Comprehend uses multiple algorithms in the training process and picks the model that delivers the highest accuracy for the training data.

Classification model invocation and processing

The automated document classification uses the trained model and the classification consists of the following steps:

The business logic in the process job results Lambda function determines text extraction completion for all documents for each case. It then calls the StartDocumentClassificationJob operation on the custom classifier model to start classifying unlabeled documents.
The document classification results from the custom classifier are returned as a single output.tar.gz file in the comprehend results S3 bucket.
At this point, the check confidence scores Lambda function is invoked, which processes the classification results.
The check confidence scores function reviews the confidence scores of classified documents. The results for documents with high confidence scores are saved to the classification results table in Amazon DynamoDB.

Human review

The documents from the automated classification that have low confidence scores are classified using human review with the following steps:

The check confidence scores Lambda function invokes human review with Amazon Augmented AI for documents with low confidence scores. Amazon A2I is a ready-to-use workflow service for human review of machine learning predictions.
The check confidence scores Lambda function creates human review tasks for each document with a low confidence score. Humans assigned to the classification jobs log into the human review portal and either approve the classification done by the model or reclassify the text with the right labels.
The results from human review are placed in the A2I results bucket.
The update results Lambda function is invoked to process results from the human review.
Finally, the update results function writes the human review document classification results to the classification results table in DynamoDB.

Additional processes

Documents workflow status capturing

The Lambda functions throughout the workflow update the status of their processing and document/case details in the workflow status table in DynamoDB. The auditor that submitted the case documents will know the status of the workflow of their submitted case using the data in workflow status table.

Search and package creation

When the processing is complete for a case, auditors perform the final review and submit the generated packet for downstream processing.

The web application uses AWS SDK for Java to integrate with the Textract output S3 bucket that has the document extraction results and classification results table in DynamoDB with classification results. This data is used for the search and package creation process.

Purge data process

After the package creation is complete, the auditor can purge all data in the workflow.

Using the AWS SDK, the data is purged from the S3 buckets and DynamoDB tables.

Conclusion

As seen in this blog post, Amazon Textract, Amazon Comprehend, and Amazon A2I for human review work together with Amazon S3, DynamoDB, and Lambda services. These services have helped CohnReznick automate nearly 40% of their total claim verification process with focus on data extraction and package creation.

You can achieve similar efficiencies and increase scalability by automating your business processes. Get started today by reading additional user stories and using the resources on automated document processing.

Getting started with RPA using AWS Step Functions and Amazon Textract

2020-11-16 James Beswick

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/getting-started-with-rpa-using-aws-step-functions-and-amazon-textract/

This post is courtesy of Joe Tringali, Solutions Architect.

Many organizations are using robotic process automation (RPA) to automate workflow, back-office processes that are labor-intensive. RPA, as software bots, can often handle many of these activities. Often RPA workflows contain repetitive manual tasks that must be done by humans, such as viewing invoices to find payment details.

AWS Step Functions is a serverless function orchestrator and workflow automation tool. Amazon Textract is a fully managed machine learning service that automatically extracts text and data from scanned documents. Combining these services, you can create an RPA bot to automate the workflow and enable employees to handle more complex tasks.

In this post, I show how you can use Step Functions and Amazon Textract to build a workflow that enables the processing of invoices. Download the code for this solution from https://github.com/aws-samples/aws-step-functions-rpa.

Overview

The following serverless architecture can process scanned invoices in PDF or image formats for submitting payment information to a database.

To implement this architecture, I use single-purpose Lambda functions and Step Functions to build the workflow:

Invoices are scanned and loaded into an Amazon Simple Storage Service (S3) bucket.
The loading of an invoice into Amazon S3 triggers an AWS Lambda function to be invoked.
The Lambda function starts an asynchronous Amazon Textract job to analyze the text and data of the scanned invoice.
The Amazon Textract job publishes a completion notification message with a status of “SUCCEEDED” or “FAILED” to an Amazon Simple Notification Service (SNS) topic.
SNS sends the message to an Amazon Simple Queue Service (SQS) queue that is subscribed to the SNS topic.
The message in the SQS queue triggers another Lambda function.
The Lambda function initiates a Step Functions state machine to process the results of the Amazon Textract job.
For an Amazon Textract job that completes successfully, a Lambda function saves the document analysis into an Amazon S3 bucket.
The loading of the document analysis to Amazon S3 triggers another Lambda function.
The Lambda function retrieves the text and data of the scanned invoice to find the payment information. It writes an item to an Amazon DynamoDB table with a status indicating if the invoice can be processed.
If the DynamoDB item contains the payment information, another Lambda function is invoked.
The Lambda function archives the processed invoice into another S3 bucket.
If the DynamoDB item does not contain the payment information, a message is published to an Amazon SNS topic requesting that the invoice be reviewed.

Amazon Textract can extract information from the various invoice images and associate labels with the data. You must then handle the various labels that different invoices may associate with the payee name, due date, and payment amount.

Determining payee name, due date and payment amount

After the document analysis has been saved to S3, a Lambda function retrieves the text and data of the scanned invoice to find the information needed for payment. However, invoices can use a variety of labels for the same piece of data, such a payment’s due date.

In the example invoices included with this blog, the payment’s due date is associated with the labels “Pay On or Before”, “Payment Due Date” and “Payment Due”. Payment amounts can also have different labels, such as “Total Due”, “New Balance Total”, “Total Current Charges”, and “Please Pay”. To address this, I use a series of helper functions in the app.py file in the process_document_analysis folder of the GitHub repo.

In app.py, there is the following get_ky_map helper function:

def get_kv_map(blocks):
    key_map = {}
    value_map = {}
    block_map = {}
    for block in blocks:
        block_id = block['Id']
        block_map[block_id] = block
        if block['BlockType'] == "KEY_VALUE_SET":
            if 'KEY' in block['EntityTypes']:
                key_map[block_id] = block
            else:
                value_map[block_id] = block
    return key_map, value_map, block_map

The get_kv_map function is invoked by the Lambda function handler. It iterates over the “Blocks” element of the document analysis produced by Amazon Textract to create dictionaries of keys (labels) and values (data) associated with each block identified by Amazon Textract. It then invokes the following get_kv_relationship helper function:

def get_kv_relationship(key_map, value_map, block_map):
    kvs = {}
    for block_id, key_block in key_map.items():
        value_block = find_value_block(key_block, value_map)
        key = get_text(key_block, block_map)
        val = get_text(value_block, block_map)
        kvs[key] = val
    return kvs

The get_kv_relationship function merges the key and value dictionaries produced by the get_kv_map function to create a single Python key value dictionary where labels are the keys to the dictionary and the invoice’s data are the values. The handler then invokes the following get_line_list helper function:

def get_line_list(blocks):
    line_list = []
    for block in blocks:
        if block['BlockType'] == "LINE":
            if 'Text' in block: 
                line_list.append(block["Text"])
    return line_list

Extracting payee names is more complex because the data may not be labeled. The payee may often differ from the entity sending the invoice. With the Amazon Textract analysis in a format more easily consumable by Python, I use the following get_payee_name helper function to parse and extract the payee:

def get_payee_name(lines):
    payee_name = ""
    payable_to = "payable to"
    payee_lines = [line for line in lines if payable_to in line.lower()]
    if len(payee_lines) > 0:
        payee_line = payee_lines[0]
        payee_line = payee_line.strip()
        pos = payee_line.lower().find(payable_to)
        if pos > -1:
            payee_line = payee_line[pos + len(payable_to):]
            if payee_line[0:1] == ':':
                payee_line = payee_line[1:]
            payee_name = payee_line.strip()
    return payee_name

The get_amount helper function searches the key value dictionary produced by the get_kv_relationship function to retrieve the payment amount:

def get_amount(kvs, lines):
    amount = None
    amounts = [search_value(kvs, amount_tag) for amount_tag in amount_tags if search_value(kvs, amount_tag) is not None]
    if len(amounts) > 0:
        amount = amounts[0]
    else:
        for idx, line in enumerate(lines):
            if line.lower() in amount_tags:
                amount = lines[idx + 1]
                break
    if amount is not None:
        amount = amount.strip()
        if amount[0:1] == '$':
            amount = amount[1:]
    return amount

The amount_tags variable contains a list of possible labels associated with the payment amount:

amount_tags = ["total due", "new balance total", "total current charges", "please pay"]

Similarly, the get_due_date helper function searches the key value dictionary produced by the get_kv_relationship function to retrieve the payment due date:

def get_due_date(kvs):
    due_date = None
    due_dates = [search_value(kvs, due_date_tag) for due_date_tag in due_date_tags if search_value(kvs, due_date_tag) is not None]
    if len(due_dates) > 0:
        due_date = due_dates[0]
    if due_date is not None:
        date_parts = due_date.split('/')
        if len(date_parts) == 3:
            due_date = datetime(int(date_parts[2]), int(date_parts[0]), int(date_parts[1])).isoformat()
        else:
            date_parts = [date_part for date_part in re.split("\s+|,", due_date) if len(date_part) > 0]
            if len(date_parts) == 3:
                datetime_object = datetime.strptime(date_parts[0], "%b")
                month_number = datetime_object.month
                due_date = datetime(int(date_parts[2]), int(month_number), int(date_parts[1])).isoformat()
    else:
        due_date = datetime.now().isoformat()
    return due_date

The due_date_tag contains a list of possible labels associated with the payment due:

due_date_tags = ["pay on or before", "payment due date", "payment due"]

If all required elements needed to issue a payment are found, it adds an item to the DynamoDB table with a status attribute of “Approved for Payment”. If the Lambda function cannot determine the value of one or more required elements, it adds an item to the DynamoDB table with a status attribute of “Pending Review”.

Payment Processing

If the item in the DynamoDB table is marked “Approved for Payment”, the processed invoice is archived. If the item’s status attribute is marked “Pending Review”, an SNS message is published to an SNS Pending Review topic. You can subscribe to this topic so that you can add additional labels to the Python code for determining payment due dates and payment amounts.

Note that the Lambda functions are single-purpose functions, and all workflow logic is contained in the Step Functions state machine. This diagram shows the various tasks (states) of a successful workflow.

For more information about this solution, download the code from the GitHub repo (https://github.com/aws-samples/aws-step-functions-rpa).

Prerequisites

Before deploying the solution, you must install the following prerequisites:

Python.
AWS Command Line Interface (AWS CLI) – for instructions, see Installing the AWS CLI.
AWS Serverless Application Model Command Line Interface (AWS SAM CLI) – for instructions, see Installing the AWS SAM CLI.

Deploying the solution

The solution creates the following S3 buckets with names suffixed by your AWS account ID to prevent a global namespace collision of your S3 bucket names:

scanned-invoices-<YOUR AWS ACCOUNT ID>
invoice-analyses-<YOUR AWS ACCOUNT ID>
processed-invoices-<YOUR AWS ACCOUNT ID>

The following steps deploy the example solution in your AWS account. The solution deploys several components including a Step Functions state machine, Lambda functions, S3 buckets, a DynamoDB table for payment information, and SNS topics.

AWS CloudFormation requires an S3 bucket and stack name for deploying the solution. To deploy:

Download code from GitHub repo (https://github.com/aws-samples/aws-step-functions-rpa).
Run the following command to build the artifacts locally on your workstation:sam build
Run the following command to create a CloudFormation stack and deploy your resources:sam deploy --guided --capabilities CAPABILITY_NAMED_IAM

Monitor the progress and wait for the completion of the stack creation process from the AWS CloudFormation console before proceeding.

Testing the solution

To test the solution, upload the PDF test invoices to the S3 bucket named scanned-invoices-<YOUR AWS ACCOUNT ID>.

A Step Functions state machine with the name <YOUR STACK NAME>-ProcessedScannedInvoiceWorkflow runs the workflow. Amazon Textract document analyses are stored in the S3 bucket named invoice-analyses-<YOUR AWS ACCOUNT ID>, and processed invoices are stored in the S3 bucket named processed-invoices-<YOUR AWS ACCOUNT ID>. Processed payments are found in the DynamoDB table named <YOUR STACK NAME>-invoices.

You can monitor the status of the workflows from the Step Functions console. Upon completion of the workflow executions, review the items added to DynamoDB from the Amazon DynamoDB console.

Cleanup

To avoid ongoing charges for any resources you created in this blog post, delete the stack:

Empty the three S3 buckets created during deployment using the S3 console:
– scanned-invoices-<YOUR AWS ACCOUNT ID>
– invoice-analyses-<YOUR AWS ACCOUNT ID>
– processed-invoices-<YOUR AWS ACCOUNT ID>
Delete the CloudFormation stack created during deployment using the CloudFormation console.

Conclusion

In this post, I showed you how to use a Step Functions state machine and Amazon Textract to automatically extract data from a scanned invoice. This eliminates the need for a person to perform the manual step of reviewing an invoice to find payment information to be fed into a backend system. By replacing the manual steps of a workflow with automation, an organization can free up their human workforce to handle more value-added tasks.

To learn more, visit AWS Step Functions and Amazon Textract for more information. For more serverless learning resources, visit https://serverlessland.com.

Building a serverless document scanner using Amazon Textract and AWS Amplify

2020-09-03 Moheeb Zara

Post Syndicated from Moheeb Zara original https://aws.amazon.com/blogs/compute/building-a-serverless-document-scanner-using-amazon-textract-and-aws-amplify/

This guide demonstrates creating and deploying a production ready document scanning application. It allows users to manage projects, upload images, and generate a PDF from detected text. The sample can be used as a template for building expense tracking applications, handling forms and legal documents, or for digitizing books and notes.

The frontend application is written in Vue.js and uses the Amplify Framework. The backend is built using AWS serverless technologies and consists of an Amazon API Gateway REST API that invokes AWS Lambda functions. Amazon Textract is used to analyze text from uploaded images to an Amazon S3 bucket. Detected text is stored in Amazon DynamoDB.

An architectural diagram of the application.

Prerequisites

You need the following to complete the project:

Node.js and npm installed on a computer.
An AWS account. This project can be completed using the AWS Free Tier.

Deploy the application

The solution consists of two parts, the frontend application and the serverless backend. The Amplify CLI deploys all the Amazon Cognito authentication, and hosting resources for the frontend. The backend requires the Amazon Cognito user pool identifier to configure an authorizer on the API. This enables an authorization workflow, as shown in the following image.

A diagram showing how an Amazon Cognito authorization workflow works

First, configure the frontend. Complete the following steps using a terminal running on a computer or by using the AWS Cloud9 IDE. If using AWS Cloud9, create an instance using the default options.

From the terminal:

Install the Amplify CLI by running this command.
```
npm install -g @aws-amplify/cli
```
Configure the Amplify CLI using this command. Follow the guided process to completion.
```
amplify configure
```

Clone the project from GitHub.

git clone https://github.com/aws-samples/aws-serverless-document-scanner.git

Navigate to the amplify-frontend directory and initialize the project using the Amplify CLI command. Follow the guided process to completion.
```
cd aws-serverless-document-scanner/amplify-frontend

amplify init
```
Deploy all the frontend resources to the AWS Cloud using the Amplify CLI command.
```
amplify push
```
After the resources have finishing deploying, make note of the StackName and UserPoolId properties in the amplify-frontend/amplify/backend/amplify-meta.json file. These are required when deploying the serverless backend.

Next, deploy the serverless backend. While it can be deployed using the AWS SAM CLI, you can also deploy from the AWS Management Console:

Navigate to the document-scanner application in the AWS Serverless Application Repository.
In Application settings, name the application and provide the StackName and UserPoolId from the frontend application for the UserPoolID and AmplifyStackName parameters. Provide a unique name for the BucketName parameter.
Choose Deploy.
Once complete, copy the API endpoint so that it can be configured on the frontend application in the next section.

Configure and run the frontend application

Create a file, amplify-frontend/src/api-config.js, in the frontend application with the following content. Include the API endpoint and the unique BucketName from the previous step. The s3_region value must be the same as the Region where your serverless backend is deployed.
```
const apiConfig = {
	"endpoint": "<API ENDPOINT>",
	"s3_bucket_name": "<BucketName>",
	"s3_region": "<Bucket Region>"
};

export default apiConfig;
```
In a terminal, navigate to the root directory of the frontend application and run it locally for testing.
```
cd aws-serverless-document-scanner/amplify-frontend

npm install

npm run serve
```
You should see an output like this:
To publish the frontend application to cloud hosting, run the following command.
```
amplify publish
```
Once complete, a URL to the hosted application is provided.

Using the frontend application

Once the application is running locally or hosted in the cloud, navigating to it presents a user login interface with an option to register. The registration flow requires a code sent to the provided email for verification. Once verified you’re presented with the main application interface.

Once you create a project and choose it from the list, you are presented with an interface for uploading images by page number.

On mobile, it uses the device camera to capture images. On desktop, images are provided by the file system. You can replace an image and the page selector also lets you go back and change an image. The corresponding analyzed text is updated in DynamoDB as well.

Each time you upload an image, the page is incremented. Choosing “Generate PDF” calls the endpoint for the GeneratePDF Lambda function and returns a PDF in base64 format. The download begins automatically.

You can also open the PDF in another window, if viewing a preview in a desktop browser.

Understanding the serverless backend

An architecture diagram of the serverless backend.

In the GitHub project, the folder serverless-backend/ contains the AWS SAM template file and the Lambda functions. It creates an API Gateway endpoint, six Lambda functions, an S3 bucket, and two DynamoDB tables. The template also defines an Amazon Cognito authorizer for the API using the UserPoolID passed in as a parameter:

Parameters:
  UserPoolID:
    Type: String
    Description: (Required) The user pool ID created by the Amplify frontend.

  AmplifyStackName:
    Type: String
    Description: (Required) The stack name of the Amplify backend deployment. 

  BucketName:
    Type: String
    Default: "ds-userfilebucket"
    Description: (Required) A unique name for the user file bucket. Must be all lowercase.  


Globals:
  Api:
    Cors:
      AllowMethods: "'*'"
      AllowHeaders: "'*'"
      AllowOrigin: "'*'"

Resources:

  DocumentScannerAPI:
    Type: AWS::Serverless::Api
    Properties:
      StageName: Prod
      Auth:
        DefaultAuthorizer: CognitoAuthorizer
        Authorizers:
          CognitoAuthorizer:
            UserPoolArn: !Sub 'arn:aws:cognito-idp:${AWS::Region}:${AWS::AccountId}:userpool/${UserPoolID}'
            Identity:
              Header: Authorization
        AddDefaultAuthorizerToCorsPreflight: False

This only allows authenticated users of the frontend application to make requests with a JWT token containing their user name and email. The backend uses that information to fetch and store data in DynamoDB that corresponds to the user making the request.

Two DynamoDB tables are created. A Project table, which tracks all the project names by user, and a Pages table, which tracks pages by project and user. The DynamoDB tables are created by the AWS SAM template with the partition key and range key defined for each table. These are used by the Lambda functions to query and sort items. See the documentation to learn more about DynamoDB table key schema.

ProjectsTable:
    Type: AWS::DynamoDB::Table
    Properties: 
      AttributeDefinitions: 
        - 
          AttributeName: "username"
          AttributeType: "S"
        - 
          AttributeName: "project_name"
          AttributeType: "S"
      KeySchema: 
        - AttributeName: username
          KeyType: HASH
        - AttributeName: project_name
          KeyType: RANGE
      ProvisionedThroughput: 
        ReadCapacityUnits: "5"
        WriteCapacityUnits: "5"

  PagesTable:
    Type: AWS::DynamoDB::Table
    Properties: 
      AttributeDefinitions: 
        - 
          AttributeName: "project"
          AttributeType: "S"
        - 
          AttributeName: "page"
          AttributeType: "N"
      KeySchema: 
        - AttributeName: project
          KeyType: HASH
        - AttributeName: page
          KeyType: RANGE
      ProvisionedThroughput: 
        ReadCapacityUnits: "5"
        WriteCapacityUnits: "5"

When an API Gateway endpoint is called, it passes the user credentials in the request context to a Lambda function. This is used by the CreateProject Lambda function, which also receives a project name in the request body, to create an item in the Project Table and associate it with a user.

The endpoint for the FetchProjects Lambda function is called to retrieve the list of projects associated with a user. The DeleteProject Lambda function removes a specific project from the Project table and any associated pages in the Pages table. It also deletes the folder in the S3 bucket containing all images for the project.

When a user enters a Project, the API endpoint calls the FetchPageCount Lambda function. This returns the number of pages for a project to update the current page number in the upload selector. The project is retrieved from the path parameters, as defined in the AWS SAM template:

FetchPageCount:
    Type: AWS::Serverless::Function
    Properties:
      Handler: app.handler
      Runtime: python3.8
      CodeUri: lambda_functions/fetchPageCount/
      Policies:
        - DynamoDBCrudPolicy:
            TableName: !Ref PagesTable
      Environment:
        Variables:
          PAGES_TABLE_NAME: !Ref PagesTable
      Events:
        GetResource:
          Type: Api
          Properties:
            RestApiId: !Ref DocumentScannerAPI
            Path: /pages/count/{project+}
            Method: get

The template creates an S3 bucket and two AWS IAM managed policies. The policies are applied to the AuthRole and UnauthRole created by Amplify. This allows users to upload images directly to the S3 bucket. To understand how Amplify works with Storage, see the documentation.

The template also sets an S3 event notification on the bucket for all object create events with a “.png” suffix. Whenever the frontend uploads an image to S3, the object create event invokes the ProcessDocument Lambda function.

The function parses the object key to get the project name, user, and page number. Amazon Textract then analyzes the text of the image. The object returned by Amazon Textract contains the detected text and detailed information, such as the positioning of text in the image. Only the raw lines of text are stored in the Pages table.

import os
import json, decimal
import boto3
import urllib.parse
from boto3.dynamodb.conditions import Key, Attr

client = boto3.resource('dynamodb')
textract = boto3.client('textract')

tableName = os.environ.get('PAGES_TABLE_NAME')

def handler(event, context):

  table = client.Table(tableName)

  print(table.table_status)
 
  key = urllib.parse.unquote(event['Records'][0]['s3']['object']['key'])
  bucket = event['Records'][0]['s3']['bucket']['name']
  project = key.split('/')[3]
  page = key.split('/')[4].split('.')[0]
  user = key.split('/')[2]
  
  response = textract.detect_document_text(
    Document={
        'S3Object': {
            'Bucket': bucket,
            'Name': key
        }
    })
    
  fullText = ""
  
  for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        fullText = fullText + item["Text"] + '\n'
  
  print(fullText)

  table.put_item(Item= {
    'project': user + '/' + project,
    'page': int(page), 
    'text': fullText
    })

  # print(response)
  return

The GeneratePDF Lambda function retrieves the detected text for each page in a project from the Pages table. It combines the text into a PDF and returns it as a base64-encoded string for download. This function can be modified if your document structure differs.

Understanding the frontend

In the GitHub repo, the folder amplify-frontend/src/ contains all the code for the frontend application. In main.js, the Amplify VueJS modules are configured to use the resources defined in aws-exports.js. It also configures the endpoint and S3 bucket of the serverless backend, defined in api-config.js.

In components/DocumentScanner.vue, the API module is imported and the API is defined.

API calls are defined as Vue methods that can be called by various other components and elements of the application.

In components/Project.vue, the frontend uses the Storage module for Amplify to upload images. For more information on how to use S3 in an Amplify project see the documentation.

Conclusion

This blog post shows how to create a multiuser application that can analyze text from images and generate PDF documents. This guide demonstrates how to do so in a secure and scalable way using a serverless approach. The example also shows an event driven pattern for handling high volume image processing using S3, Lambda, and Amazon Textract.

The Amplify Framework simplifies the process of implementing authentication, storage, and backend integration. Explore the full solution on GitHub to modify it for your next project or startup idea.

To learn more about AWS serverless and keep up to date on the latest features, subscribe to the YouTube channel.

#ServerlessForEveryone

#10: Build a serverless retail solution for endless aisle on AWS

#9: Optimizing data with automated intelligent document processing solutions

#8: Disaster Recovery Solutions with AWS managed services, Part 3: Multi-Site Active/Passive

#7: Simulating Kubernetes-workload AZ failures with AWS Fault Injection Simulator

#6: Let’s Architect! Designing event-driven architectures

#5: Use a reusable ETL framework in your AWS lake house architecture

#4: Invoking asynchronous external APIs with AWS Step Functions

#3: Announcing updates to the AWS Well-Architected Framework

#2: Let’s Architect! Designing architectures for multi-tenancy

#1: Understand resiliency patterns and trade-offs to architect efficiently in the cloud

Bonus! Three older special mentions

Why it’s challenging to process and manage unstructured data

Solution overview

Process unstructured data with AWS AI services

Implement access control on raw and processed data in Amazon S3

Conclusion

About the Authors

Customer document management challenges

Automated intelligent document processing solution

Solution overview

Customer benefits

Vertical industry use cases

Conclusion

Steps illustrated with more detail:

Cleanup

Conclusion

Why does authorization matter?

Automated document processing architecture

Document ingestion and classification

Document processing workflow

Overview

Entity extraction

Human review

Conclusion

Automating document processing workflow

Components in the solution

Text extraction

Documents classification

Classification model creation

Classification model invocation and processing

Human review

Additional processes

Documents workflow status capturing

Search and package creation

Purge data process

Conclusion

Overview

Determining payee name, due date and payment amount

Payment Processing

Prerequisites

Deploying the solution

Testing the solution

Cleanup

Conclusion

Prerequisites

Deploy the application

Configure and run the frontend application

Using the frontend application

Understanding the serverless backend

Understanding the frontend

Conclusion

The collective thoughts of the interwebz