Tag Archives: Amazon Comprehend

Streamline SMS and Emailing Marketing Compliance with Amazon Comprehend

2024-09-17 Koushik Mani

Post Syndicated from Koushik Mani original https://aws.amazon.com/blogs/messaging-and-targeting/streamline-sms-and-emailing-marketing-compliance-with-amazon-comprehend/

In today’s digital landscape, businesses heavily rely on SMS and email campaigns to engage with customers and deliver timely, relevant messages. The shift towards digital marketing has increased customer engagement, accelerated delivery, and expanded personalization options. Email and SMS marketing is essential to digital strategies according to 44% of Chief Marketing Officers and they allocate approximately 8% of their marketing towards this. Industries face stringent restrictions on the content they can send due to legal regulations and carrier filtering policies.

Messages related to the subjects listed below are considered restricted and are subject to heavy filtering or even being blocked outright. Failing to comply with these restrictions can result in severe consequences, including legal action, fines, and irreparable damage to a brand’s reputation. Marketers need a solution that will proactively scan their content used in campaigns and flag restricted content before sending it out to their customers without facing penalties and losing trust.:

Gambling
High-risk financial services
Debt forgiveness
S.H.A.F.T (Sex, Hate, Alcohol, Firearms, and Tobacco)
Illegal substances

In this blog, we will explore how to leverage Amazon Comprehend, Amazon S3, and AWS Lambda to proactively scan text-based marketing campaigns before publishing content . This solution enables businesses to enhance their marketing efforts while maintaining compliance with industry regulations, avoiding costly fines, and preserving their hard-earned reputation, conforming to best practices.

Solution Overview

AWS provides a robust suite of services to meet the infrastructure needs of the booming digital marketing industry, including messaging capabilities through email, SMS, push, and other channels through Amazon Simple Email Service, Amazon Simple Notification Service, or Amazon Pinpoint.

The main goal for this approach is to flag any message that contains restricted content mentioned above before distribution.

Figure 1: Architecture for proactive scanning of marketing content

Following are the high-level steps:

Upload documents to be scanned to the S3 bucket.
Utilize Amazon Comprehend custom classification for categorizing the documents uploaded.
Create an Amazon Comprehend endpoint to perform analysis.
Inference output is published to the destination S3 bucket.
Utilize AWS Lambda function to consume the output from the destination S3 bucket.
Send the compliant messages through various messaging channels.

Solution Walkthrough

Step 1: Upload Documents to Be Scanned to S3

Sign in to the AWS Management Console and open the Amazon S3 console
In the navigation bar on the top of the page, choose the name of the currently displayed AWS Region. Next, choose the Region in which you want to create a bucket.
In the left navigation pane, choose Buckets.
Choose Create bucket.
- The Create bucket page opens.
Under General configuration, view the AWS Region where your bucket will be created.
Under Bucket type, choose General purpose.
For Bucket name, enter a name for your bucket.
- The bucket name must:
  - Be unique within a partition. A partition is a grouping of Regions. AWS currently has three partitions: aws (Standard Regions), aws-cn (China Regions), and aws-us-gov (AWS GovCloud (US) Regions).
  - Be between 3 and 63 characters long.
  - Consist only of lowercase letters, numbers, dots (.), and hyphens (-). For best compatibility, we recommend that you avoid using dots (.) in bucket names, except for buckets that are used only for static website hosting.
  - Begin and end with a letter or number.
In the Buckets list, choose the name of the bucket that you want to upload your folders or files to.
Choose Upload.
In the Upload window, do one of the following:
- Drag and drop files and folders to the Upload window.
- Choose Add file or Add folder, choose the files or folders to upload, and choose Open.
- To enable versioning, under Destination, choose Enable Bucket Versioning.
- To upload the listed files and folders without configuring additional upload options, at the bottom of the page, choose Upload.
Amazon S3 uploads your objects and folders. When the upload is finished, you see a success message on the Upload: status page.

Step 2: Creating a Custom Classifiction Model

Custom Classification Model

Out-of-the-box models may not capture nuances and terminology specific to an organization’s industry or use case. Therefore, we train a custom model to identify compliant messages.

A custom classification model is a feature that allows you to train a machine learning model to classify text data based on categories that are specific to your use case or industry. It trains the model to recognize and sort different types of content which is used to power the endpoint. A custom classification model is designed to save costs and promote compliant messages and further prevent marketing companies from potential fines.

Requirements for custom classification:

Dataset creation
- A CSV dataset with 1000 examples of marketing messages, each labeled as compliant (1) or non-compliant (0).
- Designed to train a model for accurate predictions on marketing message compliance.

Figure 2: Screenshot of dataset – 20 entries of censored marketing messages

Creating a Test Data Set
In addition to providing a dataset to power your customer classification model, a test dataset is also required to test the data that the model will be running on. Without a test dataset, Amazon Comprehend trains the model with 90 percent of the training data. It reserves 10 percent of the training data to use for testing. When using a test dataset, the test data must include at least one example for each unique label (0 or 1) in the training dataset.

Upload the data set and test data set to an S3 Bucket, by following the steps in this user guide.
In the AWS Console, search for Amazon Comprehend.
Once selected, select custom classification on the left panel.
Once there, select Create new model.
Next specify model settings:
- Model name
- Specify the version (optional)
- Language: English
Specify the data specifications:
- Training model type: Plain Text Documents
- Data format: CSV File
- Classifier Mode: Using Single-Label Mode
- Training Dataset: Give the name of the bucket you created in step 1
- Test Data set: Autosplit, i.e. how much of your data will be used for training and testing.

Specify the location of the model output in S3

Create an IAM Role
- Permissions to access: Train, Test and output data (if specified in your S3 Buckets)
Once all parameters have been identified, select Create.
Once the model has been created, you can view it under Custom Classification. To check and verify the accuracy and F1 score, select the version number of the model. Here, you can view the model details under the Performance tab.

Step 3: Creating an Endpoint in Amazon Comprehend

Next, an endpoint needs to be created to analyze documents. To create an endpoint:

Select endpoint on the left panel in Amazon Comprehend.
Select Create endpoint in the left panel.
Specify Endpoints Settings :
- Provide a name
- Custom model type: Custom Classification
- Choose a custom model that you want to attach to the new endpoint. From the dropdown, you can search by model name.
- Figure 8: Amazon Comprehend – Endpoint settings
Provide the number of inference units (IUs): 1
Once all the parameters have been provided, ensure that the Acknowledge checkbox has been selected.
Finally, select Create endpoint.

Step 4: Scanning Text with the Custom Classification Endpoint

Once the endpoint has been successfully created, it can be used for real-time analysis or batch-processing jobs. Below is a walkthrough of how both options can be achieved.

Real-time analysis:

On the left panel, select Realtime Analysis.
Pick Analysis type: custom, to view real-time insights based on the custom models from an endpoint you’ve created
Select custom model type
Select your Endpoint
Insert your input text.
- For this example, we have used a non-compliant message: Huge sale at Pine County Shooting Range 25% off for 6mm and 9mm bullets! Lazer add-ons on clearance too
Once inserted, click Analyze.
Once analyzed, you will see a confidence score under Classes. Because the dataset is labeled as 0 for non-compliant and 1 for compliant. The message that was inserted was non-compliant, the result of the real-time analysis is a high confidence score for non-compliant.

Real-time analysis in Amazon Comprehend:

On the left panel in Amazon Comprend, select Analysis Jobs.
Select the Create Job button.
Configure Job settings:
- Enter the Name
- Analysis Type: Custom Classification
- Classifications Model: The model you have created for your Classifier, as well as the version number of that model you would like to use for this job.
Enter the location of the Input Data and Output Data in the form of an S3 bucket URL.
Before creating a job the last thing, we want to do is provide the right access permission, by creating an IAM role that give access permissions to the S3 input and output locations.
Once the batch processing job shows a status of completed, you can view the results in the output S3 bucket which was identified earlier. The results will be in a json file where each line represents the confidence score for each marketing message.

Step 5 (optional): Publish message to communication service

The result from the batch processing is automatically uploaded to the output S3 bucket. For each json file uploaded, S3 will initiate an S3 Event Notification which will inform a Lambda function that a new S3 object has been created.

The Lambda function will evaluate the results and automatically identify the messages labeled as compliant (label 0). These compliant messages will then be published to communication services using one of the following three APIs, depending on the desired service:

To automatically trigger the AWS Lambda function, which will read the files uploaded into the S3 bucket and display the data using the Python Pandas library, we will use the boto3 API to read the files from the S3 bucket.

Create an IAM Role in AWS.
Create an AWS S3 bucket.
Create the AWS Lambda function with S3 triggers enabled.
Update the Lambda code with a Python script to read the data and send the communication to customer.

Conclusion

Proactively scanning and classifying marketing content for compliance is a critical aspect of ensuring successful digital marketing campaigns while adhering to industry regulations. Leveraging the powerful combination of Amazon Comprehend, Amazon S3, and AWS Lambda enables the automatic analysis of text-based marketing messages and flagging of any non-compliant content before sending them to your customer. Following these steps provides you with the tools and knowledge to implement proactive scanning for your marketing content. This solution will help mitigate the risks of non-compliance, avoiding costly fines and reputational damage, while freeing up time for your content creation teams to focus on ideation and crafting compelling marketing messages. Regular monitoring and fine-tuning of the custom classification model should be conducted to ensure accurate identification of non-compliant language.

To get started with proactively scanning and classifying marketing content for compliance, see Amazon Comprehend Custom Classification.

About the Authors

Caroline Des Rochers

Caroline is a Solutions Architect at Amazon Web Services, based in Montreal, Canada. She works closely with customers, in both English and French, to accelerate innovation and advise them through technical challenges. In her spare time, she is a passionate skier and cyclist, always ready for new outdoor adventures.

Erika Houde Pearce

Erika Houde-Pearce is a bilingual Solutions Architect in Montreal, guiding small and medium businesses in eastern Canada through their cloud transformations. Her expertise empowers organizations to unlock the full potential of cloud technology, accelerating their growth and agility. Away from work, she spends her spare time with her golden retriever, Scotia.

Koushik Mani

Koushik Mani is a Solutions Architect at AWS. He had worked as a Software Engineer for two years focusing on machine learning and cloud computing use cases at Telstra. He completed his masters in computer science from University of Southern California. He is passionate about machine learning and generative AI use cases and building solutions.

New for Amazon Comprehend – Toxicity Detection

2023-11-09 Danilo Poccia

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-for-amazon-comprehend-toxicity-detection/

With Amazon Comprehend, you can extract insights from text without being a machine learning expert. Using its built-in models, Comprehend can analyze the syntax of your input documents and find entities, events, key phrases, personally identifiable information (PII), and the overall sentiment or sentiments associated with specific entities (such as brands or products).

Today, we are adding the capability to detect toxic content. This new capability helps you build safer environments for your end users. For example, you can use toxicity detection to improve the safety of applications open to external contributions such as comments. When using generative AI, toxicity detection can be used to check the input prompts and the output responses from large language models (LLMs).

You can use toxicity detection with the AWS Command Line Interface (AWS CLI) and AWS SDKs. Let’s see how this works in practice with a few examples using the AWS CLI, an AWS SDK, and to check the use of an LLM.

Using Amazon Comprehend Toxicity Detection with AWS CLI
The new detect-toxic-content subcommand in the AWS CLI detects toxicity in text. The output contains a list of labels, one for each text segment in input. For each text segment, a list is provided with the labels and a score (between 0 and 1).

For example, this AWS CLI command analyzes one text segment and returns one Labels section and an overall Toxicity score for the segment between o and 1:

aws comprehend detect-toxic-content --language-code en --text-segments Text="'Good morning, it\'s a beautiful day.'"

{
    "ResultList": [
        {
            "Labels": [
                {
                    "Name": "PROFANITY",
                    "Score": 0.00039999998989515007
                },
                {
                    "Name": "HATE_SPEECH",
                    "Score": 0.01510000042617321
                },
                {
                    "Name": "INSULT",
                    "Score": 0.004699999932199717
                },
                {
                    "Name": "GRAPHIC",
                    "Score": 9.999999747378752e-05
                },
                {
                    "Name": "HARASSMENT_OR_ABUSE",
                    "Score": 0.0006000000284984708
                },
                {
                    "Name": "SEXUAL",
                    "Score": 0.03889999911189079
                },
                {
                    "Name": "VIOLENCE_OR_THREAT",
                    "Score": 0.016899999231100082
                }
            ],
            "Toxicity": 0.012299999594688416
        }
    ]
}

As expected, all scores are close to zero, and no toxicity was detected in this text.

To pass input as a file, I first use the AWS CLI --generate-cli-skeleton option to generate a skeleton of the JSON syntax used by the detect-toxic-content command:

aws comprehend detect-toxic-content --generate-cli-skeleton

{
    "TextSegments": [
        {
            "Text": ""
        }
    ],
    "LanguageCode": "en"
}

I write the output to a file and add three text segments (I will not show here the text used to show what happens with toxic content). This time, different levels of toxicity content has been found. Each Labels section is related to the corresponding input text segment.

aws comprehend detect-toxic-content --cli-input-json file://input.json

{
    "ResultList": [
        {
            "Labels": [
                {
                    "Name": "PROFANITY",
                    "Score": 0.03020000085234642
                },
                {
                    "Name": "HATE_SPEECH",
                    "Score": 0.12549999356269836
                },
                {
                    "Name": "INSULT",
                    "Score": 0.0738999992609024
                },
                {
                    "Name": "GRAPHIC",
                    "Score": 0.024399999529123306
                },
                {
                    "Name": "HARASSMENT_OR_ABUSE",
                    "Score": 0.09510000050067902
                },
                {
                    "Name": "SEXUAL",
                    "Score": 0.023900000378489494
                },
                {
                    "Name": "VIOLENCE_OR_THREAT",
                    "Score": 0.15549999475479126
                }
            ],
            "Toxicity": 0.06650000065565109
        },
        {
            "Labels": [
                {
                    "Name": "PROFANITY",
                    "Score": 0.03400000184774399
                },
                {
                    "Name": "HATE_SPEECH",
                    "Score": 0.2676999866962433
                },
                {
                    "Name": "INSULT",
                    "Score": 0.1981000006198883
                },
                {
                    "Name": "GRAPHIC",
                    "Score": 0.03139999881386757
                },
                {
                    "Name": "HARASSMENT_OR_ABUSE",
                    "Score": 0.1777999997138977
                },
                {
                    "Name": "SEXUAL",
                    "Score": 0.013000000268220901
                },
                {
                    "Name": "VIOLENCE_OR_THREAT",
                    "Score": 0.8395000100135803
                }
            ],
            "Toxicity": 0.41280001401901245
        },
        {
            "Labels": [
                {
                    "Name": "PROFANITY",
                    "Score": 0.9997000098228455
                },
                {
                    "Name": "HATE_SPEECH",
                    "Score": 0.39469999074935913
                },
                {
                    "Name": "INSULT",
                    "Score": 0.9265999794006348
                },
                {
                    "Name": "GRAPHIC",
                    "Score": 0.04650000110268593
                },
                {
                    "Name": "HARASSMENT_OR_ABUSE",
                    "Score": 0.4203999936580658
                },
                {
                    "Name": "SEXUAL",
                    "Score": 0.3353999853134155
                },
                {
                    "Name": "VIOLENCE_OR_THREAT",
                    "Score": 0.12409999966621399
                }
            ],
            "Toxicity": 0.8180999755859375
        }
    ]
}

Using Amazon Comprehend Toxicity Detection with AWS SDKs
Similar to what I did with the AWS CLI, I can use an AWS SDK to programmatically detect toxicity in my applications. The following Python script uses the AWS SDK for Python (Boto3) to detect toxicity in the text segments and print the labels if the score is greater than a specified threshold. In the code, I redacted the content of the second and third text segments and replaced it with ***.

import boto3

comprehend = boto3.client('comprehend')

THRESHOLD = 0.2
response = comprehend.detect_toxic_content(
    TextSegments=[
        {
            "Text": "You can go through the door go, he's waiting for you on the right."
        },
        {
            "Text": "***"
        },
        {
            "Text": "***"
        }
    ],
    LanguageCode='en'
)

result_list = response['ResultList']

for i, result in enumerate(result_list):
    labels = result['Labels']
    detected = [ l for l in labels if l['Score'] > THRESHOLD ]
    if len(detected) > 0:
        print("Text segment {}".format(i + 1))
        for d in detected:
            print("{} score {:.2f}".format(d['Name'], d['Score']))

I run the Python script. The output contains the labels and the scores detected in the second and third text segments. No toxicity is detected in the first text segment.

Text segment 2
HATE_SPEECH score 0.27
VIOLENCE_OR_THREAT score 0.84
Text segment 3
PROFANITY score 1.00
HATE_SPEECH score 0.39
INSULT score 0.93
HARASSMENT_OR_ABUSE score 0.42
SEXUAL score 0.34

Using Amazon Comprehend Toxicity Detection with LLMs
I deployed the Mistral 7B model using Amazon SageMaker JumpStart as described in this blog post.

To avoid toxicity in the responses of the model, I built a Python script with three functions:

query_endpoint invokes the Mistral 7B model using the endpoint deployed by SageMaker JumpStart.
check_toxicity uses Comprehend to detect toxicity in a text and return a list of the detected labels.
avoid_toxicity takes in input a list of the detected labels and returns a message describing what to do to avoid toxicity.

The query to the LLM goes through only if no toxicity is detected in the input prompt. Then, the response from the LLM is printed only if no toxicity is detected in output. In case toxicity is detected, the script provides suggestions on how to fix the input prompt.

Here’s the code of the Python script:

import json
import boto3

comprehend = boto3.client('comprehend')
sagemaker_runtime = boto3.client("runtime.sagemaker")

ENDPOINT_NAME = "<REPLACE_WITH_YOUR_SAGEMAKER_JUMPSTART_ENDPOINT>"
THRESHOLD = 0.2


def query_endpoint(prompt):
    payload = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": 68,
            "no_repeat_ngram_size": 3,
        },
    }
    response = sagemaker_runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME, ContentType="application/json", Body=json.dumps(payload).encode("utf-8")
    )
    model_predictions = json.loads(response["Body"].read())
    generated_text = model_predictions[0]["generated_text"]
    return generated_text


def check_toxicity(text):
    response = comprehend.detect_toxic_content(
        TextSegments=[
            {
                "Text":  text
            }
        ],
        LanguageCode='en'
    )

    labels = response['ResultList'][0]['Labels']
    detected = [ l['Name'] for l in labels if l['Score'] > THRESHOLD ]

    return detected


def avoid_toxicity(detected):
    formatted = [ d.lower().replace("_", " ") for d in detected ]
    message = (
        "Avoid content that is toxic and is " +
        ", ".join(formatted) + ".\n"
    )
    return message


prompt = "Building a website can be done in 10 simple steps:"

detected_labels = check_toxicity(prompt)

if len(detected_labels) > 0:
    # Toxicity detected in the input prompt
    print("Please fix the prompt.")
    print(avoid_toxicity(detected_labels))
else:
    response = query_endpoint(prompt)

    detected_labels = check_toxicity(response)

    if len(detected_labels) > 0:
        # Toxicity detected in the output response
        print("Here's an improved prompt:")
        prompt = avoid_toxicity(detected_labels) + prompt
        print(prompt)
    else:
        print(response)

You’ll not get a toxic response with the sample prompt in the script, but it’s safe to know that you can set up an automatic process to check and mitigate if that happens.

Availability and Pricing
Toxicity detection for Amazon Comprehend is available today in the following AWS Regions: US East (N. Virginia), US West (Oregon), Europe (Ireland), and Asia Pacific (Sydney).

When using toxicity detection, there are no long-term commitments, and you pay based on the number of input characters in units of 100 characters (1 unit = 100 characters), with a minimum charge of 3 units (300 character) per request. For more information, see Amazon Comprehend pricing.

Improve the safety of your online communities and simplify the adoption of LLMs in your applications with toxicity detection.

— Danilo

Unstructured data management and governance using AWS AI/ML and analytics services

2023-10-25 Sakti Mishra

Post Syndicated from Sakti Mishra original https://aws.amazon.com/blogs/big-data/unstructured-data-management-and-governance-using-aws-ai-ml-and-analytics-services/

Unstructured data is information that doesn’t conform to a predefined schema or isn’t organized according to a preset data model. Unstructured information may have a little or a lot of structure but in ways that are unexpected or inconsistent. Text, images, audio, and videos are common examples of unstructured data. Most companies produce and consume unstructured data such as documents, emails, web pages, engagement center phone calls, and social media. By some estimates, unstructured data can make up to 80–90% of all new enterprise data and is growing many times faster than structured data. After decades of digitizing everything in your enterprise, you may have an enormous amount of data, but with dormant value. However, with the help of AI and machine learning (ML), new software tools are now available to unearth the value of unstructured data.

In this post, we discuss how AWS can help you successfully address the challenges of extracting insights from unstructured data. We discuss various design patterns and architectures for extracting and cataloging valuable insights from unstructured data using AWS. Additionally, we show how to use AWS AI/ML services for analyzing unstructured data.

Why it’s challenging to process and manage unstructured data

Unstructured data makes up a large proportion of the data in the enterprise that can’t be stored in a traditional relational database management systems (RDBMS). Understanding the data, categorizing it, storing it, and extracting insights from it can be challenging. In addition, identifying incremental changes requires specialized patterns and detecting sensitive data and meeting compliance requirements calls for sophisticated functions. It can be difficult to integrate unstructured data with structured data from existing information systems. Some view structured and unstructured data as apples and oranges, instead of being complementary. But most important of all, the assumed dormant value in the unstructured data is a question mark, which can only be answered after these sophisticated techniques have been applied. Therefore, there is a need to being able to analyze and extract value from the data economically and flexibly.

Solution overview

Data and metadata discovery is one of the primary requirements in data analytics, where data consumers explore what data is available and in what format, and then consume or query it for analysis. If you can apply a schema on top of the dataset, then it’s straightforward to query because you can load the data into a database or impose a virtual table schema for querying. But in the case of unstructured data, metadata discovery is challenging because the raw data isn’t easily readable.

You can integrate different technologies or tools to build a solution. In this post, we explain how to integrate different AWS services to provide an end-to-end solution that includes data extraction, management, and governance.

The solution integrates data in three tiers. The first is the raw input data that gets ingested by source systems, the second is the output data that gets extracted from input data using AI, and the third is the metadata layer that maintains a relationship between them for data discovery.

The following is a high-level architecture of the solution we can build to process the unstructured data, assuming the input data is being ingested to the raw input object store.

Unstructured Data Management - Block Level Architecture Diagram

The steps of the workflow are as follows:

Integrated AI services extract data from the unstructured data.
These services write the output to a data lake.
A metadata layer helps build the relationship between the raw data and AI extracted output. When the data and metadata are available for end-users, we can break the user access pattern into additional steps.
In the metadata catalog discovery step, we can use query engines to access the metadata for discovery and apply filters as per our analytics needs. Then we move to the next stage of accessing the actual data extracted from the raw unstructured data.
The end-user accesses the output of the AI services and uses the query engines to query the structured data available in the data lake. We can optionally integrate additional tools that help control access and provide governance.
There might be scenarios where, after accessing the AI extracted output, the end-user wants to access the original raw object (such as media files) for further analysis. Additionally, we need to make sure we have access control policies so the end-user has access only to the respective raw data they want to access.

Now that we understand the high-level architecture, let’s discuss what AWS services we can integrate in each step of the architecture to provide an end-to-end solution.

The following diagram is the enhanced version of our solution architecture, where we have integrated AWS services.

Unstructured Data Management - AWS Native Architecture

Let’s understand how these AWS services are integrated in detail. We have divided the steps into two broad user flows: data processing and metadata enrichment (Steps 1–3) and end-users accessing the data and metadata with fine-grained access control (Steps 4–6).

Various AI services (which we discuss in the next section) extract data from the unstructured datasets.
The output is written to an Amazon Simple Storage Service (Amazon S3) bucket (labeled Extracted JSON in the preceding diagram). Optionally, we can restructure the input raw objects for better partitioning, which can help while implementing fine-grained access control on the raw input data (labeled as the Partitioned bucket in the diagram).
After the initial data extraction phase, we can apply additional transformations to enrich the datasets using AWS Glue. We also build an additional metadata layer, which maintains a relationship between the raw S3 object path, the AI extracted output path, the optional enriched version S3 path, and any other metadata that will help the end-user discover the data.
In the metadata catalog discovery step, we use the AWS Glue Data Catalog as the technical catalog, Amazon Athena and Amazon Redshift Spectrum as query engines, AWS Lake Formation for fine-grained access control, and Amazon DataZone for additional governance.
The AI extracted output is expected to be available as a delimited file or in JSON format. We can create an AWS Glue Data Catalog table for querying using Athena or Redshift Spectrum. Like the previous step, we can use Lake Formation policies for fine-grained access control.
Lastly, the end-user accesses the raw unstructured data available in Amazon S3 for further analysis. We have proposed integrating Amazon S3 Access Points for access control at this layer. We explain this in detail later in this post.

Now let’s expand the following parts of the architecture to understand the implementation better:

Using AWS AI services to process unstructured data
Using S3 Access Points to integrate access control on raw S3 unstructured data

Process unstructured data with AWS AI services

As we discussed earlier, unstructured data can come in a variety of formats, such as text, audio, video, and images, and each type of data requires a different approach for extracting metadata. AWS AI services are designed to extract metadata from different types of unstructured data. The following are the most commonly used services for unstructured data processing:

Amazon Comprehend – This natural language processing (NLP) service uses ML to extract metadata from text data. It can analyze text in multiple languages, detect entities, extract key phrases, determine sentiment, and more. With Amazon Comprehend, you can easily gain insights from large volumes of text data such as extracting product entity, customer name, and sentiment from social media posts.
Amazon Transcribe – This speech-to-text service uses ML to convert speech to text and extract metadata from audio data. It can recognize multiple speakers, transcribe conversations, identify keywords, and more. With Amazon Transcribe, you can convert unstructured data such as customer support recordings into text and further derive insights from it.
Amazon Rekognition – This image and video analysis service uses ML to extract metadata from visual data. It can recognize objects, people, faces, and text, detect inappropriate content, and more. With Amazon Rekognition, you can easily analyze images and videos to gain insights such as identifying entity type (human or other) and identifying if the person is a known celebrity in an image.
Amazon Textract – You can use this ML service to extract metadata from scanned documents and images. It can extract text, tables, and forms from images, PDFs, and scanned documents. With Amazon Textract, you can digitize documents and extract data such as customer name, product name, product price, and date from an invoice.
Amazon SageMaker – This service enables you to build and deploy custom ML models for a wide range of use cases, including extracting metadata from unstructured data. With SageMaker, you can build custom models that are tailored to your specific needs, which can be particularly useful for extracting metadata from unstructured data that requires a high degree of accuracy or domain-specific knowledge.
Amazon Bedrock – This fully managed service offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon with a single API. It also offers a broad set of capabilities to build generative AI applications, simplifying development while maintaining privacy and security.

With these specialized AI services, you can efficiently extract metadata from unstructured data and use it for further analysis and insights. It’s important to note that each service has its own strengths and limitations, and choosing the right service for your specific use case is critical for achieving accurate and reliable results.

AWS AI services are available via various APIs, which enables you to integrate AI capabilities into your applications and workflows. AWS Step Functions is a serverless workflow service that allows you to coordinate and orchestrate multiple AWS services, including AI services, into a single workflow. This can be particularly useful when you need to process large amounts of unstructured data and perform multiple AI-related tasks, such as text analysis, image recognition, and NLP.

With Step Functions and AWS Lambda functions, you can create sophisticated workflows that include AI services and other AWS services. For instance, you can use Amazon S3 to store input data, invoke a Lambda function to trigger an Amazon Transcribe job to transcribe an audio file, and use the output to trigger an Amazon Comprehend analysis job to generate sentiment metadata for the transcribed text. This enables you to create complex, multi-step workflows that are straightforward to manage, scalable, and cost-effective.

The following is an example architecture that shows how Step Functions can help invoke AWS AI services using Lambda functions.

AWS AI Services - Lambda Event Workflow -Unstructured Data

The workflow steps are as follows:

Unstructured data, such as text files, audio files, and video files, are ingested into the S3 raw bucket.
A Lambda function is triggered to read the data from the S3 bucket and call Step Functions to orchestrate the workflow required to extract the metadata.
The Step Functions workflow checks the type of file, calls the corresponding AWS AI service APIs, checks the job status, and performs any postprocessing required on the output.
AWS AI services can be accessed via APIs and invoked as batch jobs. To extract metadata from different types of unstructured data, you can use multiple AI services in sequence, with each service processing the corresponding file type.
After the Step Functions workflow completes the metadata extraction process and performs any required postprocessing, the resulting output is stored in an S3 bucket for cataloging.

Next, let’s understand how can we implement security or access control on both the extracted output as well as the raw input objects.

Implement access control on raw and processed data in Amazon S3

We just consider access controls for three types of data when managing unstructured data: the AI-extracted semi-structured output, the metadata, and the raw unstructured original files. When it comes to AI extracted output, it’s in JSON format and can be restricted via Lake Formation and Amazon DataZone. We recommend keeping the metadata (information that captures which unstructured datasets are already processed by the pipeline and available for analysis) open to your organization, which will enable metadata discovery across the organization.

To control access of raw unstructured data, you can integrate S3 Access Points and explore additional support in the future as AWS services evolve. S3 Access Points simplify data access for any AWS service or customer application that stores data in Amazon S3. Access points are named network endpoints that are attached to buckets that you can use to perform S3 object operations. Each access point has distinct permissions and network controls that Amazon S3 applies for any request that is made through that access point. Each access point enforces a customized access point policy that works in conjunction with the bucket policy that is attached to the underlying bucket. With S3 Access Points, you can create unique access control policies for each access point to easily control access to specific datasets within an S3 bucket. This works well in multi-tenant or shared bucket scenarios where users or teams are assigned to unique prefixes within one S3 bucket.

An access point can support a single user or application, or groups of users or applications within and across accounts, allowing separate management of each access point. Every access point is associated with a single bucket and contains a network origin control and a Block Public Access control. For example, you can create an access point with a network origin control that only permits storage access from your virtual private cloud (VPC), a logically isolated section of the AWS Cloud. You can also create an access point with the access point policy configured to only allow access to objects with a defined prefix or to objects with specific tags. You can also configure custom Block Public Access settings for each access point.

The following architecture provides an overview of how an end-user can get access to specific S3 objects by assuming a specific AWS Identity and Access Management (IAM) role. If you have a large number of S3 objects to control access, consider grouping the S3 objects, assigning them tags, and then defining access control by tags.

S3 Access Points - Unstructured Data Management - Access Control

If you are implementing a solution that integrates S3 data available in multiple AWS accounts, you can take advantage of cross-account support for S3 Access Points.

Conclusion

This post explained how you can use AWS AI services to extract readable data from unstructured datasets, build a metadata layer on top of them to allow data discovery, and build an access control mechanism on top of the raw S3 objects and extracted data using Lake Formation, Amazon DataZone, and S3 Access Points.

In addition to AWS AI services, you can also integrate large language models with vector databases to enable semantic or similarity search on top of unstructured datasets. To learn more about how to enable semantic search on unstructured data by integrating Amazon OpenSearch Service as a vector database, refer to Try semantic search with the Amazon OpenSearch Service vector engine.

As of writing this post, S3 Access Points is one of the best solutions to implement access control on raw S3 objects using tagging, but as AWS service features evolve in the future, you can explore alternative options as well.

About the Authors

Sakti Mishra is a Principal Solutions Architect at AWS, where he helps customers modernize their data architecture and define their end-to-end data strategy, including data security, accessibility, governance, and more. He is also the author of the book Simplify Big Data Analytics with Amazon EMR. Outside of work, Sakti enjoys learning new technologies, watching movies, and visiting places with family.

Bhavana Chirumamilla is a Senior Resident Architect at AWS with a strong passion for data and machine learning operations. She brings a wealth of experience and enthusiasm to help enterprises build effective data and ML strategies. In her spare time, Bhavana enjoys spending time with her family and engaging in various activities such as traveling, hiking, gardening, and watching documentaries.

Sheela Sonone is a Senior Resident Architect at AWS. She helps AWS customers make informed choices and trade-offs about accelerating their data, analytics, and AI/ML workloads and implementations. In her spare time, she enjoys spending time with her family—usually on tennis courts.

Daniel Bruno is a Principal Resident Architect at AWS. He had been building analytics and machine learning solutions for over 20 years and splits his time helping customers build data science programs and designing impactful ML products.

Optimizing data with automated intelligent document processing solutions

2023-04-28 Deependra Shekhawat

Post Syndicated from Deependra Shekhawat original https://aws.amazon.com/blogs/architecture/optimizing-data-with-automated-intelligent-document-processing-solutions/

Many organizations struggle to effectively manage and derive insights from the large amount of unstructured data locked in emails, PDFs, images, scanned documents, and more. The variety of formats, document layouts, and text makes it difficult for any standard Optical Character Recognition (OCR) to extract key insights from these data sources.

To help organizations overcome these document management and information extraction challenges, AWS offers connected, pre-trained artificial intelligence (AI) service APIs that help drive business outcomes from these document-based rich data sources.

This blog post describes a cost-effective, scalable automated intelligent document processing solution that leverages a Natural Processing Language (NLP) engine using Amazon Textract and Amazon Comprehend. This solution helps customers take advantage of industry leading machine learning (ML) technology in their document workflows without the need for in-house ML expertise.

Customer document management challenges

Customers across industry verticals experience the following document management challenges:

Extraction process accuracy varies significantly when applied to diverse sources; specifically handwritten text, images, and scanned documents.
Existing scripting and rule-based solutions cannot provide customer domain or problem-specific classifiers.
Traditional document management systems cannot consider feedback from domain experts to improve the learning process.
The Personally Identifiable Information (PII) data-handling is not robust or customizable, causing data privacy leakage concern.
Many manual interventions are required to complete the entire process.

Automated intelligent document processing solution

We introduced an automated intelligent document processing implementation to address key document management challenges. At the heart of the solution is a NLP engine that combines:

Amazon Textract
Amazon Comprehend
Amazon SageMaker
Custom regular expression-based Python parser

The full solution also leverages other AWS services as described in the following diagram (Figure 1) and steps to develop and operate a cost-effective and scalable architecture for document processing. It effectively extracts text from document types including PDFs, images, scanned documents, Microsoft Excel workbooks, and more.

Figure 1: AI-based intelligent document processing engine

Solution overview

Let’s explore the automated intelligent document processing solution step by step.

The document upload engine or business users upload the respective files or documents through a custom web application to the designated Amazon Simple Storage Service (Amazon S3) bucket.
The event-based architecture signals an Amazon S3 push event to invoke the respective AWS Lambda function to start document pre-processing.
The Lambda function evaluates the document payload, leverages Amazon Simple Queue Service (Amazon SQS) for async processing, prepares document metadata, stores it in Amazon DynamoDB, and calls the NLP engine to perform the information extraction process.
The NLP engine leverages Amazon Textract for text extraction from a variety of sources and leverages document metadata to optimize the appropriate API calls (for example, form, tabular, or PDF).
- Amazon Textract output is fed into Amazon Comprehend which consumes the extracted text and performs entity parsing, line/paragraph-based sentiment analysis, and document/paragraph classification. For better accuracy, we leverage a custom classifier within Amazon Comprehend.
- Amazon Comprehend also provides key APIs to mask PII data before it is used for any further consumption. The solution offers the ability to configure masking rules for each PII entity per masking requirements.
- To ensure the solution has capability to handle data from Microsoft Excel workbooks, we developed a custom parser using Python running inside an AWS Lambda function. Depending on the document metadata, this function can be invoked.
Output of Amazon Comprehend is then fed to ML models deployed using Amazon SageMaker depending on additional use cases configured by the customer to complement the overall process with ML-based recommendations, predictions, and personalization.
Once the NLP engine completes its processing, the job completion notification event signals another AWS Lambda function and updates the status in the respective Amazon SQS queue.
The Lambda post-processing function parses the resultant content generated by the NLP engine and stores it in the Amazon DynamoDB and Amazon S3 bucket. This step is responsible for the required data augmentation, key entities validation, and default value assignment to create a data structure that could be consumed by the presentation/visualization layer.
Users get the flexibility to see the extracted information and compare it with the original document extract in the custom user interface (UI). They can provide their feedback on extraction and entity parsing accuracy. From a user access management perspective, Amazon Cognito provides authorization and authentication.

Customer benefits

The automated intelligent document processing solution helps customers:

Increase overall document management efficiency by 50-60%, leveraging automation and nullifying manual interventions
Reduce in-house team involvement in administrative activities by up to 70% using integrated and connected processing workflows
Gain better visibility into key contractual obligations with features such as Document Classification (helps properly route documents to the respective process/team) and Obligation Extraction
Utilize a UI-based feedback mechanism for in-house domain experts/reviewers to see and validate the extracted information and offer feedback to inform further model training

From a cost-optimization perspective, depending on document type and required information, only the respective Amazon Textract APIs calls are submitted. (For example, it is not worth using form/table-based Textract API calls for a Know Your Customer (KYC) document such as a driver’s license or passport when the AnalyzeID API is the most efficient solution.)

To maximize solution benefits, customers should invest time in building well-defined taxonomies ahead of using the document processing solution to accommodate their own use cases or industry domain-specific requirements. Their taxonomy input highlights only relevant keys and takes respective actions in case the requires keys are not extracted.

Vertical industry use cases

As mentioned, this document processing solution can be used across industry segments. Let’s explore some practical use cases. For example, it can help insurance industry professionals to accelerate claim processing and customer KYC-related processes. By extracting the key entities from the claim documents, mapping them against the customer defined taxonomy, and integrating with Amazon SageMaker models for anomaly detection (anomalous claims), insurance providers can improve claim management and customer satisfaction.

In the healthcare industry, the solution can help with medical records and report processing, key medical entity extraction, and customer data masking.

The document processing solution can help the banking industry by automating check processing and delivering the ability to extract key entities like payer, payee, date, and amount from the checks.

Conclusion

Manual document processing is resource-intensive, time consuming, and costly. Customers need to allocate resources to process large volume documents, lowering business agility. Their employees are performing manual “stare and compare” tasks, potentially reducing worker morale and preventing them from focusing where their efforts are better placed.

Intelligent document processing helps businesses overcome these challenges by automating the classification, extraction, and analysis of data. This expedites decision cycles, allocates resources to high-value tasks, and reduces costs.

Pre-trained APIs of AWS AI services allow for quick classification, extraction, and data analyzation from scores of documents. This solution also has industry specific features that can quickly process specialized industry specific documents. This blog discussed the foundational architecture to helps to accelerate implementation of any specific document processing use case.

New – Process PDFs, Word Documents, and Images with Amazon Comprehend for IDP

2022-12-01 Marcia Villalba

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/aws/now-process-pdfs-word-documents-and-images-with-amazon-comprehend-for-idp/

Today we are announcing a new Amazon Comprehend feature for intelligent document processing (IDP). This feature allows you to classify and extract entities from PDF documents, Microsoft Word files, and images directly from Amazon Comprehend without you needing to extract the text first.

Many customers need to process documents that have a semi-structured format, like images of receipts that were scanned or tax statements in PDF format. Until today, those customers first needed to preprocess those documents using optical character recognition (OCR) tools to extract the text. Then they could use Amazon Comprehend to classify and extract entities from those preprocessed files.

Now with Amazon Comprehend for IDP, customers can process their semi-structured documents, such as PDFs, docx, PNG, JPG, or TIFF images, as well as plain-text documents, with a single API call. This new feature combines OCR and Amazon Comprehend’s existing natural language processing (NLP) capabilities to classify and extract entities from the documents. The custom document classification API allows you to organize documents into categories or classes, and the custom-named entity recognition API allows you to extract entities from documents like product codes or business-specific entities. For example, an insurance company can now process scanned customers’ claims with fewer API calls. Using the Amazon Comprehend entity recognition API, they can extract the customer number from the claims and use the custom classifier API to sort the claim into the different insurance categories—home, car, or personal.

Starting today, Amazon Comprehend for IDP APIs are available for real-time inferencing of files, as well as for asynchronous batch processing on large document sets. This feature simplifies the document processing pipeline and reduces development effort.

Getting Started
You can use Amazon Comprehend for IDP from the AWS Management Console, AWS SDKs, or AWS Command Line Interface (CLI).

In this demo, you will see how to asynchronously process a semi-structured file with a custom classifier. For extracting entities, the steps are different, and you can learn how to do it by checking the documentation.

In order to process a file with a classifier, you will first need to train a custom classifier. You can follow the steps in the Amazon Comprehend Developer Guide. You need to train this classifier with plain text data.

After you train your custom classifier, you can classify documents using either asynchronous or synchronous operations. For using the synchronous operation to analyze a single document, you need to create an endpoint to run real-time analysis using a custom model. You can find more information about real-time analysis in the documentation. For this demo, you are going to use the asynchronous operation, placing the documents to classify in an Amazon Simple Storage Service (Amazon S3) bucket and running an analysis batch job.

To get started classifying documents in batch from the console, on the Amazon Comprehend page, go to Analysis jobs and then Create job.

Then you can configure the new analysis job. First, input a name and pick Custom classification and the custom classifier you created earlier.

Then you can configure the input data. First, select the S3 location for that data. In that location, you can place your PDFs, images, and Word Documents. Because you are processing semi-structured documents, you need to choose One document per file. If you want to override Amazon Comprehend settings for extracting and parsing the document, you can configure the Advanced document input options.

After configuring the input data, you can select where the output of this analysis should be stored. Also, you need to give access permissions for this analysis job to read and write on the specified Amazon S3 locations, and then you are ready to create the job.

The job takes a few minutes to run, depending on the size of the input. When the job is ready, you can check the output results. You can find the results in the Amazon S3 location you specified when you created the job.

In the results folder, you will find a .out file for each of the semi-structured files Amazon Comprehend classified. The .out file is a JSON, in which each line represents a page of the document. In the amazon-textract-output directory, you will find a folder for each classified file, and inside that folder, there is one file per page from the original file. Those page files contain the classification results. To learn more about the outputs of the classifications, check the documentation page.

Available Now
You can get started classifying and extracting entities from semi-structured files like PDFs, images, and Word Documents asynchronously and synchronously today from Amazon Comprehend in all the Regions where Amazon Comprehend is available. Learn more about this new launch in the Amazon Comprehend Developer Guide.

— Marcia

Automate your Data Extraction for Oil Well Data with Amazon Textract

2022-02-24 Ashutosh Pateriya

Post Syndicated from Ashutosh Pateriya original https://aws.amazon.com/blogs/architecture/automate-your-data-extraction-for-oil-well-data-with-amazon-textract/

Traditionally, many businesses archive physical formats of their business documents. These can be invoices, sales memos, purchase orders, vendor-related documents, and inventory documents. As more and more businesses are moving towards digitizing their business processes, it is becoming challenging to effectively manage these documents and perform business analytics on them. For example, in the Oil and Gas (O&G) industry, companies have numerous documents that are generated through the exploration and production lifecycle of an oil well. These documents can provide many insights that can help inform business decisions.

As documents are usually stored in a paper format, information retrieval can be time consuming and cumbersome. Even those available in a digital format may not have adequate metadata associated to efficiently perform search and build insights.

In this post, you will learn how to build a text extraction solution using Amazon Textract service. This will automatically extract text and data from scanned documents and upload into Amazon Simple Storage Service (S3). We will show you how to find insights and relationships in the extracted text using Amazon Comprehend. This data is indexed and populated into Amazon OpenSearch Service to search and visualize it in a Kibana dashboard.

Figure 1 illustrates a solution built with AWS, which extracts O&G well data information from PDF documents. This solution is serverless and built using AWS Managed Services. This will help you to decrease system maintenance overhead while making your solution scalable and reliable.

Figure 1. Automated form data extraction architecture

Following are the high-level steps:

Upload an image file or PDF document to Amazon S3 for analysis. Amazon S3 is a durable document storage used for central document management.
Amazon S3 event initiates the AWS Lambda function Fn-A. AWS Lambda has functional logic to call the Amazon Textract and Comprehend services and processing.
AWS Lambda function Fn-A invokes Amazon Textract to extract text as key-value pairs from image or PDF. Amazon Textract automatically extracts data from the scanned documents.
Amazon Textract sends the extracted keys from image/PDF to Amazon SNS.
Amazon SNS notifies Amazon SQS when text extraction is complete by sending the extracted keys to Amazon SQS.
Amazon SQS initiates AWS Lambda function Fn-B with the extracted keys.
AWS Lambda function Fn-B invokes Amazon Comprehend for the custom entity recognition. Comprehend uses custom-trained machine learning (ML) to find discrepancies in key names from Amazon Textract.
The data is indexed and loaded into Amazon OpenSearch, which indexes and visualizes the data.
Kibana processes the indexed data.
User accesses Kibana to search documents.

Steps illustrated with more detail:

1. User uploads the document for analysis to Amazon S3. Uploaded document can be an image file or a PDF. Here we are using the S3 console for document upload. Figure 2 shows the sample file used for this demo.

Figure 2. Sample input form

2. Amazon S3 upload event initiates AWS Lambda function Fn-A. Refer to the AWS tutorial to learn about S3 Lambda configuration. View Sample code for Lambda FunctionA.

3. AWS Lambda function Fn-A invokes Amazon Textract. Amazon Textract uses artificial intelligence (AI) to read as a human would, by extracting text, layouts, tables, forms, and structured data with context and without configuration, training, or custom code.

4. Amazon Textract starts processing the file as it is uploaded. This process takes few minutes since the file is a multipage document.

5. Amazon SNS notifies Amazon Textract of completion. Amazon Textract processing works asynchronously, as we decouple our architecture using Amazon SQS. To configure Amazon SNS to send data to Amazon SQS:

Create an SNS topic. ‘AmazonTextract-SNS’ is the SNS topic that we created for this demo.
Then create an SQS queue. ‘AmazonTextract-SQS’ is the queue that we created for this demo.
To receive messages published to a topic, you must subscribe an endpoint to the topic. When you subscribe an endpoint to a topic, the endpoint begins to receive messages published to the associated topic. Figure 3 shows the SNS topic ‘AmazonTextract-SNS’ subscribed to Amazon SQS queue.

Figure 3. Amazon SNS configuration

Figure 4. Amazon SQS configuration

6. Configure SQS queue to initiate the AWS Lambda function Fn-B. This should happen upon receiving extracted data via SNS topic. Refer to this SQS tutorial to learn about SQS Lambda configuration. See Sample code for Lambda FunctionB.

7. AWS Lambda function Fn-B invokes Amazon Comprehend for the custom entity recognition.

Figure 5. Lambda FunctionB configuration in Amazon Comprehend

Configure Amazon Comprehend to create a custom entity recognition (text-job2) for the entities. These can be API Number, Lease_Number, Water_Depth, Well_Number, and can use the model created in previous step (well_no, well#, well num). For instructions on labeling your data, see Developing NER models with Amazon SageMaker Ground Truth and Amazon Comprehend.

Figure 6. Comprehend job

Now create an endpoint for the custom entity recognition for the Lambda function, to send the data to Amazon Comprehend service, as shown in Figure 7 and 8.

Figure 7. Comprehend endpoint creation

Copy the Amazon Comprehend endpoint ARN to include it in the Lambda function as an environment variable (see Figure 5).

Figure 8. Comprehend endpoint created successfully

8. Launch an Amazon OpenSearch domain. See Creating and managing Amazon OpenSearch Service domains. The data is indexed and populated into Amazon OpenSearch. The Amazon OpenSearch domain name is configured at Lambda FnB as an environment variable to push the extracted data to OpenSearch.

9. Kibana processes the indexed data from Amazon OpenSearch. Amazon OpenSearch data is populated on Kibana, shown in Figure 9.

Figure 9. Kibana dashboard showing Amazon OpenSearch data

10. Access Kibana for document search. The selected fields can be viewed as a table using filters, see Figure 10.

Figure 10. Kibana dashboard table view for selected fields

You can search the LEASE_NUMBER = OCS-031, as shown in Figure 11.

Figure 11. Kibana dashboard search on Lease Number

OR you can search all the information for the WATER_DEPTH = 60, see Figure 12.

Figure 12. Kibana dashboard search on Water Depth

Cleanup

Shut down OpenSearch domain
Delete the Comprehend endpoint
Clear objects from S3 bucket

Conclusion

Data is growing at an enormous pace in all industries. As we have shown, you can build an ML-based text extraction solution to uncover the unstructured data from PDFs or images. You can derive intelligence from diverse data sources by incorporating a data extraction and optimization function. You can gain insights into the undiscovered data, by leveraging managed ML services, Amazon Textract, and Amazon Comprehend.

The extracted data from PDFs or images is indexed and populated into Amazon OpenSearch. You can use Kibana to search and visualize the data. By implementing this solution, customers can reduce the costs of physical document storage, in addition to labor costs for manually identifying relevant information.

This solution will drive decision-making efficiency. We discussed the oil and gas industry vertical as an example for this blog. But this solution can be applied to any industry that has physical/scanned documents such as legal documents, purchase receipts, inventory reports, invoices, and purchase orders.

For further reading:

Scale Up Language Detection with Amazon Comprehend and S3 Batch Operations

2021-11-16 Ameer Hakme

Post Syndicated from Ameer Hakme original https://aws.amazon.com/blogs/architecture/scale-up-language-detection-with-amazon-comprehend-and-s3-batch-operations/

Organizations have been collecting text data for years. Text data can help you intelligently address a range of challenges, from customer experience to analytics. These mixed language, unstructured datasets can contain a wealth of information within business documents, emails, and webpages. If you’re able to process and interpret it, this information can provide insight that can help guide your business decisions.

Amazon Comprehend is a natural language processing (NLP) service that extracts insights from text datasets. Amazon Comprehend asynchronous batch operations provides organizations with the ability to detect dominant languages from text documents stored in Amazon Simple Storage Service (S3) buckets. The asynchronous operations support a maximum document size of 1 MB for language detection. They can process up to one million documents per batch, for a total size of 5 GB.

But what if your organization has millions, or even billions of documents stored in an S3 bucket waiting for language detection processing? What if your language detection process requires customization to let you organize your documents based on language? What if you need to create a search index that can help you quickly audit your text document repositories?

In this blog post, we walk through a solution using Amazon S3 Batch Operations to initiate language detection jobs with AWS Lambda and Amazon Comprehend.

Real world language detection solution architecture

In our example, we have tens of millions of text objects stored in a single S3 bucket. These need to be processed to detect the dominant language. To create a language detection job, we must supply the S3 Batch Operations with a manifest file that lists all text objects. We can use an Amazon S3 Inventory report as an input to the manifest file to create S3 bucket object lists.

One of the supported S3 Batch Operations is invoking an AWS Lambda function. The S3 Batch Operations job uses LambdaInvoke to run a Lambda function on every object listed in a manifest. Lambda jobs are subject to overall Lambda concurrency limits for the account and each Lambda invocation will have a defined runtime. Organizations can request a service quota increase if necessary. Lambda functions in a single AWS account and in one Region share the concurrency limit. You can set reserved capacity for Lambda functions to ensure that they can be invoked even when overall capacity has been exhausted.

The Lambda function can be customized to take further actions based on the output received from Amazon Comprehend. The following diagram shows an architecture for language detection with S3 Batch Operations and Amazon Comprehend.

Figure 1. Language detection with S3 Batch Operations and Amazon Comprehend

Here is the architecture flow, as shown in Figure 1:

S3 Batch Operations will pull the manifest file from the source S3 bucket.
The S3 Batch Operations job will invoke the language detection Lambda function for each object listed in the manifest file. Lambda function code will perform a preliminary scan to check the file size, file extension, or any other requirements before calling Amazon Comprehend API. The Lambda function will then read the text object from S3 and then call the Amazon Comprehend API to detect the dominant language.
The Language Detection API automatically identifies text written in over 100 languages. The API response contains the dominant language with a confidence score supporting the interpretation. An example API response would be: {‘LanguageCode’: ‘fr’, ‘Score’: 0.9888556003570557}. Once the Lambda function receives the API response, Lambda will return a message back to S3 Batch Operations with a result code.
The Lambda function will then publish a message to an Amazon Simple Notification Service (SNS) topic.
An Amazon Simple Queue Service (SQS) queue subscribed to the SNS topic will receive the message with all required information related to each processed text object.
The SQS queue will invoke a Lambda function to process the message.
The Lambda function will move the targeted S3 object to a destination S3 bucket.
S3 Batch Operations will generate a completion report and will store it in an S3 bucket. The completion report will contain additional information for each task, including the object key name and version, status, error codes, and error descriptions.

Leverage SNS fanout pattern for more complex use cases

This blog post describes the basic building blocks for the solution, but it can be extended for more complex use cases, as illustrated in Figure 2. Using an SNS fanout application integration pattern would enable many SQS queues to subscribe to the same SNS topic. These SQS queues would receive identical notifications for the processed text objects, and you could implement downstream services for additional evaluation. For example, you can store text object metadata in an Amazon DynamoDB table. You can further analyze the number of processed text objects, dominant languages, object size, word count, and more.

Your source S3 bucket may have objects being uploaded in real time in addition to the existing batch processes. In this case, you could process these objects in a new batch job, or process them individually during upload by using S3 event triggers and Lambda.

Figure 2. Extending the solution

Conclusion

You can implement a language detection job in a number of ways. All the Amazon Comprehend single document and synchronous API batch operations can be used for real-time analysis. Asynchronous batch operations can analyze large documents and large collections of documents. However, by using S3 Batch Operations, you can scale language detection batch operations to billions of text objects stored in S3. This solution has the flexibility to add customized functionality. This may be useful for more complex jobs, or when you want to capture different data points from your S3 objects.

For further reading:

Top 5: Featured Architecture Content for September

2021-10-04 Elyse Lopez

Post Syndicated from Elyse Lopez original https://aws.amazon.com/blogs/architecture/top-5-featured-architecture-content-for-september/

The AWS Architecture Center provides new and notable reference architecture diagrams, vetted architecture solutions, AWS Well-Architected best practices, whitepapers, and more. This blog post features some of our best picks from the new and newly updated content we released in the past month.

1. AWS Best Practices for DDoS Resiliency

Prioritizing the availability and responsiveness of your application helps you maintain customer trust. That’s why it’s crucial to protect your business from the impact of distributed denial of service (DDoS) and other cyberattacks. This whitepaper provides you prescriptive guidance to improve the resiliency of your applications and best practices for how to manage different attack types.

2. Predictive Modeling for Automotive Retail

Automotive retailers use data to better understand how their incentives are helping to sell cars. This new reference architecture diagram shows you how to design a modeling system that provides granular return on investment (ROI) predictions for automotive sales incentives.

3. AWS Graviton Performance Testing – Tips for Independent Software Vendors

If you’re deciding whether to phase in AWS Graviton processors for your workload, this whitepaper covers best practices and common pitfalls for defining test approaches to evaluate Amazon Elastic Compute Cloud (Amazon EC2) instance performance and how to set success factors and compare different test methods and their implementation.

4. Text Analysis with Amazon OpenSearch Service and Amazon Comprehend

This AWS Solutions Implementation was recently updated with new guidance related to Amazon OpenSearch Service, the successor to Amazon Elasticsearch Service. Learn how Amazon OpenSearch Service and Amazon Comprehend work together to deploy a cost-effective, end-to-end solution to extract meaningful insights from unstructured text-based data such as customer calls, support tickets, and online customer feedback.

5. Back to Basics: Hosting a Static Website on AWS

In this episode of Back to Basics, join SA Readiness Specialist Even Zhang as he breaks down the AWS services you can use to host and scale your static website without a single server. You’ll also learn how to use additional functionalities to enhance your observability and security posture or run A/B tests.

Figure 1. CloudFront Edge Locations and Caches from Back to Basics video

Field Notes: How to Prepare Large Text Files for Processing with Amazon Translate and Amazon Comprehend

2021-09-27 Veeresh Shringari

Post Syndicated from Veeresh Shringari original https://aws.amazon.com/blogs/architecture/field-notes-how-to-prepare-large-text-files-for-processing-with-amazon-translate-and-amazon-comprehend/

Biopharmaceutical manufacturing is a highly regulated industry where deviation documents are used to optimize manufacturing processes. Deviation documents in biopharmaceutical manufacturing processes are geographically diverse, spanning multiple countries and languages. The document corpus is complex, with additional requirements for complete encryption. Therefore, to reduce downtime and increase process efficiency, it is critical to automate the ingestion and understanding of deviation documents. For this workflow, a large biopharma customer needed to translate and classify documents at their manufacturing site.

The customer’s challenge included translation and classification of paragraph-sized text documents into statement types. First, the tokenizer previously used was failing for certain languages. Second, post-tokenization, big paragraphs were needed to be sliced into sizes smaller than 5,000 bytes to facilitate consumption into Amazon Translate and Amazon Comprehend. Because each sentence and paragraphs were of differing sizes, the customer needed to slice them so that each sentence and paragraph did not lose their context and meaning.

This blog post describes a solution to tokenize text documents into appropriate-sized chunks for easy consumption by Amazon Translate and Amazon Comprehend.

Overview of solution

The solution is divided into the following steps. Text data coming from the AWS Glue output is transformed and stored in Amazon Simple Storage Service (Amazon S3) in a .txt file. This transformed data is passed into the sentence tokenizer with slicing and encryption using AWS Key Management Service (AWS KMS). This data is now ready to be fed into Amazon Translate and Amazon Comprehend, and then to a Bidirectional Encoder Representations from Transformers (BERT) model for clustering. All of the models are developed and managed in Amazon SageMaker.

Prerequisites

For this walkthrough, you should have the following prerequisites:

The architecture in Figure 1 shows a complete document classification and clustering workflow running the sentence tokenizer solution (step 4) as an input to Amazon Translate and Amazon Comprehend. The complete architecture also uses AWS Glue crawlers, Amazon Athena, Amazon S3 , AWS KMS, and SageMaker.

Figure 1. Higher level architecture describing use of the tokenizer in the system

Solution steps

Ingest the streaming data from the daily pharma supply chain incidents from the AWS Glue crawlers and Athena-based view tables. AWS Glue is used for ETL (extract, transform, and load), while Athena helps to analyze the data in Amazon S3 for its integrity.
Ingest the streaming data into Amazon S3, which is AWS KMS encrypted. This limits any unauthorized access to the secured files, as required for the healthcare domain.
Enable the CloudWatch logs. CloudWatch logs help to store, monitor, and access error messages logged by SageMaker.
Open the SageMaker notebook using AWS console, and navigate to the integrated development environment (IDE) with Python notebook.

Solution description

Initialize the Amazon S3 client, and enable the get_execution role.

Figure 2. Code sample to initialize Amazon S3 Client execution roles

Figure 3 shows the code for tokenizing large paragraphs into sentences. This helps to feed a sentence of 5,000 byte chunks to Amazon Translate and Amazon Comprehend. Additionally, in the regulated environment, data at rest and in transition, is encrypted using AWS KMS (using S3 IO object) before chunking into 5,000-byte size files using last-in-first-out (LIFO) process.

Figure 3. Code sample with file chunking function and AWS KMS encryption

Figure 4 shows the function for writing the file chunks to objects in Amazon S3, and objects are AWS KMS encrypted.

Figure 4. Code sample for writing chunked 5,000-byte sized data to Amazon S3

Code sample

The following example code details the tokenizer and chunking tool which we subsequently run through SageMaker:
https://gitlab.aws.dev/shringv/aws-samples-aws-tokenizer-slicing-file

Cleaning up

To avoid incurring future charges, delete the resources (like S3 objects) used for the practice files after you have completed implementation of the solution.

Conclusion

In this blog post, we presented a solution which incorporates sentence-level tokenization with rules governing expected sentence size. The solution includes automation scripts to reduce bigger files into smaller chunked sizes of 5,000 bytes to facilitate Amazon Translate and Amazon Comprehend. The solution is effective for tokenizing and chunking complex environments with multi-language files. Furthermore, the solution uses file exchange security by using AWS KMS, as required by regulated industries.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Simplify data discovery for business users by adding data descriptions in the AWS Glue Data Catalog

2021-08-23 Karim Hammouda

Post Syndicated from Karim Hammouda original https://aws.amazon.com/blogs/big-data/simplify-data-discovery-for-business-users-by-adding-data-descriptions-in-the-aws-glue-data-catalog/

In this post, we discuss how to use AWS Glue Data Catalog to simplify the process for adding data descriptions and allows data analysts to access, search, and discover this cataloged metadata with BI tools.

In this solution, we use AWS Glue Data Catalog, to break the silos between cross-functional data producer teams, sometimes also known as domain data experts, and business-focused consumer teams that author business intelligence (BI) reports and dashboards.

Since you’re reading this post, you may also be interested in the following:

Build a DataOps platform to break the silos between data engineers and data analysts around running business logic for data processing

Data democratization and the need for self-service BI

To be able to extract insights and get value out of organizational-wide data assets, data consumers like data analysts need to understand the meaning of existing data assets. They rely on data platform engineers to perform such data discovery tasks on their behalf.

Although data platform engineers can programmatically extract and obtain some technical and operational metadata, such as database and table names and sizes, column schemas, and keys, this metadata is primarily used for organizing and manipulating data inside the data lake. They still rely on source data domain experts to gain more knowledge about the meaning of the data, its business context, and classification. It becomes more challenging when data domain experts tend to prioritize operational-critical requests and delay the analytical-related ones.

Such a cycled dependency, as illustrated in the following figure, can delay the organizational strategic vision for implementing a self-service data analytics platform to reduce the time of the data-to-insights process.

Solution overview

The Data Catalog fundamentally holds basic information about the actual data stored in various data sources, including but not limited to Amazon Simple Storage Service (Amazon S3), Amazon Relational Database Service (Amazon RDS), and Amazon Redshift. Information like data location, format, and columns schema can be automatically discovered and stored as tables, where each table specifies a single data store.

Throughout this post, we see how we can use the Data Catalog to make it easy for domain experts to add data descriptions, and for data analysts to access this metadata with BI tools.

First, we use the comment field in Data Catalog schema tables to store data descriptions with more business meaning. Comment fields aren’t like the other schema table fields (such as column name, data type, and partition key), which are typically populated automatically by an AWS Glue crawler.

We also use Amazon AI/ML capabilities to initially identify the description of each data entity. One way to do that is by using the Amazon Comprehend text analysis API. When we provide a sample of values for each data entity type, Amazon Comprehend natural language processing (NLP) models can identify a standard range of data classification, and we can use this as a description for identified data entities.

Next, because we need to identify entities unique to our domain or organization, we can use custom named entity recognition (NER) in Amazon Comprehend to add more metadata that is related to our business domain. One way to train custom NER models is to use Amazon SageMaker Ground Truth; for more information, see Developing NER models with Amazon SageMaker Ground Truth and Amazon Comprehend.

For this post, we use a dataset that has a table schema defined as per TPC-DS, and was generated using a data generator developed as part of AWS Analytics Reference Architecture code samples.

In this example, Amazon Comprehend API recognizes PII-related fields like Aid as a MAC address. While the none PII-related fields like Estatus, aren’t recognized. Therefore, the user enters a custom description manually, and we use the custom NER to automatically populate those fields, as shown in the following diagram.

After we add data meanings, we need to expose all the metadata captured in the Data Catalog to various data consumers. This can be done two different ways:

Via the AWS Glue Data Catalog API, which is more suitable for technical users comprehending ETL jobs programmatically using services like AWS Glue ETL or Amazon EMR
Via SQL-querying to information_schema for data analysts performing ad hoc SQL querying using services like Amazon Athena

We can also use the latter method to expose the Data Catalog to BI authors comprehending data analyses and dashboards using Amazon QuickSight, so we use the second method for this post.

We do this by defining an Athena dataset that queries the information_schema and allows BI authors to use the QuickSight capability of text search filter to search and discover data using its business meaning (see the following diagram).

Solution details

The core part of this solution is done using AWS Glue jobs. We use two AWS Glue jobs, which are responsible for calling Amazon Comprehend APIs and updating the AWS Glue Data Catalog with added data descriptions accordingly.

The first job (Glue_Comprehend_Job) performs the first stage of detection using the Amazon Comprehend Detect PII API, and the second job (Glue_Comprehend_Custom) uses Amazon Comprehend custom entity recognition for entities labeled by domain experts. The following diagram illustrates this workflow.

We describe the details of each stage in the upcoming sections.

You can integrate this workflow into your existing data processing pipeline, which might be orchestrated with AWS services like AWS Step Functions, Amazon Managed Workflows for Apache Airflow (Amazon MWAA), AWS Glue workflows, or any third-party orchestrator.

The workflow can complement AWS Glue crawler functionality and inherit the same logic for scheduling and running crawlers. On the other end, we can query the updated Data Catalog with data descriptions via Athena (see the following diagram).

To show an end-to-end implementation of this solution, we have adopted a choreographically built architecture with additional AWS Lambda helper functions, which communicate between AWS services, triggering the AWS Glue crawler and AWS Glue jobs.

Stage-one: Enrich the Data Catalog with a standard built-in Amazon Comprehend entity detector

To get started, Choose to launch a CloudFormation stack.

Define unique S3 bucket name and on the CloudFormation console, accept default values for the parameters.

This CloudFormation stack consists of the following:

An AWS Identity Access Management (IAM) role called Lambda-S3-Glue-comprehend.
An S3 bucket with a bucket name that can be defined based on preference.
A Lambda function called trigger_data_cataloging. This function is automatically triggered when any CSV file is uploaded to the folder row_data inside our S3 bucket. Then it creates an AWS Glue database if one doesn’t exist, and creates and runs an AWS Glue crawler called glue_crawler_comprehend.
An AWS Glue job called Glue_Comprehend_Job, which calls Amazon Comprehend APIs and updates the AWS Glue Data Catalog table accordingly.
A Lambda function called Glue_comprehend_workflow, which is triggered when the AWS Glue Crawler successfully finishes and calls the AWS Glue job Glue_Comprehend_Job.

To test the solution, create a prefix called row_data under the S3 bucket created from the CF stack, then upload the customer dataset sample to the prefix.

The first Lambda function is triggered to run the subsequent AWS Glue crawler and AWS Glue job to get data descriptions using Amazon Comprehend, and it updates the comment section of the dataset created in the AWS Glue Data Catalog.

Stage-two: Use Amazon Comprehend custom entity recognition

Amazon Comprehend was able to detect some of the entity types within the customer sample dataset. However, for the remaining undetected fields, we can get help from a domain data expert to label a sample dataset using Ground Truth. Then we use the labeled data output to train a custom NER model and rerun the AWS Glue job to update the comment column with a customized data description.

Train an Amazon Comprehend custom entity recognition model

One way to train Amazon Comprehend custom entity recognizers is to get augmented manifest information using Ground Truth to label the data. Ground Truth has a built-in NER task for creating labeling jobs so domain experts can identify entities in text. To learn more about how to create the job, see Named Entity Recognition.

As an example, we tagged three labels entities: customer information ID, current level of education, and customer credit rating. The domain experts get a web interface like one shown in the following screenshot to label the dataset.

We can use the output of the labeling job to train an Amazon Comprehend custom entity recognition model using the augmented manifest.

The augmented manifest option requires a minimum of 1,000 custom entity recognition samples. Another option can be to use a CSV file that contains the annotations of the entity lists for the training dataset. The required format depends on the type of CSV file that we provide. In this post, we use the CSV entity lists option with two sample files:

The entity list labels_entity_list .csv
The sample dataset training_dataset.csv

To create the training job, we can use the Amazon Comprehend console, the AWS Command Line Interface (AWS CLI), or the Amazon Comprehend API. For this post, we use the API to programmatically create a training Lambda function using the AWS SDK for Python, as shown on GitHub.

The training process can take approximately 15 minutes. When the training process is complete, choose the recognizer and make a note of the recognizer ARN, which we use in the next step.

Run custom entity recognition inference

When the training job is complete, create an Amazon Comprehend analysis job using the console or APIs as shown on GitHub.

The process takes approximately 10 minutes, and again we need to make a note of the output job file.

Create an AWS Glue job to update the Data Catalog

Now that we have the Amazon Comprehend inference output, we can use the following AWS CLI command to create an AWS Glue job that updates the Data Catalog Comment fields for this dataset with customized data description.

Download the AWS Glue job script from the GitHub repo, upload to the S3 bucket created from the CF Stack in stage-1, and run the following AWS CLI command:

aws glue create-job 
--name "Glue_Comprehend_Job_custom_entity" 
--role "Lambda-S3-Glue-comprehend" 
--command '{"Name" : "pythonshell", "ScriptLocation" : "s3://<Your S3 bucket>/glue_comprehend_workflow_custom.py","PythonVersion":"3"}'
--default-arguments '{"--extra-py-files": "s3://aws-bigdata-blog/artifacts/simplify-data-discovery-for-business-users/blog/python/library/boto3-1.17.70-py2.py3-none-any.whl" }'

After you create the AWS Glue job, edit the job script and update the bucket and key name variables with the output data location of the Amazon Comprehend analysis jobs and run the AWS Glue job. See the following code:

bucket ="<Bucket Name>"
key = "comprehend_output/<Random number>output/output.tar.gz"

When the job is complete, it updates the Data Catalog with customized data descriptions.

Expose Data Catalog data to data consumers for search and discovery

Data consumers that prefer using SQL can use Athena to run queries against the information_schema.columns table, which includes the comment field of the Data Catalog. See the following code:

SELECT table_catalog,
         table_schema,
         table_name,
         column_name,
         data_type,
         comment
FROM information_schema.columns
WHERE comment LIKE '%customer%'
AND table_name = 'row_data_row_data'

The following screenshot shows our query results.

The query searches all schema columns that might have any data meanings that contain customer; it returns crating, which contains customer in the comment field.

BI authors can use text search instead of SQL to search for data meanings of data stored in an S3 data lake. This can be done by setting up a visual layer on top of Athena inside QuickSight.

QuickSight is scalable, serverless, embeddable, and machine learning (ML) powered BI tool that is deeply integrated with other AWS services.

BI development in QuickSight is organized as a stack of datasets, analyses, and dashboards. We start by defining a dataset from a list of various integrated data sources. On top of this dataset, we can design multiple analyses to uncover hidden insights and trends in the dataset. Finally, we can publish these analyses as dashboards, which is the consumable form that can be shared and viewed across different business lines and stakeholders.

We want to help the BI authors while designing analyses to get a better knowledge of the datasets they’re working on. To do so, we first need to connect to the data source where the metadata is stored, in this case the Athena table information_schema.columns, so we create a dataset to act as a Data Catalog view inside QuickSight.

QuickSight offers different modes of querying data sources, which is decided as part of the dataset creation process. The first mode is called direct query, in which the fetching query runs directly against the external data source. The second mode is a caching layer called QuickSight Super-fast Parallel In-memory Calculation Engine (SPICE), which improves performance when data is shared and retrieved by various BI authors. In this mode, the data is stored locally and can be reused multiple times, instead of running queries against the data source every time the data needs to be retrieved. However, as with all caching solutions, you must take data volume limits into consideration while choosing datasets to be stored in SPICE.

In our case, we choose to keep the Data Catalog dataset in SPICE, because the volume of the dataset is relatively small and won’t consume a lot of SPICE resources. However, we need to decide if we want to refresh the data cached in SPICE. The answer depends on how frequently the data schema and Data Catalog change, but in any case we can use the built-in scheduling within QuickSight to refresh SPICE at the desired interval. For information about triggering a refresh in an event-based manner, see Event-driven refresh of SPICE datasets in Amazon QuickSight.

After we create the Data Catalog view as a dataset inside QuickSight stored in SPICE, we can use row-level security to restrict the access to this dataset. Each BI author has access with respect to their privileges for columns they can view metadata for.

Next, we see how we can allow BI authors to search through data descriptions populated in the comment field of the Data Catalog dataset. QuickSight offers features like filters, parameters, and controls to add more flexibility into QuickSight analyses and dashboards.

Finally, we use the QuickSight capability to add more than one dataset within an analysis view to allow BI authors to switch between the metadata for the dataset and the actual dataset. This allows the BI authors to self-serve, reducing dependency on data platform engineers to decide which columns they should use in their analyses.

To set up a simple Data Catalog search and discovery inside QuickSight, complete the following steps:

On the QuickSight console, choose Datasets in the navigation pane.
Choose New dataset.
For New data sources, choose Amazon Athena.
Name the dataset Data Catalog.
Choose Create data source.
For Choose your table, choose Use custom SQL.
For Enter custom SQL query, name the query Data Catalog Query.
Enter the following query:

SELECT * FROM information_schema.columns

Choose Confirm query.
Select Import to QuickSight SPICE for quicker analytics.
Choose Visualize.

Next, we design an analysis on the dataset we just created to access the Data Catalog through Athena.

When we choose Visualize, we’re redirected to the QuickSight workspace to start designing our analysis.

Under Visual types, choose Table.
Under Fields list, add table_name, column_name, and comment to the Values field well.

Next, we use the filter control feature to allow users to perform text search for data descriptions.

In the navigation pane, choose Filter.
Choose the plus sign (+) to access the Create a new filter list.
On the list of columns, choose comment to be the filter column.
From the options menu (…) on the filter, choose Add to sheet.

We should be able to see a new control being added into our analysis to allow users to search the comment field.

Now we can start a text search for data descriptions that contain customer, where QuickSight shows the list of fields matching the search criteria and provides table and column names accordingly.

Alternatively, we can use parameters to be associated with the filter control if needed, for example to connect one dashboard to another. For more information, see the GitHub repo.

Finally, BI authors can switch between the metadata view that we just created and the actual Athena table view (row_all_row_data), assuming it’s already imported (if not, we can use the same steps from earlier to import the new dataset).

In the navigation pane, choose Visualize.
Choose the pen icon to add, edit, replace, or remove datasets.
Choose Add dataset.
Add row_all_row_data.
Choose Select.

BI authors can now switch between data and metadata datasets.

They now have a metadata view along with the actual data view, so they can better understand the meaning of each column in the dataset they’re working on, and they can read any comment that can be passed from other teams within the organization without needing to do this manually.

Conclusion

In this post, we showed how to build a quick workflow using AWS Glue and Amazon AI/ML services to complement the AWS Glue crawler functionality. You can integrate this workflow into a typical AWS Glue data cataloging and processing pipeline to achieve alignment between cross-functional teams by simplifying and automating the process of adding data descriptions in the Data Catalog. This is an important step in data discovery, and the topic will be covered more in upcoming posts.

This solution is also a step towards implementing data privacy and protection regimes such as the Health Insurance Portability and Accountability Act (HIPAA) and General Data Protection Regulation (GDPR) by identifying sensitive data types like PII and enforcing access polices.

You can find the source code from this post on GitHub and use it to build your own solution. For more information about NER models, see Developing NER models with Amazon SageMaker Ground Truth and Amazon Comprehend.

About the Authors

Karim Hammouda is a Specialist Solutions Architect for Analytics at AWS with a passion for data integration, data analysis, and BI. He works with AWS customers to design and build analytics solutions that contribute to their business growth. In his free time, he likes to watch TV documentaries and play video games with his son.

Ahmed Raafat is a Senior Solutions Architect at Amazon Web Services, with a passion for machine learning solutions. Ahmed acts as a trusted advisor for many AWS enterprise customers to support and accelerate their cloud journey.

Automate Document Processing in Logistics using AI

2021-08-19 Manikanth Pasumarti

Post Syndicated from Manikanth Pasumarti original https://aws.amazon.com/blogs/architecture/automate-document-processing-in-logistics-using-ai/

Multi-modal transportation is one of the biggest developments in the logistics industry. There has been a successful collaboration across different transportation partners in supply chain freight forwarding for many decades. But there’s still a considerable overhead of paperwork processing for each leg of the trip. Tens of billions of documents are processed in ocean freight forwarding alone. Using manual labor to process these documents (purchase orders, invoices, bills of lading, delivery receipts, and more) is both expensive and error-prone.

In this blog post, we’ll address how to automate the document processing in the logistics industry. We’ll also show you how to integrate it with a centralized workflow management.

Automated document processing architecture

Figure 1. Architecture of document processing workflow

The solution workflow shown in Figure 1 is as follows:

Documents that belong to the same transaction are collected in an S3 bucket
The document processing workflow is initiated
The workflow orchestration is as follows:
- Document is processed via automation
- Relevant entities are extracted
- Extracted data is reviewed
- Order data is consolidated

This architecture uses Amazon Simple Storage Service (S3) for document storage, and Amazon Simple Queue Service (SQS) for workflow initiation. Amazon Textract is used for text extraction, Amazon Comprehend for entity extraction, and Amazon Augmented AI (A2I) for human review. This will ensure correct results in cases of low confidence predictions.

We use AWS Step Functions for the orchestration of document processing workflow. Step functions also help to improve the application resiliency with less code.

AWS Lambda functions are used to:

Detect if all required documents for a given transaction are available in Amazon S3
Kick off the process by creating an Amazon SQS message
Detect a new processing job from a generated SQS message
Extract text from PDFs using a Step Function
Extract entities from generated text using a Step Function
Control data completeness and accuracy
Initiate a human loop when needed using a Step Function
Consolidate the data collected from documents
Store the data into the database

Document ingestion and classification

There are several data ingestion options available such as AWS Transfer Family, AWS DataSync, and Amazon Kinesis Data Firehose. Choose the appropriate ingestion blueprints based on the type of data sources. Typical real-time ingestion blueprints include AWS Lambda processing and an Amazon CloudWatch event. The batch pipeline can leverage AWS Step Functions. This can be used to orchestrate the Lambda function that initiates the document processing workflow.

Here are some things to consider when building your document ingestion and storage solution:

Choose your bucket strategy. Amazon S3 is an object store. Analyze your data pipeline ingestion carefully and choose the correct S3 bucket strategy for each document type (bills, supplier invoices, and others.)
Organize your data. The data is organized in S3 buckets by layers: Raw, Staging, and Processed. Each has their own respective bucket policy and access control.
Build a creation tool. This is an automated data lake bucket/folder structure tool, based on your data ingestion requirements. You can use this same structure for user-created data.
Define data security requirements. Do this before you begin the ingestion process. Before ingesting new or current data sources into AWS, secure access to the data.
Review security credentials needed for access. After copying these credentials into AWS Systems Manager (SSM), apply an AWS Key Management Service (KMS) key to encrypt the file. This encrypted key string is stored in SSM to use for authentication.

Document processing workflow

Overview

The workflow checks the input buckets until it detects all the documents types necessary for a complete dataset. In our case, it is the invoice document and customs authorization form. Once both are detected, it generates a job request as a message in Amazon SQS. A Lambda function then processes the message and kicks off the Step Function flow (see Figure 2). The state machine then initiates the document processing, text extraction, and optional human review steps. AWS Step Functions are well suited for our use case due to its ability to manage long-running workflows.

Figure 2. Visual workflow of document processing in AWS Step Functions

Entity extraction

For each document, entities are extracted using Amazon Textract and Amazon Comprehend. These entities can include date, company, address, bill of materials, total cost, and invoice number.

Following is a sample invoice document that is fed to Amazon Textract, which extracts the form data and creates key-value pairs.

Figure 3. Highlighted different entities in the sample invoice document

See Figure 4 for an example of the key-value pairs extracted for the sample invoice. The keys here represent the form labels (“SHIP TO”) and the values represent form values (shipping address).

Figure 4. Key-value pairs of the invoice data, extracted by Amazon Textract

Amazon Textract also generates a raw text output that contains the entire text, as shown in Figure 5 following.

Figure 5. Raw text output of the invoice data extracted by Amazon Textract

To achieve a higher degree of confidence, Amazon Comprehend is used to identify and extract the custom entities. Amazon Comprehend is a natural language processing (NLP) service that uses machine learning (ML) to identify and extracts insights and entities from text data. You can train Amazon Comprehend to identify entities relevant to your organization. These can be product names, part numbers, department names, or other entities. You can also train Amazon Comprehend to categorize documents or assign relevant labels to text.

An Amazon Comprehend entity recognizer comes with a set of pre-built entity types. Amazon Comprehend can introduce custom entities to match our specific business needs. Some of the entities we want to identify are address and company name. We trained a custom recognizer to detect company names and addresses, see Figure 6.

Figure 6. Training details of custom entity recognizer

Figure 7 shows the resulting output from Amazon Comprehend:

Figure 7. Amazon Comprehend entity recognition output

The document is processed top-down, from left to right, from the sample invoice in Figure 3. We know that the first company and first address belongs to the Billing Company. And the second set belongs to the Shipment recipient. Along with detecting custom entities, Amazon Comprehend also outputs the confidence score of the extracted result.

Confidence scores can vary depending on how close training data is to actual data. In the example preceding, the first company entity came back with a score of 0.941. Let’s assume that we have set a minimum confidence score of 0.95. Anything below that threshold should be reviewed by a human. The following section describes the last step of our workflow.

Human review

Amazon Augmented AI (A2I) allows you to create and manage human loops. A human loop is a manual review task that gets assigned to a workforce. The workforce can be public, such as Mechanical Turk, or private, such as internal team or a paid contractor. In our example, we created a private workforce to review the entities we were not confident about. Figure 8 shows an example of the user interface that the reviewers use to assign entities to the proper text sections.

Figure 8. Manual review interface of Amazon A2I

Review tasks can be automatically submitted to the workforce based on dynamic criteria, after both AI-related steps are completed. It can be used to review the text detected by Amazon Textract when key data elements are missing (such as order amount or quantity). It can also review entities after invoking Amazon Comprehend.

Figure 9. Consolidated dataset of processed invoice and customs authorization data

After the manual review step, data can be consolidated (as shown in Figure 9) and stored into a relational database. It can also be shared with other business units such as Accounting or Customer Services. You can apply the same process to other document types such as custom forms, which are linked to the same transaction. This allows us to process and combine information that comes from disparate paper sources more efficiently.

Conclusion

This post demonstrates how document processing can be automated to process business documentation by using Amazon Textract, Amazon Comprehend and Amazon Augmented AI.

Deploying an automated solution in the logistics industry takes away the undifferentiated heavy lifting involved in manual document processing. This helps to cut down the delivery delays and track any missed deliveries. By providing a comprehensive view of the shipment, it increases the efficiency of back-office processing. It can also further simplify the data collection for audit purposes.

To learn more:

Benefits of Modernizing On-premise Analytics with an AWS Lake House

2021-07-30 Vikas Nambiar

Post Syndicated from Vikas Nambiar original https://aws.amazon.com/blogs/architecture/benefits-of-modernizing-on-premise-analytics-with-an-aws-lake-house/

Organizational analytics systems have shifted from running in the background of IT systems to being critical to an organization’s health.

Analytics systems help businesses make better decisions, but they tend to be complex and are often not agile enough to scale quickly. To help with this, customers upgrade their traditional on-premises online analytic processing (OLAP) databases to hyper converged infrastructure (HCI) solutions. However, these systems incur operational overhead, are limited by proprietary formats, have limited elasticity, and tie customers into costly and inhibiting licensing agreements. These all bind an organization’s growth to the growth of the appliance provider.

In this post, we provide you a reference architecture and show you how an AWS lake house will help you overcome the aforementioned limitations. Our solution provides you the ability to scale, integrate with multiple sources, improve business agility, and help future proof your analytics investment.

High-level architecture for implementing an AWS lake house

Lake house architecture uses a ring of purpose-built data consumers and services centered around a data lake. This approach acknowledges that a one-size-fits-all approach to analytics eventually leads to compromises. These compromises can include agility associated with change management and impact of different business domain reporting requirements on the data from a central platform. As such, simply integrating a data lake with a data warehouse is not sufficient.

Each step in Figure 1 needs to be de-coupled to build a lake house.

Figure 1. Data flow in a lake house

Figure 2. High-level design for an AWS lake house implementation

Building a lake house on AWS

These steps summarize building a lake house on AWS:

Identify source system extraction capabilities to define an ingestion layer that loads data into a data lake.
Build data ingestion layer using services that support source systems extraction capabilities.
Build a governance and transformation layer to manipulate data.
Provide capability to consume and visualize information via purpose-built consumption/value layer.

This lake house architecture provides you a de-coupled architecture. Services can be added, removed, and updated independently when new data sources are identified like data sources to enrich data via AWS Data Exchange. This can happen while services in the purpose-built consumption layer address individual business unit requirements.

Building the data ingestion layer

Services in this layer work directly with the source systems based on their supported data extraction patterns. Data is then placed into a data lake.

Figure 3 shows the following services to be included in this layer:

AWS Transfer Family for SFTP integrates with source systems to extract data using secure shell (SSH), SFTP, and FTPS/FTP. This service is for systems that support batch transfer modes and have no real-time requirements, such as external data entities.
AWS Glue connects to real-time data streams to extract, load, transform, clean, and enrich data.
AWS Database Migration Service (AWS DMS) connects and migrates data from relational databases, data warehouses, and NoSQL databases.

Figure 3. Ingestion layer against source systems

Services in this layer are managed services that provide operational excellence by removing patching and upgrade overheads. Being managed services, they will also detect extraction spikes and scale automatically or on-demand based on your specifications.

Building the data lake layer

A data lake built on Amazon Simple Storage Service (Amazon S3) provides the ideal target layer to store, process, and cycle data over time. As the central aspect of the architecture, Amazon S3 allows the data lake to hold multiple data formats and datasets. It can also be integrated with most if not all AWS services and third-party applications.

Figure 4 shows the following services to be included in this layer:

Amazon S3 acts as the data lake to hold multiple data formats.
Amazon S3 Glacier provides the data archiving and long-term backup storage layer for processed data. It also reduces the amount of data indexed by transformation layer services.

Figure 4. Data lake integrated to ingestion layer

The data lake layer provides 99.999999999% data durability and supports various data formats, allowing you to future proof the data lake. Data lakes on Amazon S3 also integrate with other AWS ecosystem services (for example, AWS Athena for interactive querying or third-party tools running off Amazon Elastic Compute Cloud (Amazon EC2) instances).

Defining the governance and transformation layer

Services in this layer transform raw data in the data lake to a business consumable format, along with providing operational monitoring and governance capabilities.

Figure 5 shows the following services to be included in this layer:

AWS Glue discovers and transforms data, making it available for search and querying.
Amazon Redshift (Transient) functions as an extract, transform, and load (ETL) node using RA3 nodes. RA3 nodes can be paused outside ETL windows. Once paused, Amazon Redshift’s data sharing capability allows for live data sharing for read purposes, which reduces costs to customers. It also allows for creation of separate, smaller read-intensive business intelligence (BI) instances from the larger write-intensive ETL instances required during ETL runs.
Amazon CloudWatch monitors and observes your enabled services. It integrates with existing IT service management and change management systems such as ServiceNow for alerting and monitoring.
AWS Security Hub implements a single security pane by aggregating, organizing, and prioritizing security alerts from services used, such as Amazon GuardDuty, Amazon Inspector, Amazon Macie, AWS Identity and Access Management (IAM) Access Analyzer, AWS Systems Manager, and AWS Firewall Manager.
Amazon Managed Workflows for Apache Airflow (MWAA) sequences your workflow events to ingest, transform, and load data.
Amazon Lake Formation standardizes data lake provisioning.
AWS Lambda runs custom transformation jobs if required, or developed over a period of time that hold custom business logic IP.

Figure 5. Governance and transformation layer prepares data in the lake

This layer provides operational isolation wherein least privilege access control can be implemented to keep operational staff separate from the core services. It also lets you implement custom transformation tasks using Lambda. This allows you to consistently build lakes across all environments and single view of security via AWS Security Hub.

Building the value layer

This layer generates value for your business by provisioning decoupled, purpose-built visualization services, which decouples business units from change management impacts of other units.

Figure 6 shows the following services to be included in this value layer:

Amazon Redshift (BI cluster) acts as the final store for data processed by the governance and transformation layer.
Amazon Elasticsearch Service (Amazon ES) conducts log analytics and provides real-time application and clickstream analysis, including for data from previous layers.
Amazon SageMaker prepares, builds, trains, and deploys machine learning models that provide businesses insights on possible scenarios such as predictive maintenance, churn predictions, demand forecasting, etc.
Amazon QuickSight acts as the visualization layer, allowing business and support resources users to create reports, dashboards accessible across devices and embedded into other business applications, portals, and websites.

Figure 6. Value layer with services for purpose-built consumption

Conclusion

By using services managed by AWS as a starting point, you can build a data lake house on AWS. This open standard based, pay-as-you-go data lake will help future proof your analytics platform. With the AWS data lake house architecture provided in this post, you can expand your architecture, avoid excessive license costs associated with proprietary software and infrastructure (along with their ongoing support costs). These capabilities are typically unavailable in on-premises OLAP/HCI based data analytics platforms.

Related information

Field Notes: Building a Scalable Real-Time Newsfeed Watchlist Using Amazon Comprehend

2021-07-28 Zahi Ben Shabat

Post Syndicated from Zahi Ben Shabat original https://aws.amazon.com/blogs/architecture/field-notes-building-a-scalable-real-time-newsfeed-watchlist-using-amazon-comprehend/

One of the challenges businesses have is to constantly monitor information via media outlets and be alerted when a key interest is picked up, such as individual, product, or company information. One way to do this is to scan media and news feeds against a company watchlist. The list may contain personal names, organizations or suborganizations of interest, and other type of entities (for example, company products). There are several reasons why a company might need to develop such a process: reputation risk mitigation, data leaks, competitor influence, and market change awareness.

In this post, I will share with you a prototype solution that combines several AWS Services: Amazon Comprehend, Amazon API Gateway, AWS Lambda, and Amazon Aurora Serverless. We will examine the solution architecture and I will provide you with the steps to create the solution in your AWS environment.

Overview of solution

Figure 1 – Architecture showing how to build a Scalable Real-Time Newsfeed Watchlist Using Amazon Comprehend

Walkthrough

The preceding architecture shows an Event-driven design. To interact with the solution we use API Lambda functions which are initiated upon user request. Following are the high-level steps.

Create a watchlist. Use the “Refresh_watchlist” API to submit new data, or load existing data from a CSV file located in a bucket. More details in the following “Loading Data” section.
Make sure the data is loaded properly and check a known keyword. Use the “check-keyword” API. More details in the following “Testing” section.
Once the watchlist data is ready, submit a request to “query_newsfeed” with a given newsfeed configuration (url, document section qualifier) to submit a new job to scan the content against the watchlist. Review the example in the “Testing” section.
If an entity or keyword matched, you will get a notification email with the match results.

Technical Walkthrough

When a new request to “query_newsfeed” is submitted. The Lambda handler extracts the content of the URL and creates a new message in the ‘Incoming Queue’.
Once there are available messages in the incoming queue, a subscribed Lambda function is invoked “evaluate content”. Thistakes the scraped content from the message and submits it to Amazon Comprehend to extract the desired elements (entities, key phrase, sentiment).
The result of Amazon Comprehend is passed through a matching logic component, which runs the results against the watchlist (Aurora Serverless Postgres DB), utilizing Fuzzy Name matching.
If a match occurs, a new message is generated for Amazon SNS which initiates a notification email.

To deploy and test the solution we follow four steps:

Create infrastructure
Create serverless API Layer
Load Watchlist data
Test the match

The code for Building a Scalable real-time newsfeed watchlist is available in this repository.

Prerequisites

You will need an AWS account and a Serverless Framework CLI Installed.

Security best practices

Before setting up your environment, review the following best practices, and if required change the source code to comply with your security standards.

Deploy the API and Lambda in your private VPC (learn more about deploying serverless in a VPC)
Protect your API Gateway using a WAF
Check AWS Managed Rules for AWS WAF
Check serverless plugin for WAF association
Review Security Best Practices for Amazon Simple Storage Service (Amazon S3)
Review Security Best Practices for AWS Key Management Service
Learn how to approve Amazon Comprehend at FSI Service Spotlight

Creating the infrastructure

I recommend reviewing these resources to get started with: Amazon Aurora Serverless, Lambda, Amazon Comprehend, and Amazon S3.

To begin the procedure:

Clone the GitHub repository to your local drive.
Navigate to “infrastructure” directory.
Use CDK or AWS CLI to deploy the stack:

aws cloudformation deploy --template RealtimeNewsAnalysisStack.template.json --stack-name RealtimeNewsAnalysis --parameter-overrides [email protected]
cdk synth / deploy –-parameters [email protected]

4. Navigate back to the root directory and into the “serverless” directory.

5. Initiate the following serverless deployment commands:

sls plugin install -n serverless-python-requirements 
sls deploy

6. Load data to the watchlist using a standard web service call to the “refresh_watchlist” API.

7. Test the service by calling the web service “check-keyword”.

8. Use “query_newsfeed” Web service to scan newsfeed and articles against the watchlist.

9. Check your mailbox for match notifications.

10. For cleanup and removal of the solution, review the “clean up” section at the end of this post.

The following screenshot shows best practices for images.

Figure 2 – Screenshot showing image best practices

Loading the watchlist data

We can use the refresh watchlist API to recreate the list with the list provided in the message. Use a tool like Postman to send a POST web service call to the refresh_watchlist.

Set the message body to RAW – JSON:

{
"refresh_list_from_bucket": false,
    "watchlist": [
        {"entity":"Mateo Jackson", "entity_type": "person"},
        {"entity":"AnyCompany", "entity_type": "organization"},
        {"entity":"Example product", "entity_type": "product"},
        {"entity":"Alice", "entity_type": "person"},
        {"entity":"Li Juan", "entity_type": "person"}
    ] }

It is possible to use a CSV file to load the data into the watchlist. Locate your newsfeed bucket and upload a CSV file “watchlist.csv” (no header required) under a directory “watchlist” in the newsfeed bucket (create the directory).

CSV Example:

CSV example table

The following is a screenshot showing how Postman initiates the request.

Figure 3 – Screenshot showing Postman initiate the request

Testing

You can use the dedicated check keyword API to test against a list of keywords to see if the match works. This does not utilize Amazon Comprehend, but it can verify that the list is loaded properly and match against a given criterion.

Figure 4 – You can use the dedicated check keyword API to test against a list of keywords

Note: the spelling mistake for alise with an “s” instead of “c”, and, the pronunciation of Li is spelled as Lee. Both returned as a match.

Now, let’s test it with a related news article.

Figure 5 – Screenshot showing test with a related news article

Check your mailbox! You should receive an email with the match result.

Cleaning up

Use cloudformation/cdk for clean up. Also, use serverless clean up `sls remove`.

Conclusion

In this post, you learned how to create a scalable watchlist and use it to monitor newsfeed content. This is a practical demonstration for a typical customer problem. The algorithms Levenshtein distance and soundex, along with Amazon Comprehend built-in machine learning capabilities, provides a powerful method to process and analyze text. To support a high volume of queries, the solution uses Amazon SQS to process messages and Amazon Aurora Serverless to automatically scale the database as needed. It is possible to use the same queue for additional data source ingestion.

This solution can be modified for additional purposes such as financial institutions OFAC watchlist (Work in progress) or other monitoring applications. Feel free to provide feedback and tell us how this solution can be useful for you.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

References

Developer Guide: Amazon Comprehend

Amazon Aurora Serverless

Amazon Simple Queue Service

PostgreSQL Documentation

Get started with Serverless Framework Open Source & AWS

Integrating Redaction of FinServ Data into a Machine Learning Pipeline

2021-07-26 Ravikant Gupta

Post Syndicated from Ravikant Gupta original https://aws.amazon.com/blogs/architecture/integrating-redaction-of-finserv-data-into-a-machine-learning-pipeline/

Financial companies process hundreds of thousands of documents every day. These include loan and mortgage statements that contain large amounts of confidential customer information.

Data privacy requires that sensitive data be redacted to protect the customer and the institution. Redacting digital and physical documents is time-consuming and labor-intensive. The accidental or inadvertent release of personal information can be devastating for the customer and the institution. Having automated processes in place reduces the likelihood of a data breach.

In this post, we discuss how to automatically redact personally identifiable information (PII) data fields from your financial services (FinServ) data through machine learning (ML) capabilities of Amazon Comprehend and Amazon Athena. This will ensure you comply with federal regulations and meet customer expectations.

Protecting data and complying with regulations

Protecting PII is crucial to complying with regulations like the California Consumer Privacy Act (CCPA), Europe’s General Data Protection Regulation (GDPR), and Payment Card Industry’s data security standards (PCI DSS).

In Figure 1, we show how structured and non-structured sensitive data stored in AWS data stores can be redacted before it is made available to data engineers and data scientists for feature engineering and building ML models in compliance with organizations data security policies.

Figure 1. How to redact confidential information in your ML pipeline

Architecture walkthrough

This section explains each step presented in Figure 1 and the AWS services used:

By using services like AWS DataSync, AWS Storage Gateway, and AWS Transfer Family, data can be ingested into AWS using batch or streaming pattern. This data lands in an Amazon Simple Storage Service (Amazon S3) bucket, we call this “raw data” in Figure 1.
To detect if the raw data bucket has any sensitive data, use Amazon Macie. Macie is a fully managed data security and data privacy service that uses ML and pattern matching to discover and protect your sensitive data in AWS. When Macie discovers sensitive data, you can configure it to tag the objects with an Amazon S3 object tag to identify that sensitive data was found in the object before progressing to the next stage of the pipeline. Refer to the Use Macie to discover sensitive data as part of automated data pipelines blog post for detailed instruction on building such pipeline.
This tagged data lands in a “scanned data” bucket, where we use Amazon Comprehend, a natural language processing (NLP) service that uses ML to uncover information in unstructured data. Amazon Comprehend works for unstructured text document data and redacts sensitive fields like credit card numbers, date of birth, social security number, passport number, and more. Refer to the Detecting and redacting PII using Amazon Comprehend blog post for step-by-step instruction on building such a capability.
If your pipeline requires redaction for specific use cases only, you can use the information in Introducing Amazon S3 Object Lambda – Use Your Code to Process Data as It Is Being Retrieved from S3 to redact sensitive data. Using this operation, an AWS Lambda function will intercept each GET request. It will redact data as necessary before it goes back to the requestor. This allows you to keep one copy of all the data and redact the data as it is requested for a specific workload. For further details, refer to the Amazon S3 Object Lambda Access Point to redact personally identifiable information (PII) from documents developer guide.
When you want to join multiple datasets from different data sources, use an Athena federated query. Using user-defined functions (UDFs) with Athena federated query will help you redact data in Amazon S3 or from other data sources such as an online transaction store like Amazon Relational Database Service (Amazon RDS), a data warehouse solution like Amazon Redshift, or a NoSQL store like Amazon DocumentDB. Athena supports UDFs, which enable you to write custom functions and invoke them in SQL queries. UDFs allow you to perform custom processing such as redacting sensitive data, compressing, and decompressing data or applying customized decryption. To read further on how you can get this set up refer to the Redacting sensitive information with user-defined functions in Amazon Athena blog post.
Redacted data lands in another S3 bucket that is now ready for any ML pipeline consumption.
Using AWS Glue DataBrew, the data preparation without writing any code. You can choose reusable recipes from over 250 pre-built transformations to automate data preparation tasks by jobs that can be scheduled based on your requirements.
Data is then used by Amazon SageMaker Data Wrangler to do feature engineering on curated data in data preparation (step 6). SageMaker Data Wrangler offers over 300 pre-configured data transformations, such as convert column type, one hot encoding, impute missing data with mean or median, rescale columns, and data/time embedding, so you can transform your data into formats that can be effectively used for models without writing a single line of code.
The output of the SageMaker Data Wrangler job is stored in Amazon SageMaker Feature Store, a purpose-built repository where you can store and access features to name, organize, and reuse them across teams. SageMaker Feature Store provides a unified store for features during training and real-time inference without the need to write additional code or create manual processes to keep features consistent.
Use ML features in SageMaker notebooks or SageMaker Studio for ML training on your redacted data. SageMaker notebook instance is an ML compute instance running the Jupyter Notebook App. Amazon SageMaker Studio is a web-based, integrated development environment for ML that lets you build, train, debug, deploy, and monitor your ML models. SageMaker Studio is integrated with SageMaker Data Wrangler.

Conclusion

Federal regulations require that financial institutions protect customer data. To achieve this, redact sensitive fields in your data.

In this post, we showed you how to use AWS services to meet these requirements with Amazon Comprehend and Amazon Athena. These services allow data engineers and data scientist in your organization to safely consume this data for machine learning pipelines.

CohnReznick Automates Claim Validation Workflow Using AWS AI Services

2021-07-23 Rajeswari Malladi

Post Syndicated from Rajeswari Malladi original https://aws.amazon.com/blogs/architecture/cohnreznick-automates-claim-validation-workflow-using-aws-ai-services/

This post was co-written by Winn Oo and Brendan Byam of CohnReznick and Rajeswari Malladi and Shanthan Kesharaju

CohnReznick is a leading advisory, assurance, and tax firm serving clients around the world. CohnReznick’s government and public sector practice provides claims audit and verification services for state agencies. This process begins with recipients submitting documentation as proof of their claim expenses. The supporting documentation often contains hundreds of filled-out or scanned (sometimes handwritten) PDFs, MS Word files, Excel spreadsheets, and/or pictures, along with a summary form outlining each of the claimed expenses.

Prior to automation with AWS artificial intelligence (AI) services, CohnReznick’s data extraction and validation process was performed manually. Audit professionals had to extract each data point from the submitted documentation, select a population sample for testing, and manually search the documentation for any pages or page sections that validated the information submitted. Validated data points and proof of evidence pages were then packaged into a single document and submitted for claim expense reimbursement.

In this blog post, we’ll show you how CohnReznick implemented Amazon Textract, Amazon Comprehend (with a custom machine learning classification model), and Amazon Augmented AI (Amazon A2I). With this solution, CohnReznick automated nearly 40% of the total claim verification process with focus on data extraction and package creation. This resulted in an estimated cost savings of $500k per year for each project and process.

Automating document processing workflow

Figure 1 shows the newly automated process. Submitted documentation is processed by Amazon Textract, which extracts text from the documents. This text is then submitted to Amazon Comprehend, which employs a custom classification model to classify the documents as claim summaries or evidence documents. All data points are collected from the Amazon Textract output of the claim summary documents. These data points are then validated against the evidence documents.

Finally, a population sample of the extracted data points is selected for testing. Rather than auditors manually searching for specific information in the documentation, the automated process conducts the data search, extracts the validated evidence pages from submitted documentation, and generates the audited package, which can then be submitted for reimbursement.

Figure 1. Architecture diagram

Components in the solution

At a high level, the end-to-end process starts with auditors using a proprietary web application to submit the documentation received for each case to the document processing workflow. The workflow includes three stages, as described in the following sections.

Text extraction

First, the process extracts the text from the submitted documents using the following steps:

For each case, the CohnReznick proprietary web application uploads the documents to the Amazon Simple Storage Service (Amazon S3) upload bucket. Each file has a unique name, and the files have metadata that associates them with the parent case.
The uploaded documents Amazon Simple Queue Service (Amazon SQS) queue is configured to receive notifications for all new objects added to the upload bucket. For every new document added to the upload bucket, Amazon S3 sends a notification to the uploaded documents queue.
The text extraction AWS Lambda function runs every 5 minutes to poll the uploaded documents queue for new messages.
For each message in the uploaded documents queue, the text extraction function submits an Amazon Textract job to process the document asynchronously. This continues until it reaches a predefined maximum allowed limit of concurrent jobs for that AWS account. Concurrency control is implemented by handling LimitExceededException on StartDocumentAnalysis API call.
After Amazon Textract finishes processing a document, it sends a completion notification to a completed jobs Amazon Simple Notification Service (Amazon SNS) topic.
A process job results Lambda function is subscribed to the completed jobs topic and receives a notification for every completed message sent to the completed jobs topic.
The process job results function then fetches document extraction results from Amazon Textract.
The process job results function stores the document extraction results in the Amazon Textract output bucket.

Documents classification

Next, the process classifies the documents. The submitted claim documents can consist of up to seven supporting document types. The documents need to be classified into the respective categories. They are primarily classified using automation. Any documents classified with a low confidence score are sent to a human review workflow.

Classification model creation

The custom classification feature of Amazon Comprehend is used to build a custom model to classify documents into the seven different document types as required by the business process. The model is trained by providing sample data in CSV format. Amazon Comprehend uses multiple algorithms in the training process and picks the model that delivers the highest accuracy for the training data.

Classification model invocation and processing

The automated document classification uses the trained model and the classification consists of the following steps:

The business logic in the process job results Lambda function determines text extraction completion for all documents for each case. It then calls the StartDocumentClassificationJob operation on the custom classifier model to start classifying unlabeled documents.
The document classification results from the custom classifier are returned as a single output.tar.gz file in the comprehend results S3 bucket.
At this point, the check confidence scores Lambda function is invoked, which processes the classification results.
The check confidence scores function reviews the confidence scores of classified documents. The results for documents with high confidence scores are saved to the classification results table in Amazon DynamoDB.

Human review

The documents from the automated classification that have low confidence scores are classified using human review with the following steps:

The check confidence scores Lambda function invokes human review with Amazon Augmented AI for documents with low confidence scores. Amazon A2I is a ready-to-use workflow service for human review of machine learning predictions.
The check confidence scores Lambda function creates human review tasks for each document with a low confidence score. Humans assigned to the classification jobs log into the human review portal and either approve the classification done by the model or reclassify the text with the right labels.
The results from human review are placed in the A2I results bucket.
The update results Lambda function is invoked to process results from the human review.
Finally, the update results function writes the human review document classification results to the classification results table in DynamoDB.

Additional processes

Documents workflow status capturing

The Lambda functions throughout the workflow update the status of their processing and document/case details in the workflow status table in DynamoDB. The auditor that submitted the case documents will know the status of the workflow of their submitted case using the data in workflow status table.

Search and package creation

When the processing is complete for a case, auditors perform the final review and submit the generated packet for downstream processing.

The web application uses AWS SDK for Java to integrate with the Textract output S3 bucket that has the document extraction results and classification results table in DynamoDB with classification results. This data is used for the search and package creation process.

Purge data process

After the package creation is complete, the auditor can purge all data in the workflow.

Using the AWS SDK, the data is purged from the S3 buckets and DynamoDB tables.

Conclusion

As seen in this blog post, Amazon Textract, Amazon Comprehend, and Amazon A2I for human review work together with Amazon S3, DynamoDB, and Lambda services. These services have helped CohnReznick automate nearly 40% of their total claim verification process with focus on data extraction and package creation.

You can achieve similar efficiencies and increase scalability by automating your business processes. Get started today by reading additional user stories and using the resources on automated document processing.

Create a secure data lake by masking, encrypting data, and enabling fine-grained access with AWS Lake Formation

2021-07-12 Shekar Tippur

Post Syndicated from Shekar Tippur original https://aws.amazon.com/blogs/big-data/create-a-secure-data-lake-by-masking-encrypting-data-and-enabling-fine-grained-access-with-aws-lake-formation/

You can build data lakes with millions of objects on Amazon Simple Storage Service (Amazon S3) and use AWS native analytics and machine learning (ML) services to process, analyze, and extract business insights. You can use a combination of our purpose-built databases and analytics services like Amazon EMR, Amazon Elasticsearch Service (Amazon ES), and Amazon Redshift as the right tool for your specific job and benefit from optimal performance, scale, and cost.

In this post, you learn how to create a secure data lake using AWS Lake Formation for processing sensitive data. The data (simulated patient metrics) is ingested through a serverless pipeline to identify, mask, and encrypt sensitive data before storing it securely in Amazon S3. After the data has been processed and stored, you use Lake Formation to define and enforce fine-grained access permissions to provide secure access for data analysts and data scientists.

Target personas

The proposed solution focuses on the following personas, with each one having different level of access:

Cloud engineer – As the cloud infrastructure engineer, you implement the architecture but may not have access to the data itself or to define access permissions
secure-lf-admin – As a data lake administrator, you configure the data lake setting and assign data stewards
secure-lf-business-analyst – As a business analyst, you shouldn’t be able to access sensitive information
secure-lf-data-scientist – As a data scientist, you shouldn’t be able to access sensitive information

Solution overview

We use the following AWS services for ingesting, processing, and analyzing the data:

Amazon Athena is an interactive query service that can query data in Amazon S3 using standard SQL queries using tables in an AWS Glue Data Catalog. The data can be accessed via JDBC for further processing such as displaying in business intelligence (BI) dashboards.
Amazon CloudWatch is a monitoring and observability service that provides you with data and actionable insights to monitor your applications, respond to system-wide performance changes, and more. The logs from AWS Glue jobs and AWS Lambda functions are saved in CloudWatch logs.
Amazon Comprehend is a natural language processing (NLP) service that uses ML to uncover information in unstructured data.
Amazon DynamoDB is a NoSQL database that delivers single-digit millisecond performance at any scale and is used to avoid processing duplicates files.
AWS Glue is a serverless data preparation service that makes it easy to extract, transform, and load (ETL) data. An AWS Glue job encapsulates a script that reads, processes, and writes data to a new schema. This solution uses Python3.6 AWS Glue jobs for ETL processing.
AWS IoT provides the cloud services that connect your internet of things (IoT) devices to other devices and AWS Cloud services.
Amazon Kinesis Data Firehose is the easiest way to reliably load streaming data into data lakes, data stores, and analytics services.
AWS Lake Formation makes it easy to set up, secure, and manage your data lake. With Lake Formation, you can discover, cleanse, transform, and ingest data into your data lake from various sources; define fine-grained permissions at the database, table, or column level; and share controlled access across analytic, ML, and ETL services.
Amazon S3 is a scalable object storage service that hosts the raw data files and processed files in the data lake for millisecond access.

You can enhance the security of your sensitive data with the following methods:

Implement encryption at rest using AWS Key Management Service (AWS KMS) and customer managed encryption keys
Instrument AWS CloudTrail and audit logging
Restrict access to AWS resources based on the least privilege principle

Architecture overview

The solution emulates diagnostic devices sending Message Queuing Telemetry Transport (MQTT) messages onto an AWS IoT Core topic. We use Kinesis Data Firehose to preprocess and stage the raw data in Amazon S3. We then use AWS Glue for ETL to further process the data by calling Amazon Comprehend to identify any sensitive information. Finally, we use Lake Formation to define fine-grained permissions that restrict access to business analysts and data scientists who use Athena to query the data.

The following diagram illustrates the architecture for our solution.

Prerequisites

To follow the deployment walkthrough, you need an AWS account. Use us-east-1 or us-west-2 as your Region.

For this post, make sure you don’t have Lake Formation enabled in your AWS account.

Stage the data

Download the zipped archive file to use for this solution and unzip the files locally. patient.csv file is dummy data created to help demonstrate masking, encryption, and granting fine-grained access. The send-messages.sh script randomly generates simulated diagnostic data to represent body vitals. AWS Glue job uses glue-script.py script to perform ETL that detects sensitive information, masks/encrypt data, and populates curated table in AWS Glue catalog.

Create an S3 bucket called secure-datalake-scripts-<ACCOUNT_ID> via the Amazon S3 console. Upload the scripts and CSV files to this location.

Deploy your resources

For this post, we use AWS CloudFormation to create our data lake infrastructure.

Choose Launch Stack:
Select I acknowledge that AWS CloudFormation might create IAM resources with custom names before deploying.

The stack takes approximately 5 minutes to complete.

The following screenshot shows the key-values the stack created. We use the TestUserPassword parameter for the Lake Formation personas to sign in to the AWS Management Console.

Load the simulation data

Stage the send-messages.sh script by running the Amazon S3 copy command:

aws s3 cp s3://secure-datalake-scripts-<ACCOUNT_ID>/send-messages.sh

Run your script by using the following command:

sh send-messages.sh.

The script runs for a few minutes and emits 300 messages. This sends MQTT messages to the secure_iot_device_analytics topic, filtered using IoT rules, processed using Kinesis Data Firehose, and converted to Parquet format. After a minute, data starts showing up in the raw bucket.

Run the AWS Glue ETL pipeline

Run AWS Glue workflow (secureGlueWorkflow) from the AWS Glue console; you can also schedule to run this using CloudWatch. It takes approximately 10 minutes to complete.

The AWS Glue job that is triggered as part of the workflow (ProcessSecureData) joins the patient metadata and patient metrics data. See the following code:

# Join Patient metadata and patient metrics dataframe
combined_df=Join.apply(patient_metadata, patient_metrics, 'PatientId', 'pid', transformation_ctx = "combined_df")

The ensuing dataframe contains sensitive information like FirstName, LastName, DOB, Address1, Address2, and AboutYourself. AboutYourself is freeform text entered by the patient during registration. In the following code snippet, the detect_sensitive_info function calls the Amazon Comprehend API to identify personally identifiable information (PII):

# Apply groupBy to get unique  AboutYourself records
group=combined_df.toDF().groupBy("pid","DOB", "FirstName", "LastName", "Address1", "Address2", "AboutYourself").count()
# Apply detect_sensitive_info to get the redacted string after masking  PII data
df_with_about_yourself = Map.apply(frame = group_df, f = detect_sensitive_info)
# Apply encryption to the identified fields
df_with_about_yourself_encrypted = Map.apply(frame = group_df, f = encrypt_rows)

Amazon Comprehend returns an object that has information about the entity name and entity type. Based on your needs, you can filter the entity types that need to be masked.

These fields are masked, encrypted, and written to their respective S3 buckets where fine-grained access controls are applied via Lake Formation:

Masked data – s3://secure-data-lake-masked-<ACCOUNT_ID>
secure-dl-masked-data/
Encrypted data – s3://secure-data-lake-masked-<ACCOUNT_ID>
secure-dl-encrypted-data/
Curated data – s3://secure-data-lake-<ACCOUNT_ID>
secure-dl-curated-data/

Now that the tables have been defined, we review permissions using Lake Formation.

Enable Lake Formation fine-grained access

To enable fine-grained access, we first add a Lake Formation admin user.

On the Lake Formation console, select Add other AWS users or roles.
On the drop-down menu, choose secure-lf-admin.
Choose Get started.
In the navigation pane, choose Settings.
On the Data Catalog Settings page, deselect Use only IAM access control for new databases and Use only IAM access control for new tables in new databases.
Choose Save.

Grant access to different personas

Before we grant permissions to different user personas, let’s register the S3 locations in Lake Formation so these personas can access S3 data without granting access through AWS Identity and Access Management (IAM).

On the Lake Formation console, choose Register and ingest in the navigation pane.
Choose Data lake locations.
Choose Register location.
Find and select each of the following S3 buckets and choose Register location:
1. s3://secure-raw-bucket-<ACCOUNT_ID>/temp-raw-table
2. s3://secure-data-lake-masked-<ACCOUNT_ID>/secure-dl-encrypted-data
3. s3://secure-data-lake-<ACCOUNT_ID>/secure-dl-curated-data
4. s3://secure-data-lake-masked-<ACCOUNT_ID>/secure-dl-masked-data

We’re now ready to grant access to our different users.

Grant read-only access to all the tables to secure-lf-admin

First, we grant read-only access to all the tables for the user secure-lf-admin.

Sign in to the console with secure-lf-admin (use the password value for TestUserPassword from the CloudFormation stack) and make sure you’re in the same Region.
Navigate to AWS Lake Formation console
Under Data Catalog, choose Databases.
Select the database secure-db.
On the Actions drop-down menu, choose Grant.
Select IAM users and roles.
Choose the role secure-lf-admin.
Under Policy tags or catalog resources, select Named data catalog resources.
For Database, choose the database secure-db.
For Tables, choose All tables.
Under Permissions, select Table permissions.
For Table permissions, select Super.
Choose Grant.
Choosesecure_dl_curated_data table.
On the Actions drop-down menu, chose View permissions.
Check IAMAllowedPrincipals and select Revoke and click on Revoke button.

You can confirm your user permissions on the Data Permissions page.

Grant read-only access to secure-lf-business-analyst

Now we grant read-only access to certain encrypted columns to the user secure-lf-business-analyst.

On the Lake Formation console, under Data Catalog, choose Databases.
Select the database secure-db and choose View tables.
Select the table secure_dl_encrypted_data.
On the Actions drop-down menu, choose Grant.
Select IAM users and roles.
Choose the role secure-lf-business-analyst.
Under Permissions, select Column-based permissions.
Choose the following columns:
1. count
2. address1_encrypted
3. firstname_encrypted
4. address2_encrypted
5. dob_encrypted
6. lastname_encrypted
For Grantable permissions, select Select.
Choose Grant.
Chose secure_dl_encrypted_data table.
On the Actions drop-down menu, chose View permissions.
Check IAMAllowedPrincipals and select Revoke and click on Revoke button.

You can confirm your user permissions on the Data Permissions page.

Grant read-only access to secure-lf-data-scientist

Lastly, we grant read-only access to masked data to the user secure-lf-data-scientist.

On the Lake Formation console, under Data Catalog, choose Databases.
Select the database secure-db and choose View tables.
Select the table secure_dl_masked_data.
On the Actions drop-down menu, choose Grant.
Select IAM users and roles.
Choose the role secure-lf-data-scientist.
Under Permissions, select Table permissions.
For Table permissions, select Select.
Choose Grant.
Under Data Catalog, chose Tables.
Chose secure_dl_masked_data table.
On the Actions drop-down menu, chose View permissions.
Check IAMAllowedPrincipals and select Revoke and click on Revoke button.

You can confirm your user permissions on the Data Permissions page.

Query the data lake using Athena from different personas

To validate the permissions of different personas, we use Athena to query against the S3 data lake.

Make sure you set the query result location to the location created as part of the CloudFormation stack (secure-athena-query-<ACCOUNT_ID>). The following screenshot shows the location information in the Settings section on the Athena console.

You can see all the tables listed under secure-db.

Sign in to the console with secure-lf-admin (use the password value for TestUserPassword from the CloudFormation stack) and make sure you’re in the same Region.
Navigate to Athena Console.
Run a SELECT query against the secure_dl_curated_data

The user secure-lf-admin should see all the columns with encryption or masking.

Now let’s validate the permissions of secure-lf-business-analyst user.

Sign in to the console with secure-lf-business-analyst.
Navigate to Athena console.
Run a SELECT query against the secure_dl_encrypted_data table.

The secure-lf-business-analyst user can only view the selected encrypted columns.

Lastly, let’s validate the permissions of secure-lf-data-scientist.

Sign in to the console with secure-lf-data-scientist.
Run a SELECT query against the secure_dl_masked_data table.

The secure-lf-data-scientist user can only view the selected masked columns.

If you try to run a query on different tables, such as secure_dl_curated_data, you get an error message for insufficient permissions.

Clean up

To avoid unexpected future charges, delete the CloudFormation stack.

Conclusion

In this post, we presented a potential solution for processing and storing sensitive data workloads in an S3 data lake. We demonstrated how to build a data lake on AWS to ingest, transform, aggregate, and analyze data from IoT devices in near-real time. This solution also demonstrates how you can mask and encrypt sensitive data, and use fine-grained column-level security controls with Lake Formation, which benefits those with a higher level of security needs.

Lake Formation recently announced the preview for row-level access; and you can sign up for the preview now!

About the Authors

Shekar Tippur is an AWS Partner Solutions Architect. He specializes in machine learning and analytics workloads. He has been helping partners and customers adopt best practices and discover insights from data.

Ramakant Joshi is an AWS Solution Architect, specializing in the analytics and serverless domain. He has over 20 years of software development and architecture experience, and is passionate about helping customers in their cloud journey.

Navnit Shukla is AWS Specialist Solution Architect, Analytics, and is passionate about helping customers uncover insights from their data. He has been building solutions to help organizations make data-driven decisions.

Forwarding emails automatically based on content with Amazon Simple Email Service

2021-05-19 Murat Balkan

Post Syndicated from Murat Balkan original https://aws.amazon.com/blogs/messaging-and-targeting/forwarding-emails-automatically-based-on-content-with-amazon-simple-email-service/

Introduction

Email is one of the most popular channels consumers use to interact with support organizations. In its most basic form, consumers will send their email to a catch-all email address where it is further dispatched to the correct support group. Often, this requires a person to inspect content manually. Some IT organizations even have a dedicated support group that handles triaging the incoming emails before assigning them to specialized support teams. Triaging each email can be challenging, and delays in email routing and support processes can reduce customer satisfaction. By utilizing Amazon Simple Email Service’s deep integration with Amazon S3, AWS Lambda, and other AWS services, the task of categorizing and routing emails is automated. This automation results in increased operational efficiencies and reduced costs.

This blog post shows you how a serverless application will receive emails with Amazon SES and deliver them to an Amazon S3 bucket. The application uses Amazon Comprehend to identify the dominant language from the message body. It then looks it up in an Amazon DynamoDB table to find the support group’s email address specializing in the email subject. As the last step, it forwards the email via Amazon SES to its destination. Archiving incoming emails to Amazon S3 also enables further processing or auditing.

Architecture

By completing the steps in this post, you will create a system that uses the architecture illustrated in the following image:

Architecture showing how to forward emails by content using Amazon SES

The flow of events starts when a customer sends an email to the generic support email address like [email protected]. This email is listened to by Amazon SES via a recipient rule. As per the rule, incoming messages are written to a specified Amazon S3 bucket with a given prefix.

This bucket and prefix are configured with S3 Events to trigger a Lambda function on object creation events. The Lambda function reads the email object, parses the contents, and sends them to Amazon Comprehend for language detection.

Amazon DynamoDB looks up the detected language code from an Amazon DynamoDB table, which includes the mappings between language codes and support group email addresses for these languages. One support group could answer English emails, while another support group answers French emails. The Lambda function determines the destination address and re-sends the same email address by performing an email forward operation. Suppose the lookup does not return any destination address, or the language was not be detected. In that case, the email is forwarded to a catch-all email address specified during the application deployment.

In this example, Amazon SES hosts the destination email addresses used for forwarding, but this is not a requirement. External email servers will also receive the forwarded emails.

Prerequisites

To use Amazon SES for receiving email messages, you need to verify a domain that you own. Refer to the documentation to verify your domain with Amazon SES console. If you do not have a domain name, you will register one from Amazon Route 53.

Deploying the Sample Application

Clone this GitHub repository to your local machine and install and configure AWS SAM with a test AWS Identity and Access Management (IAM) user.

You will use AWS SAM to deploy the remaining parts of this serverless architecture.

The AWS SAM template creates the following resources:

An Amazon DynamoDB mapping table (language-lookup) contains information about language codes and associates them with destination email addresses.
An AWS Lambda function (BlogEmailForwarder) that reads the email content parses it, detects the language, looks up the forwarding destination email address, and sends it.
An Amazon S3 bucket, which will store the incoming emails.
IAM roles and policies.

To start the AWS SAM deployment, navigate to the root directory of the repository you downloaded and where the template.yaml AWS SAM template resides. AWS SAM also requires you to specify an Amazon Simple Storage Service (Amazon S3) bucket to hold the deployment artifacts. If you haven’t already created a bucket for this purpose, create one now. You will refer to the documentation to learn how to create an Amazon S3 bucket. The bucket should have read and write access by an AWS Identity and Access Management (IAM) user.

At the command line, enter the following command to package the application:

sam package --template template.yaml --output-template-file output_template.yaml --s3-bucket BUCKET_NAME_HERE

In the preceding command, replace BUCKET_NAME_HERE with the name of the Amazon S3 bucket that should hold the deployment artifacts.

AWS SAM packages the application and copies it into this Amazon S3 bucket.

When the AWS SAM package command finishes running, enter the following command to deploy the package:

sam deploy --template-file output_template.yaml --stack-name blogstack --capabilities CAPABILITY_IAM --parameter-overrides FromEmailAddress=info@ YOUR_DOMAIN_NAME_HERE CatchAllEmailAddress=catchall@ YOUR_DOMAIN_NAME_HERE

In the preceding command, change the YOUR_DOMAIN_NAME_HERE with the domain name you validated with Amazon SES. This domain also applies to other commands and configurations that will be introduced later.

This example uses “blogstack” as the stack name, you will change this to any other name you want. When you run this command, AWS SAM shows the progress of the deployment.

Configure the Sample Application

Now that you have deployed the application, you will configure it.

Configuring Receipt Rules

To deliver incoming messages to Amazon S3 bucket, you need to create a Rule Set and a Receipt rule under it.

Note: This blog uses Amazon SES console to create the rule sets. To create the rule sets with AWS CloudFormation, refer to the documentation.

Navigate to the Amazon SES console. From the left navigation choose Rule Sets.
Choose Create a Receipt Rule button at the right pane.
Add info@YOUR_DOMAIN_NAME_HERE as the first recipient addresses by entering it into the text box and choosing Add Recipient.

Choose the Next Step button to move on to the next step.

On the Actions page, select S3 from the Add action drop-down to reveal S3 action’s details. Select the S3 bucket that was created by the AWS SAM template. It is in the format of your_stack_name-inboxbucket-randomstring. You will find the exact name in the outputs section of the AWS SAM deployment under the key name InboxBucket or by visiting the AWS CloudFormation console. Set the Object key prefix to info/. This tells Amazon SES to add this prefix to all messages destined to this recipient address. This way, you will re-use the same bucket for different recipients.

Choose the Next Step button to move on to the next step.

In the Rule Details page, give this rule a name at the Rule name field. This example uses the name info-recipient-rule. Leave the rest of the fields with their default values.

Choose the Next Step button to move on to the next step.

Review your settings on the Review page and finalize rule creation by choosing Create Rule

In this example, you will be hosting the destination email addresses in Amazon SES rather than forwarding the messages to an external email server. This way, you will be able to see the forwarded messages in your Amazon S3 bucket under different prefixes. To host the destination email addresses, you need to create different rules under the default rule set. Create three additional rules for catchall@YOUR_DOMAIN_NAME_HERE , english@ YOUR_DOMAIN_NAME_HERE and french@YOUR_DOMAIN_NAME_HERE email addresses by repeating the steps 2 to 5. For Amazon S3 prefixes, use catchall/, english/, and french/ respectively.

Configuring Amazon DynamoDB Table

To configure the Amazon DynamoDB table that is used by the sample application

Navigate to Amazon DynamoDB console and reach the tables view. Inspect the table created by the AWS SAM application.

language-lookup table is the table where languages and their support group mappings are kept. You need to create an item for each language, and an item that will hold the default destination email address that will be used in case no language match is found. Amazon Comprehend supports more than 60 different languages. You will visit the documentation for the supported languages and add their language codes to this lookup table to enhance this application.

To start inserting items, choose the language-lookup table to open table overview page.
Select the Items tab and choose the Create item From the dropdown, select Text. Add the following JSON content and choose Save to create your first mapping object. While adding the following object, replace Destination attribute’s value with an email address you own. The email messages will be forwarded to that address.

{

“language”: “en”,

“destination”: “english@YOUR_DOMAIN_NAME_HERE”

}

Lastly, create an item for French language support.

{

“language”: “fr”,

“destination”: “french@YOUR_DOMAIN_NAME_HERE”

}

Testing

Now that the application is deployed and configured, you will test it.

Use your favorite email client to send the following email to the domain name info@ email address.

Subject: I need help

Body:

Hello, I’d like to return the shoes I bought from your online store. How can I do this?

After the email is sent, navigate to the Amazon S3 console to inspect the contents of the Amazon S3 bucket that is backing the Amazon SES Rule Sets. You will also see the AWS Lambda logs from the Amazon CloudWatch console to confirm that the Lambda function is triggered and run successfully. You should receive an email with the same content at the address you defined for the English language.

Next, send another email with the same content, this time in French language.

Subject: j’ai besoin d’aide

Body:

Bonjour, je souhaite retourner les chaussures que j’ai achetées dans votre boutique en ligne. Comment puis-je faire ceci?

Suppose a message is not matched to a language in the lookup table. In that case, the Lambda function will forward it to the catchall email address that you provided during the AWS SAM deployment.

You will inspect the new email objects under english/, french/ and catchall/ prefixes to observe the forwarding behavior.

Continue experimenting with the sample application by sending different email contents to info@ YOUR_DOMAIN_NAME_HERE address or adding other language codes and email address combinations into the mapping table. You will find the available languages and their codes in the documentation. When adding a new language support, don’t forget to associate a new email address and Amazon S3 bucket prefix by defining a new rule.

Cleanup

To clean up the resources you used in your account,

Navigate to the Amazon S3 console and delete the inbox bucket’s contents. You will find the name of this bucket in the outputs section of the AWS SAM deployment under the key name InboxBucket or by visiting the AWS CloudFormation console.
Navigate to AWS CloudFormation console and delete the stack named “blogstack”.
After the stack is deleted, remove the domain from Amazon SES. To do this, navigate to the Amazon SES Console and choose Domains from the left navigation. Select the domain you want to remove and choose Remove button to remove it from Amazon SES.
From the Amazon SES Console, navigate to the Rule Sets from the left navigation. On the Active Rule Set section, choose View Active Rule Set button and delete all the rules you have created, by selecting the rule and choosing Action, Delete.
On the Rule Sets page choose Disable Active Rule Set button to disable listening for incoming email messages.
On the Rule Sets page, Inactive Rule Sets section, delete the only rule set, by selecting the rule set and choosing Action, Delete.
Navigate to CloudWatch console and from the left navigation choose Logs, Log groups. Find the log group that belongs to the BlogEmailForwarderFunction resource and delete it by selecting it and choosing Actions, Delete log group(s).
You will also delete the Amazon S3 bucket you used for packaging and deploying the AWS SAM application.

Conclusion

This solution shows how to use Amazon SES to classify email messages by the dominant content language and forward them to respective support groups. You will use the same techniques to implement similar scenarios. You will forward emails based on custom key entities, like product codes, or you will remove PII information from emails before forwarding with Amazon Comprehend.

With its native integrations with AWS services, Amazon SES allows you to enhance your email applications with different AWS Cloud capabilities easily.

To learn more about email forwarding with Amazon SES, you will visit documentation and AWS blogs.

Solution Overview

Solution Walkthrough

Step 1: Upload Documents to Be Scanned to S3

Step 2: Creating a Custom Classifiction Model

Custom Classification Model

Step 3: Creating an Endpoint in Amazon Comprehend

Step 4: Scanning Text with the Custom Classification Endpoint

Real-time analysis:

Real-time analysis in Amazon Comprehend:

Step 5 (optional): Publish message to communication service

Conclusion

About the Authors

Caroline Des Rochers

Erika Houde Pearce

Koushik Mani

#10: Build a serverless retail solution for endless aisle on AWS

#9: Optimizing data with automated intelligent document processing solutions

#8: Disaster Recovery Solutions with AWS managed services, Part 3: Multi-Site Active/Passive

#7: Simulating Kubernetes-workload AZ failures with AWS Fault Injection Simulator

#6: Let’s Architect! Designing event-driven architectures

#5: Use a reusable ETL framework in your AWS lake house architecture

#4: Invoking asynchronous external APIs with AWS Step Functions

#3: Announcing updates to the AWS Well-Architected Framework

#2: Let’s Architect! Designing architectures for multi-tenancy

#1: Understand resiliency patterns and trade-offs to architect efficiently in the cloud

Bonus! Three older special mentions

Why it’s challenging to process and manage unstructured data

Solution overview

Process unstructured data with AWS AI services

Implement access control on raw and processed data in Amazon S3

Conclusion

About the Authors

Customer document management challenges

Automated intelligent document processing solution

Solution overview

Customer benefits

Vertical industry use cases

Conclusion

Steps illustrated with more detail:

Cleanup

Conclusion

Real world language detection solution architecture

Leverage SNS fanout pattern for more complex use cases

Conclusion

1. AWS Best Practices for DDoS Resiliency

2. Predictive Modeling for Automotive Retail

3. AWS Graviton Performance Testing – Tips for Independent Software Vendors

4. Text Analysis with Amazon OpenSearch Service and Amazon Comprehend

5. Back to Basics: Hosting a Static Website on AWS

Overview of solution

Prerequisites

Solution steps

Solution description

Code sample

Cleaning up

Conclusion

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Data democratization and the need for self-service BI

Solution overview

Solution details

Stage-one: Enrich the Data Catalog with a standard built-in Amazon Comprehend entity detector

Stage-two: Use Amazon Comprehend custom entity recognition

Train an Amazon Comprehend custom entity recognition model

Run custom entity recognition inference

Create an AWS Glue job to update the Data Catalog

Expose Data Catalog data to data consumers for search and discovery

Conclusion

About the Authors

Automated document processing architecture

Document ingestion and classification

Document processing workflow

Overview

Entity extraction

Human review

Conclusion

High-level architecture for implementing an AWS lake house

Building a lake house on AWS

Building the data ingestion layer

Building the data lake layer

Defining the governance and transformation layer