
Amazon Translate now supports Office documents

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/amazon-translate-now-supports-office-documents/

Whether your organization is a multinational enterprise present in many countries, or a small startup hungry for global success, translating your content into local languages may be an enduring challenge. Indeed, text data often comes in many formats, and processing it may require several different tools. Also, because these tools may not all support the same language pairs, you may have to convert certain documents to intermediate formats, or even resort to manual translation. All of these issues add extra cost, and create unnecessary complexity in building consistent and automated translation workflows.

Amazon Translate aims to solve these problems in a simple and cost-effective fashion. Using either the AWS console or a single API call, Amazon Translate makes it easy for AWS customers to quickly and accurately translate text in 55 different languages and variants.

Earlier this year, Amazon Translate introduced batch translation for plain text and HTML documents. Today, I’m very happy to announce that batch translation now also supports Office documents, namely .docx, .xlsx and .pptx files as defined by the Office Open XML standard.

Introducing Amazon Translate for Office Documents
The process is extremely simple. As you would expect, source documents have to be stored in an Amazon Simple Storage Service (S3) bucket. Please note that no document may be larger than 20 MB or contain more than 1 million characters.

Each batch translation job processes a single file type and a single source language. Thus, we recommend that you organize your documents in a logical fashion in S3, storing each file type and each language under its own prefix.

Then, using either the AWS console or the StartTextTranslationJob API in one of the AWS language SDKs, you can launch a translation job, passing:

  • the input and output location in S3,
  • the file type,
  • the source and target languages.

Once the job is complete, you can collect translated files at the output location.
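
If you'd rather script this step, here's a minimal sketch using the AWS SDK for Python (boto3); the bucket names, prefixes, and IAM role ARN are placeholders to replace with your own.

import boto3

translate = boto3.client('translate')

# Start a batch translation job for .docx files stored under a common S3 prefix.
# The role must allow Amazon Translate to read the input and write the output.
response = translate.start_text_translation_job(
    JobName='docx-en-to-es',
    InputDataConfig={
        'S3Uri': 's3://my-input-bucket/docx/en/',
        'ContentType': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
    },
    OutputDataConfig={
        'S3Uri': 's3://my-output-bucket/docx/es/'
    },
    DataAccessRoleArn='arn:aws:iam::123456789012:role/TranslateBatchRole',
    SourceLanguageCode='en',
    TargetLanguageCodes=['es']
)
print(response['JobId'], response['JobStatus'])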

Let’s do a quick demo!

Translating Office Documents
Using the S3 console, I first upload a few .docx documents to one of my buckets.

S3 files

Then, moving to the Translate console, I create a new batch translation job, giving it a name, and selecting both the source and target languages.

Creating a batch job

Then, I define the location of my documents in S3, and their format, .docx in this case. Optionally, I could apply a custom terminology, to make sure specific words are translated exactly the way that I want.

Likewise, I define the output location for translated files. Please make sure that this path exists, as Translate will not create it for you.

Creating a batch job

Finally, I set the AWS Identity and Access Management (IAM) role, giving my Translate job the appropriate permissions to access S3. Here, I use an existing role that I created previously, but you can also let Translate create one for you. Then, I click on ‘Create job’ to launch the batch job.

Creating a batch job

The job starts immediately.

Batch job running

A little while later, the job is complete. All three documents have been translated successfully.

Viewing a completed job

Translated files are available at the output location, as visible in the S3 console.

Viewing translated files

Downloading one of the translated files, I can open it and compare it to the original version.

Comparing files

For small-scale use, it’s extremely easy to use the AWS console to translate Office files. Of course, you can also use the Translate API to build automated workflows.

Automating Batch Translation
In a previous post, we showed you how to automate batch translation with an AWS Lambda function. You could expand on this example, and add language detection with Amazon Comprehend. For instance, here’s how you could combine the DetectDominantLanguage API with the Python-docx open source library to detect the language of .docx files.

import boto3
from docx import Document

# Read the first paragraph of the Word document
document = Document('blog_post.docx')
text = document.paragraphs[0].text

# Ask Amazon Comprehend for the dominant language of that text
comprehend = boto3.client('comprehend')
response = comprehend.detect_dominant_language(Text=text)
top_language = response['Languages'][0]
code = top_language['LanguageCode']
score = top_language['Score']
print("%s, %f" % (code, score))

Pretty simple! You could also detect the type of each file based on its extension, and move it to the proper input location in S3. Then, you could schedule a Lambda function with CloudWatch Events to periodically translate files, and send a notification by email. Of course, you could use AWS Step Functions to build more elaborate workflows. Your imagination is the limit!
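
As a rough illustration of that routing idea, here's a sketch of a Lambda handler (the bucket name, prefixes, and extension-to-prefix mapping are assumptions) that copies each new object under a per-format prefix:

import os
import urllib.parse
import boto3

s3 = boto3.client('s3')

# Map file extensions to the input prefixes used by the batch translation jobs
PREFIX_BY_EXTENSION = {'.docx': 'docx/', '.xlsx': 'xlsx/', '.pptx': 'pptx/'}

def handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])
        extension = os.path.splitext(key)[1].lower()
        prefix = PREFIX_BY_EXTENSION.get(extension)
        if prefix is None:
            continue  # Ignore unsupported file types
        # Copy the object under the prefix matching its format
        s3.copy_object(
            Bucket='my-translate-input-bucket',
            Key=prefix + os.path.basename(key),
            CopySource={'Bucket': bucket, 'Key': key}
        )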

Getting Started
You can start translating Office documents today in the following regions: US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Ireland), Europe (London), Europe (Frankfurt), and Asia Pacific (Seoul).

If you’ve never tried Amazon Translate, did you know that the free tier offers 2 million characters per month for the first 12 months, starting from your first translation request?

Give it a try, and let us know what you think. We’re looking forward to your feedback: please post it to the AWS Forum for Amazon Translate, or send it to your usual AWS support contacts.

– Julien

Measure Effectiveness of Virtual Training in Real Time with AWS AI Services

Post Syndicated from Rajeswari Malladi original https://aws.amazon.com/blogs/architecture/measure-effectiveness-of-virtual-training-in-real-time-with-aws-ai-services/

As per International Data Corporation (IDC), worldwide spending on digital transformation will reach $2.3 trillion in 2023. As organizations adopt digital transformation, training becomes an important aspect of this journey. Whether these are internal trainings to upskill the existing workforce or packaged content for commercial use, they need to be efficient and cost-effective. With the advent of education technology, it is common practice to deliver trainings via digital platforms. This makes them accessible to a larger population and is cost-effective, but it is important that the trainings are interactive and effective. According to a recent article published by Forbes, immersive education and data-driven insights are among the top five Education Technology (EdTech) innovations. These are the key characteristics of creating an effective training experience.

An earlier blog series explored how to build a virtual trainer on AWS using Amazon Sumerian. This series illustrated how to easily build an immersive and highly engaging virtual training experience without needing additional devices or complex virtual reality platform management. These trainings are easy to maintain and are cost effective.

In this blog post, we will further extend the architecture to gather real-time feedback about the virtual trainings and create data-driven insights to measure their effectiveness with the help of Amazon artificial intelligence (AI) services.

Architecture and its benefits


Virtual training on AWS and AI Services – Architecture

Consider a scenario where you are a vendor in the health care sector. You’ve developed a cutting-edge device, such as patient vital monitoring hardware, that goes through frequent software upgrades and is about to be rolled out across different U.S. hospitals. The nursing staff needs to be well trained before it can begin using the device. Let’s take a look at an architecture to solve this problem. We will first explain the architecture for building the training, and then we will show how to measure its effectiveness.

At the core of the architecture is Amazon Sumerian. Sumerian is a managed service that lets you create and run 3D, Augmented Reality (AR), and Virtual Reality (VR) applications. Within Sumerian, real-life scenes from a hospital environment can be created by importing assets from the asset library. Scenes include one or more hosts: AI-driven animated characters with built-in animation, speech, and behavior. The hosts act as virtual trainers that interact with the nursing staff. The speech component assigns text to the virtual trainer for playback with Amazon Polly. Polly helps convert training content from Sumerian to lifelike speech in real time and ensures the nursing staff receives the latest content related to the equipment on which it’s being trained.

The nursing staff accesses the training via web browsers on iOS or Android mobile devices or laptops, and authenticates using Amazon Cognito. Cognito is a service that lets you easily add user sign-up and authentication to your mobile and web apps. Sumerian then uses the Cognito identity pool to create temporary credentials to access AWS services.

The flow of the interactions within Sumerian is controlled using a visual state machine in the Sumerian editor. Within the editor, the dialogue component assigns an Amazon Lex chatbot to an entity, in this case the virtual trainer or host. Lex is a service for building conversational interfaces with voice and text. It provides you the ability to have interactive conversations with the nursing staff, understand its areas of interest, and deliver appropriate training material. This is an important aspect of the architecture where you can customize the training per users’ needs.

Lex has native interoperability with AWS Lambda, a serverless compute offering where you just write and run your code in Lambda functions. Lambda can be used to validate user inputs or apply any business logic, such as fetching the user selected training material from Amazon DynamoDB (or another database) in real time. This material is then delivered to Lex as a response to user queries.
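
As an illustration, here's a minimal sketch of such a fulfillment function for a Lex (V1) bot; the table name, slot name, and attribute names are hypothetical.

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('TrainingMaterial')  # hypothetical table name

def handler(event, context):
    # Lex passes the resolved slot values with the fulfillment request
    topic = event['currentIntent']['slots'].get('TrainingTopic') or 'general'  # hypothetical slot

    item = table.get_item(Key={'topic': topic}).get('Item', {})
    content = item.get('content', 'Sorry, I could not find material on that topic.')

    # Return the material to Lex so the host can speak it back to the trainee
    return {
        'dialogAction': {
            'type': 'Close',
            'fulfillmentState': 'Fulfilled',
            'message': {'contentType': 'PlainText', 'content': content}
        }
    }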

You can extend the state machine within the Sumerian editor to introduce new interactive flows to collect user feedback. Amazon Lex collects user feedback, which is saved in Amazon Simple Storage Service (S3) and analyzed by Amazon Comprehend. Amazon Comprehend is a natural language processing service that uses AI to find meaning, insights, and sentiment in text. Insights from user feedback are stored in S3, which is a highly scalable, durable, and highly available object store.

You can analyze the insights from user feedback using Amazon Athena, an interactive query service which analyzes data in S3 using standard SQL. You can then easily build visualizations using Amazon QuickSight.
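
For example, here's a small sketch of running such a query programmatically with boto3; the database, table, and output location are placeholders.

import time
import boto3

athena = boto3.client('athena')

# Count feedback items per detected sentiment (database and table names are hypothetical)
query = "SELECT sentiment, COUNT(*) AS total FROM feedback_insights GROUP BY sentiment"

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={'Database': 'training_feedback'},
    ResultConfiguration={'OutputLocation': 's3://my-athena-results-bucket/'}
)
query_id = execution['QueryExecutionId']

# Poll until the query finishes, then print the result rows
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(1)

if state == 'SUCCEEDED':
    for row in athena.get_query_results(QueryExecutionId=query_id)['ResultSet']['Rows']:
        print([col.get('VarCharValue') for col in row['Data']])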

By using this architecture, you not only deliver the virtual training to your nursing staff in an immersive environment created by Amazon Sumerian, but you can also gather the feedback interactively. You can gain insights from this feedback and iterate over it to make the training experience more effective.

Conclusion and next steps

In this blog post we reviewed the architecture to build interactive trainings and measure their effectiveness. The serverless nature of this architecture makes it cost effective, agile, and easy to manage, and you can apply it to a number of use cases. For example, an educational institution can develop training content designed for multiple learning levels, and the training level can be adjusted in real time based on live interactions with the students. In a manufacturing scenario, you can build a digital twin of your process and train your resources to handle different scenarios with full interaction. You can integrate AWS services just like Lego blocks, and you can further expand this architecture to integrate with Amazon Kendra to build interactive FAQs, or with Amazon Comprehend Medical to build trainings for the healthcare industry. Happy building!

Creating a searchable enterprise document repository

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/creating-a-searchable-enterprise-document-repository/

Enterprise customers frequently have repositories with thousands of documents, images and other media. While these contain valuable information for users, it’s often hard to index and search this content. One challenge is interpreting the data intelligently, and another issue is processing these files at scale.

In this blog post, I show how you can deploy a serverless application that uses machine learning to interpret your documents. The architecture includes a queueing mechanism for handling large volumes, and posting the indexing metadata to an Amazon Elasticsearch domain. This solution is scalable and cost effective, and you can modify the functionality to meet your organization’s requirements.

The overall architecture for a searchable document repository solution.

The application takes unstructured data and applies machine learning to extract context for a search service, in this case Elasticsearch. There are many business uses for this design. For example, this could help Human Resources departments in searching for skills in PDF resumes. It can also be used by Marketing departments or creative groups with a large number of images, providing a simple way to search for items within the images.

As documents and images are added to the S3 bucket, these events invoke AWS Lambda functions. These functions run custom code to extract data from the files, and also call Amazon ML services for interpretation. For example, when adding a resume, the Lambda function extracts text from the PDF file while Amazon Comprehend determines the key phrases and topics in the document. For images, it uses Amazon Rekognition to determine the contents. In both cases, once it identifies the indexing attributes, it saves the data to Elasticsearch.
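
Stripped down to the two service calls, that interpretation step could look roughly like this (the bucket and object names are placeholders, and the full application in the GitHub repo does more):

import boto3

comprehend = boto3.client('comprehend')
rekognition = boto3.client('rekognition')

# For a document: find the key phrases in the extracted text
text = "Experienced data engineer with a background in Python and Spark."
phrases = comprehend.detect_key_phrases(Text=text, LanguageCode='en')
print([p['Text'] for p in phrases['KeyPhrases']])

# For an image stored in S3: detect labels describing its contents
labels = rekognition.detect_labels(
    Image={'S3Object': {'Bucket': 'my-documents-bucket', 'Name': 'photos/team-offsite.jpg'}},
    MaxLabels=10,
    MinConfidence=80
)
print([l['Name'] for l in labels['Labels']])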

The code uses the AWS Serverless Application Model (SAM), enabling you to deploy the application easily in your own AWS Account. This walkthrough creates resources covered in the AWS Free Tier but you may incur cost for large data imports. Additionally, it requires an Elasticsearch domain, which may incur cost on your AWS bill.

To set up the example application, visit the GitHub repo and follow the instructions in the README.md file.

Creating an Elasticsearch domain

This application requires an Elasticsearch development domain for testing purposes. To learn more about production configurations, see the Elasticsearch documentation.

To create the test domain:

  1. Navigate to the Amazon Elasticsearch console. Choose Create a new domain.
  2. For Deployment type, choose Development and testing. Choose Next.
  3. In the Configure Domain page:
    1. For Elasticsearch domain name, enter serverless-docrepo.
    2. Change Instance Type to t2.small.elasticsearch.
    3. Leave all the other defaults. Choose Next at the end of the page.
  4. In Network Configuration, choose Public access. This is adequate for a tutorial but in a production use-case, it’s recommended to use VPC access.
  5. Under Access Policy, in the Domain access policy dropdown, choose Custom access policy.
  6. Select IAM ARN and in the Enter principal field, enter your AWS account ID (learn how to find your AWS account ID). In the Select Action dropdown, select Allow.
  7. Under Encryption, leave HTTPS checked. Choose Next.
  8. On the Review page, review your domain configuration, and then choose Confirm.

Your domain is now being configured, and the Domain status shows Loading in the Overview tab.

After creating the Elasticsearch domain, the domain status shows ‘Loading’.

It takes 10-15 minutes to fully configure the domain. Wait until the Domain status shows Active before continuing. When the domain is ready, note the Endpoint address since you need this in the application deployment.

The Endpoint for an Elasticsearch domain.

Deploying the application

After cloning the repo to your local development machine, open a terminal window and change to the cloned directory.

  1. Run the SAM build process to create an installation package, and then deploy the application:
    sam build
    sam deploy --guided
  2. In the guided deployment process, enter unique names for the S3 buckets when prompted. For the ESdomain parameter, enter the Elasticsearch domain Endpoint from the previous section.
  3. After the deployment completes, note the Lambda function ARN shown in the output:
    Lambda function ARN output.
  4. Back in the Elasticsearch console, select the Actions dropdown and choose Modify Access Policy. Paste the Lambda function ARN as an AWS Principal in the JSON, in addition to the root user, as follows:
    Modify access policy for Elasticsearch.
  5. Choose Submit. This grants the Lambda function access to the Elasticsearch domain.

Testing the application

To test the application, you need a few test documents and images with the file types DOCX (Microsoft Word), PDF, and JPG. This walkthrough uses multiple files to illustrate the queuing process.

  1. Navigate to the S3 console and select the Documents bucket from your deployment.
  2. Choose Upload and select your sample PDF or DOCX files:
    Upload files to S3.
  3. Choose Next on the following three pages to complete the upload process. The application now analyzes these documents and adds the indexing information to Elasticsearch.

To query Elasticsearch, first you must generate an Access Key ID and Secret Access Key. For detailed steps, see this documentation on creating security credentials. Next, use Postman to create an HTTP request for the Elasticsearch domain:

  1. Download and install Postman.
  2. From Postman, enter the Elasticsearch endpoint, adding /_search?q=keyword. Replace keyword with a search term.
  3. On the Authorization tab, complete the Access Key ID and Secret Access Key fields with the credentials you created. For Service Name, enter es.
    Postman query with AWS authorization.
  4. Choose Send. Elasticsearch responds with document results matching your search term.
    REST response from Elasticsearch.
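
If you prefer to query from code rather than Postman, a sketch along these lines works with the requests and requests-aws4auth packages (the region and endpoint are placeholders):

import boto3
import requests
from requests_aws4auth import AWS4Auth

region = 'us-east-1'  # assumption: the region hosting the Elasticsearch domain
endpoint = 'https://search-serverless-docrepo-xxxxxxxx.us-east-1.es.amazonaws.com'

credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key,
                   region, 'es', session_token=credentials.token)

# Search the domain for a keyword, exactly like the Postman request above
response = requests.get(endpoint + '/_search', params={'q': 'keyword'}, auth=awsauth)
print(response.json())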

How this works

This application creates a processing pipeline between the originating Documents bucket and the Elasticsearch domain. Each document type has a custom parser, preparing the content in a Queuing bucket. It uses an Amazon SQS queue to buffer work, which is fetched by a Lambda function and analyzed with Amazon Comprehend. Finally, the indexing metadata is saved in Elasticsearch.

Serverless architecture for text extraction and classification.

  1. Documents and images are saved in the Documents bucket.
  2. Depending upon the file type, this triggers a custom parser Lambda function.
    1. For PDFs and DOCX files, the extracted text is stored in a Staging bucket. If the content is longer than 5,000 characters, it is broken into smaller files by a Lambda function, then saved in the Queued bucket.
    2. For JPG files, the parser uses Amazon Rekognition to detect the contents of the image. The labeling metadata is stored in the Queued bucket.
  3. When items are stored in the Queued bucket, this triggers a Lambda function to add the job to an SQS queue.
  4. The Analyze function is invoked when there are messages in the SQS queue. It uses Amazon Comprehend to find entities in the text. This function then stores the metadata in Elasticsearch.

S3 and Lambda both scale to handle the traffic. The Elasticsearch domain is not serverless, however, so it’s possible to overwhelm this instance with requests. There may be a large number of objects stored in the Documents bucket triggering the workflow, so the application uses an SQS queue to decouple the workload and smooth out the traffic. When large numbers of objects are processed, you see Messages Available increase in the SQS queue:

SQS queue buffers messages to smooth out traffic.
For the Lambda function consuming messages from the SQS queue, the BatchSize configured in the SAM template controls the rate of processing. The function continues to fetch messages from the SQS queue while Messages Available is greater than zero. This can be a useful mechanism for protecting downstream services that are not serverless, or simply cannot scale to match the processing rates of the upstream application. In this case, it provides a more consistent flow of indexing data to the Elasticsearch domain.

In a production use-case, you would scale the Elasticsearch domain depending upon the load of search requests, not just the indexing traffic from this application. This tutorial uses a minimal Elasticsearch configuration, but this service is capable of supporting enterprise-scale use-cases.

Conclusion

Enterprise document repositories can be a rich source of information but can be difficult to search and index. In this blog post, I show how you can use a serverless approach to build a scalable solution easily. With minimal code, we can use Amazon ML services to create the indexing metadata. Powerful image recognition and language comprehension capabilities make the metadata more useful and the search solution more accurate.

This also shows how serverless solutions can be used with existing non-serverless infrastructure, like Elasticsearch. By decoupling scalable serverless applications, you can protect downstream services from heavy traffic loads, even as Lambda scales up. Elasticsearch provides a fast, easy way to query your document repository once the serverless application has completed the indexing process.

To learn more about how to use Elasticsearch for production workloads, see the documentation on managing domains.

Converting call center recordings into useful data for analytics

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/converting-call-center-recordings-into-useful-data-for-analytics/

Many businesses operate call centers that record conversations with customers for training or regulatory purposes. These vast collections of audio offer unique opportunities for improving customer service. However, since audio data is mostly unsearchable, it’s usually archived in these systems and never analyzed for insights.

Developing machine learning models for accurately understanding and transcribing speech is also a major challenge. These models require large datasets for optimal performance, along with teams of experts to build and maintain the software. This puts it out of reach for the majority of businesses and organizations. Fortunately, you can use AWS services to handle this difficult problem.

In this blog post, I show how you can use a serverless approach to analyze audio data from your call center. You can clone this application from the GitHub repo and modify it to meet your needs. The solution uses Amazon ML services, together with scalable storage and serverless compute. The example application has the following architecture:

The architecture for the call center audio analyzer.

For call center analysis, this application is useful to determine the types of general topics that customers are calling about. It can also detect the sentiment of the conversation, so if the call is a compliment or a complaint, you could take additional action. When combined with other metadata such as caller location or time of day, this can yield important insights to help you improve customer experience. For example, you might discover there are common service issues in a geography at a certain time of day.

To set up the example application, visit the GitHub repo and follow the instructions in the README.md file.

How the application works

A key part of the serverless solution is Amazon S3, an object store that scales to meet your storage needs. When new objects are stored, this triggers AWS Lambda functions, which scale to keep pace with S3 usage. The application coordinates activities between the S3 bucket and two managed Machine Learning (ML) services, storing the results in an Amazon DynamoDB table.

The ML services used are:

  • Amazon Transcribe, which transcribes audio data into JSON output, using a process called automatic speech recognition. This can understand 31 languages and dialects, and identify different speakers in a customer support call.
  • Amazon Comprehend, which offers sentiment analysis as one of its core features. This service returns an array of scores to estimate the probability that the input text is positive, negative, neutral, or mixed.

Sample application architecture.

  1. A downstream process, such as a call recording system, stores audio data in the application’s S3 bucket.
  2. When the MP3 objects are stored, this triggers the Transcribe function. The function creates a new job in the Amazon Transcribe service.
  3. When the transcription process finishes, Transcribe stores the JSON result in the same S3 bucket.
  4. This JSON object triggers the Sentiment function. The Sentiment function requests a sentiment analysis from the Comprehend service.
  5. After receiving the sentiment scores, this function stores the results in a DynamoDB table.
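
Reduced to the core service calls, the handlers behind steps 2 and 4–5 could look roughly like this; the table name, key names, and job naming are illustrative, and the full versions are in the GitHub repo.

import json
import urllib.parse
import boto3

transcribe = boto3.client('transcribe')
comprehend = boto3.client('comprehend')
s3 = boto3.client('s3')
table = boto3.resource('dynamodb').Table('audio-sentiment')  # hypothetical table

def transcribe_handler(event, context):
    # Triggered by new .mp3 objects: start an asynchronous transcription job
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])
        transcribe.start_transcription_job(
            # Job names must be unique and use a restricted character set
            TranscriptionJobName=key.replace('/', '-').replace('.mp3', ''),
            Media={'MediaFileUri': f's3://{bucket}/{key}'},
            MediaFormat='mp3',
            LanguageCode='en-US',
            OutputBucketName=bucket  # Transcribe writes the JSON result back to the same bucket
        )

def sentiment_handler(event, context):
    # Triggered by the .json transcript: score the sentiment and store the result
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])
        transcript_doc = json.loads(s3.get_object(Bucket=bucket, Key=key)['Body'].read())
        text = transcript_doc['results']['transcripts'][0]['transcript']
        # Comprehend accepts a limited amount of text per call; a real function splits longer transcripts
        sentiment = comprehend.detect_sentiment(Text=text[:5000], LanguageCode='en')
        table.put_item(Item={
            'id': key,
            'transcript': text,
            'sentiment': sentiment['Sentiment'],
            'scores': json.dumps(sentiment['SentimentScore'])
        })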

There is only one bucket used in the application. The two Lambda functions are triggered by the same bucket, using different object suffixes. This is configured in the SAM template, shown here:

  SentimentFunction:
    ...
      Events:
        FileUpload:
          Type: S3
          Properties:
            Bucket: !Ref InputS3Bucket
            Events: s3:ObjectCreated:*
            Filter: 
              S3Key:
                Rules:
                  - Name: suffix
                    Value: '.json'              

  TranscribeFunction:
    ... 
      Events:
        FileUpload:
          Type: S3
          Properties:
            Bucket: !Ref InputS3Bucket
            Events: s3:ObjectCreated:*
            Filter: 
              S3Key:
                Rules:
                  - Name: suffix
                    Value: '.mp3'    

Testing the application

To test the application, you need an MP3 audio file containing spoken text. For example, in my testing, I use audio files of a person reading business reviews representing positive, neutral, and negative experiences.

  1. After cloning the GitHub repo, follow the instructions in the README.md file to deploy the application. Note the name of the S3 bucket output in the deployment.
    SAM deployment CLI output
  2. Upload your test MP3 files using this command in a terminal, replacing your-bucket-name with the deployed bucket name:
    aws s3 cp .\ s3://your-bucket-name --recursive
    Once executed, your terminal shows the uploaded media files:

    Uploading sample media files.

  3. Navigate to the Amazon Transcribe console, and choose Transcription jobs in the left-side menu. The MP3 files you uploaded appear here as separate jobs:
    Amazon Transcribe jobs in progress
  4. Once the Status column shows all pending jobs as Complete, navigate to the DynamoDB console.
  5. Choose Tables from the left-side menu and select the table created by the deployment. Choose the Items tab:
    Sentiment scores in the DynamoDB table
    Each MP3 file appears as a separate item with a sentiment rating and a probability for each sentiment category. It also includes the transcript of the audio.

Handling multiple languages

One of the most useful aspects of serverless architecture is the ability to add functionality easily. For call centers handling multiple languages, ideally you should translate to a common language for sentiment scoring. With this application, it’s easy to add an extra step to the process to translate the transcription language to a base language:

Advanced application architecture

A new Translate Lambda function is invoked by the S3 JSON suffix filter and creates text output in a common base language. The sentiment scoring function is triggered by new objects with the suffix TXT.

In this modified case, when the MP3 audio file is uploaded to S3, you can append the language identifier as metadata to the object. For example, to upload an MP3 with a French language identifier using the AWS CLI:

aws s3 cp .\test-audio-fr.mp3 s3://your-bucket --metadata Content-Language=fr-FR
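
On the Lambda side, a minimal sketch of reading that metadata back could look like this (boto3 returns user metadata keys in lowercase; the default language is an assumption):

import boto3

s3 = boto3.client('s3')

def get_language_code(bucket, key):
    # Read the Content-Language metadata set during upload; fall back to US English
    head = s3.head_object(Bucket=bucket, Key=key)
    return head['Metadata'].get('content-language', 'en-US')

# The returned value (for example 'fr-FR') is then passed as the LanguageCode
# parameter of start_transcription_job, instead of a hard-coded 'en-US'.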

The first Lambda function passes the language identifier to the Transcribe service. In the Transcribe console, the language appears in the new job:

French transcription job complete

After the job finishes, the JSON output is stored in the same S3 bucket. It shows the transcription from the French language audio:

French transcription output

The new Translate Lambda function passes the transcript value into the Amazon Translate service. This converts the French to English and saves the translation as a text file. The sentiment Lambda function now uses the contents of this text file to generate the sentiment scores.
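
A minimal sketch of that translation step, assuming English as the base language and leaving out the splitting of very long transcripts:

import boto3

translate = boto3.client('translate')
s3 = boto3.client('s3')

def translate_transcript(transcript_text, bucket, output_key):
    # Convert the transcript to the base language used for sentiment scoring
    result = translate.translate_text(
        Text=transcript_text[:5000],
        SourceLanguageCode='auto',   # let Amazon Translate detect the source language
        TargetLanguageCode='en'
    )
    # Save the translation as a .txt object, which triggers the sentiment function
    s3.put_object(Bucket=bucket, Key=output_key, Body=result['TranslatedText'].encode('utf-8'))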

This approach allows you to accept audio in a wide range of spoken languages but standardize your analytics in one base language.

Developing for extensibility

You might want to take action on phone calls that have a negative sentiment score, or publish scores to other applications in your organization. This architecture makes it simple to extend functionality once DynamoDB saves the sentiment scores. By using DynamoDB Streams, you can invoke a Lambda function each time a record is created or updated in the underlying DynamoDB table:

Adding notifications to the application

In this case, the routing function could trigger an email via Amazon SES when the sentiment score is negative. For example, this could email a manager to follow up with the customer. Alternatively, you may choose to publish all scores and results to any downstream application with Amazon EventBridge. By publishing events to the default event bus, you can allow consuming applications to build new functionality without needing any direct integration.
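
As a sketch, such a routing function could look like this; the email addresses and item attribute names are assumptions.

import boto3

ses = boto3.client('ses')

def handler(event, context):
    for record in event['Records']:
        if record['eventName'] not in ('INSERT', 'MODIFY'):
            continue
        item = record['dynamodb']['NewImage']
        sentiment = item.get('sentiment', {}).get('S', '')
        if sentiment == 'NEGATIVE':
            # Notify a manager so they can follow up with the customer
            ses.send_email(
                Source='alerts@example.com',
                Destination={'ToAddresses': ['manager@example.com']},
                Message={
                    'Subject': {'Data': 'Negative call sentiment detected'},
                    'Body': {'Text': {'Data': 'Call record: ' + item.get('id', {}).get('S', 'unknown')}}
                }
            )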

Deferred execution in Amazon Transcribe

The services used in the example application are all highly scalable and highly available, and can handle significant amounts of data. Amazon Transcribe allows up to 100 concurrent transcription jobs – see the service limits and quotas for more information.

The service also provides a mechanism for deferred execution, which allows you to hold jobs in a queue. When the number of executing jobs falls below the concurrent execution limit, the service takes the next job from this queue. This effectively means you can submit any number of jobs to the Transcribe service, and it manages the queue and processing automatically.

To use this feature, there are two additional attributes used in the startTranscriptionJob method of the AWS.TranscribeService object. When added to the Lambda handler in the Transcribe function, the code looks like this:

Deferred execution for Amazon Transcribe
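
The screenshot shows the JavaScript SDK; as a rough boto3 equivalent, the two attributes are passed in the JobExecutionSettings parameter (job name, bucket, and role ARN are placeholders):

import boto3

transcribe = boto3.client('transcribe')

transcribe.start_transcription_job(
    TranscriptionJobName='my-queued-job',
    Media={'MediaFileUri': 's3://my-audio-bucket/calls/example.mp3'},
    MediaFormat='mp3',
    LanguageCode='en-US',
    JobExecutionSettings={
        'AllowDeferredExecution': True,  # queue the job if the concurrency limit is reached
        'DataAccessRoleArn': 'arn:aws:iam::123456789012:role/TranscribeDataAccessRole'
    }
)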

After setting AllowDeferredExecution to true, you must also provide an IAM role ARN in the DataAccessRoleArn attribute. For more information on how to use this feature, see the Transcribe documentation for job execution settings.

Conclusion

In this blog post, I show how to transcribe the content of audio files and calculate a sentiment score. This can be useful for organizations wanting to analyze saved audio for customer calls, webinars, or team meetings.

This solution uses Amazon ML services to handle the audio and text analysis, and serverless services like S3 and Lambda to manage the storage and business logic. The serverless application here can scale to handle large amounts of production data. You can also easily extend the application to provide new functionality, built specifically for your organization’s use-case.

To learn more about building serverless applications at scale, visit the AWS Serverless website.

New – Amazon Comprehend Medical Adds Ontology Linking

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-amazon-comprehend-medical-adds-ontology-linking/

Amazon Comprehend is a natural language processing (NLP) service that uses machine learning to find insights in unstructured text. It is very easy to use, with no machine learning experience required. You can customize Comprehend for your specific use case, for example creating custom document classifiers to organize your documents into your own categories, or custom entity types that analyze text for your specific terms. However, medical terminology can be very complex and specific to the healthcare domain.

For this reason, last year we introduced Amazon Comprehend Medical, a HIPAA eligible natural language processing service that makes it easy to use machine learning to extract relevant medical information from unstructured text. Using Comprehend Medical, you can quickly and accurately gather information such as medical condition, medication, dosage, strength, and frequency from a variety of sources, like doctors’ notes, clinical trial reports, and patient health records.

Today, we are adding the capability of linking the information extracted by Comprehend Medical to medical ontologies.

An ontology provides a declarative model of a domain that defines and represents the concepts existing in that domain, their attributes, and the relationships between them. It is typically represented as a knowledge base, and made available to applications that need to use or share knowledge. Within health informatics, an ontology is a formal description of a health-related domain.

The ontologies supported by Comprehend Medical are:

  • ICD-10-CM, to identify medical conditions as entities and link related information such as diagnosis, severity, and anatomical distinctions as attributes of that entity. This is a diagnosis code set that is very useful for population health analytics, and for getting payments from insurance companies based on medical services rendered.
  • RxNorm, to identify medications as entities and link attributes such as dose, frequency, strength, and route of administration to that entity. Healthcare providers use these concepts to enable use cases like medication reconciliation, which is the process of creating the most accurate list possible of all medications a patient is taking.

For each ontology, Comprehend Medical returns a ranked list of potential matches. You can use confidence scores to decide which matches make sense, or what might need further review. Let’s see how this works with an example.

Using Ontology Linking
In the Comprehend Medical console, I start by providing some unstructured doctor’s notes as input:

At first, I use some functionalities that were already available in Comprehend Medical to detect medical and protected health information (PHI) entities.

Among the recognized entities (see this post for more info) there are some symptoms and medications. Medications are recognized as generics or brands. Let’s see how we can connect some of these entities to more specific concepts.

I use the new features to link those entities to RxNorm concepts for medications.

In the text, only the parts mentioning medications are detected. In the details of the answer, I see more information. For example, let’s look at one of the detected medications:

  • The first occurrence of the term “Clonidine” (in the second line of the input text above) is linked to the generic concept (on the left in the image below) in the RxNorm ontology.
  • The second occurrence of the term “Clonidine” (in the fourth line of the input text above) is followed by an explicit dosage, and is linked to a more prescriptive format that includes the dosage (on the right in the image below) in the RxNorm ontology.

To look for medical conditions using ICD-10-CM concepts, I am giving a different input:

The idea again is to link the detected entities, like symptoms and diagnoses, to specific concepts.

As expected, diagnoses and symptoms are recognized as entities. In the detailed results those entities are linked to the medical conditions in the ICD-10-CM ontology. For example, the two main diagnoses described in the input text are the top results, and specific concepts in the ontology are inferred by Comprehend Medical, each with its own score.

In production, you can use Comprehend Medical via its API to integrate these functionalities with your processing workflow. All the screenshots above are visual renderings of the structured information returned by the API in JSON format. For example, this is the result of detecting medications (RxNorm concepts):

{
    "Entities": [
        {
            "Id": 0,
            "Text": "Clonidine",
            "Category": "MEDICATION",
            "Type": "GENERIC_NAME",
            "Score": 0.9933062195777893,
            "BeginOffset": 83,
            "EndOffset": 92,
            "Attributes": [],
            "Traits": [],
            "RxNormConcepts": [
                {
                    "Description": "Clonidine",
                    "Code": "2599",
                    "Score": 0.9148101806640625
                },
                {
                    "Description": "168 HR Clonidine 0.00417 MG/HR Transdermal System",
                    "Code": "998671",
                    "Score": 0.8215734958648682
                },
                {
                    "Description": "Clonidine Hydrochloride 0.025 MG Oral Tablet",
                    "Code": "892791",
                    "Score": 0.7519310116767883
                },
                {
                    "Description": "10 ML Clonidine Hydrochloride 0.5 MG/ML Injection",
                    "Code": "884225",
                    "Score": 0.7171697020530701
                },
                {
                    "Description": "Clonidine Hydrochloride 0.2 MG Oral Tablet",
                    "Code": "884185",
                    "Score": 0.6776907444000244
                }
            ]
        },
        {
            "Id": 1,
            "Text": "Vyvanse",
            "Category": "MEDICATION",
            "Type": "BRAND_NAME",
            "Score": 0.9995427131652832,
            "BeginOffset": 148,
            "EndOffset": 155,
            "Attributes": [
                {
                    "Type": "DOSAGE",
                    "Score": 0.9910679459571838,
                    "RelationshipScore": 0.9999822378158569,
                    "Id": 2,
                    "BeginOffset": 156,
                    "EndOffset": 162,
                    "Text": "50 mgs",
                    "Traits": []
                },
                {
                    "Type": "ROUTE_OR_MODE",
                    "Score": 0.9997182488441467,
                    "RelationshipScore": 0.9993833303451538,
                    "Id": 3,
                    "BeginOffset": 163,
                    "EndOffset": 165,
                    "Text": "po",
                    "Traits": []
                },
                {
                    "Type": "FREQUENCY",
                    "Score": 0.983681321144104,
                    "RelationshipScore": 0.9999642372131348,
                    "Id": 4,
                    "BeginOffset": 166,
                    "EndOffset": 184,
                    "Text": "at breakfast daily",
                    "Traits": []
                }
            ],
            "Traits": [],
            "RxNormConcepts": [
                {
                    "Description": "lisdexamfetamine dimesylate 50 MG Oral Capsule [Vyvanse]",
                    "Code": "854852",
                    "Score": 0.8883932828903198
                },
                {
                    "Description": "lisdexamfetamine dimesylate 50 MG Chewable Tablet [Vyvanse]",
                    "Code": "1871469",
                    "Score": 0.7482635378837585
                },
                {
                    "Description": "Vyvanse",
                    "Code": "711043",
                    "Score": 0.7041242122650146
                },
                {
                    "Description": "lisdexamfetamine dimesylate 70 MG Oral Capsule [Vyvanse]",
                    "Code": "854844",
                    "Score": 0.23675969243049622
                },
                {
                    "Description": "lisdexamfetamine dimesylate 60 MG Oral Capsule [Vyvanse]",
                    "Code": "854848",
                    "Score": 0.14077001810073853
                }
            ]
        },
        {
            "Id": 5,
            "Text": "Clonidine",
            "Category": "MEDICATION",
            "Type": "GENERIC_NAME",
            "Score": 0.9982216954231262,
            "BeginOffset": 199,
            "EndOffset": 208,
            "Attributes": [
                {
                    "Type": "STRENGTH",
                    "Score": 0.7696017026901245,
                    "RelationshipScore": 0.9999960660934448,
                    "Id": 6,
                    "BeginOffset": 209,
                    "EndOffset": 216,
                    "Text": "0.2 mgs",
                    "Traits": []
                },
                {
                    "Type": "DOSAGE",
                    "Score": 0.777644693851471,
                    "RelationshipScore": 0.9999927282333374,
                    "Id": 7,
                    "BeginOffset": 220,
                    "EndOffset": 236,
                    "Text": "1 and 1 / 2 tabs",
                    "Traits": []
                },
                {
                    "Type": "ROUTE_OR_MODE",
                    "Score": 0.9981689453125,
                    "RelationshipScore": 0.999950647354126,
                    "Id": 8,
                    "BeginOffset": 237,
                    "EndOffset": 239,
                    "Text": "po",
                    "Traits": []
                },
                {
                    "Type": "FREQUENCY",
                    "Score": 0.99753737449646,
                    "RelationshipScore": 0.9999889135360718,
                    "Id": 9,
                    "BeginOffset": 240,
                    "EndOffset": 243,
                    "Text": "qhs",
                    "Traits": []
                }
            ],
            "Traits": [],
            "RxNormConcepts": [
                {
                    "Description": "Clonidine Hydrochloride 0.2 MG Oral Tablet",
                    "Code": "884185",
                    "Score": 0.9600071907043457
                },
                {
                    "Description": "Clonidine Hydrochloride 0.025 MG Oral Tablet",
                    "Code": "892791",
                    "Score": 0.8955953121185303
                },
                {
                    "Description": "24 HR Clonidine Hydrochloride 0.2 MG Extended Release Oral Tablet",
                    "Code": "885880",
                    "Score": 0.8706559538841248
                },
                {
                    "Description": "12 HR Clonidine Hydrochloride 0.2 MG Extended Release Oral Tablet",
                    "Code": "1013937",
                    "Score": 0.786146879196167
                },
                {
                    "Description": "Chlorthalidone 15 MG / Clonidine Hydrochloride 0.2 MG Oral Tablet",
                    "Code": "884198",
                    "Score": 0.601354718208313
                }
            ]
        }
    ],
    "ModelVersion": "0.0.0"
}

Similarly, this is the output when detecting medical conditions (ICD-10-CM concepts):

{
    "Entities": [
        {
            "Id": 0,
            "Text": "coronary artery disease",
            "Category": "MEDICAL_CONDITION",
            "Type": "DX_NAME",
            "Score": 0.9933860898017883,
            "BeginOffset": 90,
            "EndOffset": 113,
            "Attributes": [],
            "Traits": [
                {
                    "Name": "DIAGNOSIS",
                    "Score": 0.9682672023773193
                }
            ],
            "ICD10CMConcepts": [
                {
                    "Description": "Atherosclerotic heart disease of native coronary artery without angina pectoris",
                    "Code": "I25.10",
                    "Score": 0.8199513554573059
                },
                {
                    "Description": "Atherosclerotic heart disease of native coronary artery",
                    "Code": "I25.1",
                    "Score": 0.4950370192527771
                },
                {
                    "Description": "Old myocardial infarction",
                    "Code": "I25.2",
                    "Score": 0.18753206729888916
                },
                {
                    "Description": "Atherosclerotic heart disease of native coronary artery with unstable angina pectoris",
                    "Code": "I25.110",
                    "Score": 0.16535982489585876
                },
                {
                    "Description": "Atherosclerotic heart disease of native coronary artery with unspecified angina pectoris",
                    "Code": "I25.119",
                    "Score": 0.15222692489624023
                }
            ]
        },
        {
            "Id": 2,
            "Text": "atrial fibrillation",
            "Category": "MEDICAL_CONDITION",
            "Type": "DX_NAME",
            "Score": 0.9923409223556519,
            "BeginOffset": 116,
            "EndOffset": 135,
            "Attributes": [],
            "Traits": [
                {
                    "Name": "DIAGNOSIS",
                    "Score": 0.9708861708641052
                }
            ],
            "ICD10CMConcepts": [
                {
                    "Description": "Unspecified atrial fibrillation",
                    "Code": "I48.91",
                    "Score": 0.7011875510215759
                },
                {
                    "Description": "Chronic atrial fibrillation",
                    "Code": "I48.2",
                    "Score": 0.28612759709358215
                },
                {
                    "Description": "Paroxysmal atrial fibrillation",
                    "Code": "I48.0",
                    "Score": 0.21157972514629364
                },
                {
                    "Description": "Persistent atrial fibrillation",
                    "Code": "I48.1",
                    "Score": 0.16996538639068604
                },
                {
                    "Description": "Atrial premature depolarization",
                    "Code": "I49.1",
                    "Score": 0.16715925931930542
                }
            ]
        },
        {
            "Id": 3,
            "Text": "hypertension",
            "Category": "MEDICAL_CONDITION",
            "Type": "DX_NAME",
            "Score": 0.9993137121200562,
            "BeginOffset": 138,
            "EndOffset": 150,
            "Attributes": [],
            "Traits": [
                {
                    "Name": "DIAGNOSIS",
                    "Score": 0.9734011888504028
                }
            ],
            "ICD10CMConcepts": [
                {
                    "Description": "Essential (primary) hypertension",
                    "Code": "I10",
                    "Score": 0.6827990412712097
                },
                {
                    "Description": "Hypertensive heart disease without heart failure",
                    "Code": "I11.9",
                    "Score": 0.09846580773591995
                },
                {
                    "Description": "Hypertensive heart disease with heart failure",
                    "Code": "I11.0",
                    "Score": 0.09182810038328171
                },
                {
                    "Description": "Pulmonary hypertension, unspecified",
                    "Code": "I27.20",
                    "Score": 0.0866364985704422
                },
                {
                    "Description": "Primary pulmonary hypertension",
                    "Code": "I27.0",
                    "Score": 0.07662317156791687
                }
            ]
        },
        {
            "Id": 4,
            "Text": "hyperlipidemia",
            "Category": "MEDICAL_CONDITION",
            "Type": "DX_NAME",
            "Score": 0.9998835325241089,
            "BeginOffset": 153,
            "EndOffset": 167,
            "Attributes": [],
            "Traits": [
                {
                    "Name": "DIAGNOSIS",
                    "Score": 0.9702492356300354
                }
            ],
            "ICD10CMConcepts": [
                {
                    "Description": "Hyperlipidemia, unspecified",
                    "Code": "E78.5",
                    "Score": 0.8378056883811951
                },
                {
                    "Description": "Disorders of lipoprotein metabolism and other lipidemias",
                    "Code": "E78",
                    "Score": 0.20186281204223633
                },
                {
                    "Description": "Lipid storage disorder, unspecified",
                    "Code": "E75.6",
                    "Score": 0.18514418601989746
                },
                {
                    "Description": "Pure hyperglyceridemia",
                    "Code": "E78.1",
                    "Score": 0.1438658982515335
                },
                {
                    "Description": "Other hyperlipidemia",
                    "Code": "E78.49",
                    "Score": 0.13983778655529022
                }
            ]
        },
        {
            "Id": 5,
            "Text": "chills",
            "Category": "MEDICAL_CONDITION",
            "Type": "DX_NAME",
            "Score": 0.9989762306213379,
            "BeginOffset": 211,
            "EndOffset": 217,
            "Attributes": [],
            "Traits": [
                {
                    "Name": "SYMPTOM",
                    "Score": 0.9510533213615417
                }
            ],
            "ICD10CMConcepts": [
                {
                    "Description": "Chills (without fever)",
                    "Code": "R68.83",
                    "Score": 0.7460958361625671
                },
                {
                    "Description": "Fever, unspecified",
                    "Code": "R50.9",
                    "Score": 0.11848161369562149
                },
                {
                    "Description": "Typhus fever, unspecified",
                    "Code": "A75.9",
                    "Score": 0.07497859001159668
                },
                {
                    "Description": "Neutropenia, unspecified",
                    "Code": "D70.9",
                    "Score": 0.07332006841897964
                },
                {
                    "Description": "Lassa fever",
                    "Code": "A96.2",
                    "Score": 0.0721040666103363
                }
            ]
        },
        {
            "Id": 6,
            "Text": "nausea",
            "Category": "MEDICAL_CONDITION",
            "Type": "DX_NAME",
            "Score": 0.9993392825126648,
            "BeginOffset": 220,
            "EndOffset": 226,
            "Attributes": [],
            "Traits": [
                {
                    "Name": "SYMPTOM",
                    "Score": 0.9175007939338684
                }
            ],
            "ICD10CMConcepts": [
                {
                    "Description": "Nausea",
                    "Code": "R11.0",
                    "Score": 0.7333012819290161
                },
                {
                    "Description": "Nausea with vomiting, unspecified",
                    "Code": "R11.2",
                    "Score": 0.20183530449867249
                },
                {
                    "Description": "Hematemesis",
                    "Code": "K92.0",
                    "Score": 0.1203150525689125
                },
                {
                    "Description": "Vomiting, unspecified",
                    "Code": "R11.10",
                    "Score": 0.11658868193626404
                },
                {
                    "Description": "Nausea and vomiting",
                    "Code": "R11",
                    "Score": 0.11535880714654922
                }
            ]
        },
        {
            "Id": 8,
            "Text": "flank pain",
            "Category": "MEDICAL_CONDITION",
            "Type": "DX_NAME",
            "Score": 0.9315784573554993,
            "BeginOffset": 235,
            "EndOffset": 245,
            "Attributes": [
                {
                    "Type": "ACUITY",
                    "Score": 0.9809532761573792,
                    "RelationshipScore": 0.9999837875366211,
                    "Id": 7,
                    "BeginOffset": 229,
                    "EndOffset": 234,
                    "Text": "acute",
                    "Traits": []
                }
            ],
            "Traits": [
                {
                    "Name": "SYMPTOM",
                    "Score": 0.8182812929153442
                }
            ],
            "ICD10CMConcepts": [
                {
                    "Description": "Unspecified abdominal pain",
                    "Code": "R10.9",
                    "Score": 0.4959934949874878
                },
                {
                    "Description": "Generalized abdominal pain",
                    "Code": "R10.84",
                    "Score": 0.12332479655742645
                },
                {
                    "Description": "Lower abdominal pain, unspecified",
                    "Code": "R10.30",
                    "Score": 0.08319114148616791
                },
                {
                    "Description": "Upper abdominal pain, unspecified",
                    "Code": "R10.10",
                    "Score": 0.08275411278009415
                },
                {
                    "Description": "Jaw pain",
                    "Code": "R68.84",
                    "Score": 0.07797083258628845
                }
            ]
        },
        {
            "Id": 10,
            "Text": "numbness",
            "Category": "MEDICAL_CONDITION",
            "Type": "DX_NAME",
            "Score": 0.9659366011619568,
            "BeginOffset": 255,
            "EndOffset": 263,
            "Attributes": [
                {
                    "Type": "SYSTEM_ORGAN_SITE",
                    "Score": 0.9976192116737366,
                    "RelationshipScore": 0.9999089241027832,
                    "Id": 11,
                    "BeginOffset": 271,
                    "EndOffset": 274,
                    "Text": "leg",
                    "Traits": []
                }
            ],
            "Traits": [
                {
                    "Name": "SYMPTOM",
                    "Score": 0.7310190796852112
                }
            ],
            "ICD10CMConcepts": [
                {
                    "Description": "Anesthesia of skin",
                    "Code": "R20.0",
                    "Score": 0.767346203327179
                },
                {
                    "Description": "Paresthesia of skin",
                    "Code": "R20.2",
                    "Score": 0.13602739572525024
                },
                {
                    "Description": "Other complications of anesthesia",
                    "Code": "T88.59",
                    "Score": 0.09990577399730682
                },
                {
                    "Description": "Hypothermia following anesthesia",
                    "Code": "T88.51",
                    "Score": 0.09953102469444275
                },
                {
                    "Description": "Disorder of the skin and subcutaneous tissue, unspecified",
                    "Code": "L98.9",
                    "Score": 0.08736388385295868
                }
            ]
        }
    ],
    "ModelVersion": "0.0.0"
}
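
In code, calling these APIs with the AWS SDK for Python (boto3) takes one call per ontology; here is a minimal sketch using text similar to the examples above:

import boto3

comprehend_medical = boto3.client('comprehendmedical')

text = "Patient reports chills and nausea. Started Clonidine 0.2 mgs po qhs."

# Link medications to RxNorm concepts
rx = comprehend_medical.infer_rx_norm(Text=text)
for entity in rx['Entities']:
    if entity.get('RxNormConcepts'):
        top = entity['RxNormConcepts'][0]
        print(entity['Text'], '->', top['Code'], top['Description'], top['Score'])

# Link medical conditions to ICD-10-CM concepts
icd = comprehend_medical.infer_icd10_cm(Text=text)
for entity in icd['Entities']:
    if entity.get('ICD10CMConcepts'):
        top = entity['ICD10CMConcepts'][0]
        print(entity['Text'], '->', top['Code'], top['Description'], top['Score'])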

Available Now
You can use Amazon Comprehend Medical via the console, AWS Command Line Interface (CLI), or AWS SDKs. With Comprehend Medical, you pay only for what you use. You are charged based on the amount of text processed on a monthly basis, depending on the features you use. For more information, please see the Comprehend Medical section in the Comprehend Pricing page. Ontology Linking is available in all regions where Amazon Comprehend Medical is offered, as described in the AWS Regions Table.

The new ontology linking APIs make it easy to detect medications and medical conditions in unstructured clinical text and link them to RxNorm and ICD-10-CM codes respectively. This new feature can help you reduce the cost, time and effort of processing large amounts of unstructured medical text with high accuracy.

Danilo

New for Amazon Aurora – Use Machine Learning Directly From Your Databases

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-for-amazon-aurora-use-machine-learning-directly-from-your-databases/

Machine Learning allows you to get better insights from your data. But where is most of the structured data stored? In databases! Today, in order to use machine learning with data in a relational database, you need to develop a custom application to read the data from the database and then apply the machine learning model. Developing this application requires a mix of skills to be able to interact with the database and use machine learning. This is a new application, and now you have to manage its performance, availability, and security.

Can we make it easier to apply machine learning to data in a relational database? Even for existing applications?

Starting today, Amazon Aurora is natively integrated with two AWS machine learning services:

  • Amazon SageMaker, a service providing you with the ability to build, train, and deploy custom machine learning models quickly.
  • Amazon Comprehend, a natural language processing (NLP) service that uses machine learning to find insights in text.

Using this new functionality, you can use a SQL function in your queries to apply a machine learning model to the data in your relational database. For example, you can detect the sentiment of a user comment using Comprehend, or apply a custom machine learning model built with SageMaker to estimate the risk of “churn” for your customers. Churn is a word mixing “change” and “turn” and is used to describe customers that stop using your services.

You can store the output of a large query, including the additional information returned by the machine learning services, in a new table, or use this feature interactively in your application simply by changing the SQL code run by the clients, with no machine learning experience required.

Let’s see a couple of examples of what you can do from an Aurora database, first by using Comprehend, then SageMaker.

Configuring Database Permissions
The first step is to give the database permissions to access the services you want to use: Comprehend, SageMaker, or both. In the RDS console, I create a new Aurora MySQL 5.7 database. When it is available, in the Connectivity & security tab of the regional endpoint, I look for the Manage IAM roles section.

There I connect Comprehend and SageMaker to this database cluster. For SageMaker, I need to provide the Amazon Resource Name (ARN) of the endpoint of a deployed machine learning model; if you want to use multiple endpoints, you repeat this step for each of them. The console takes care of creating the IAM service roles that the Aurora database needs to access those services, so the new machine learning integration just works.
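If you prefer to script this step instead of using the console, the role association can also be done with the AWS SDK. Here's a minimal sketch using boto3; the cluster identifier and role ARN are placeholders to replace with your own, and since the console also takes care of related cluster configuration for you, refer to the Aurora machine learning documentation for the complete setup.

import boto3

rds = boto3.client("rds")

# Placeholder identifiers: use your own cluster name and an IAM role
# that grants access to Comprehend and/or SageMaker.
rds.add_role_to_db_cluster(
    DBClusterIdentifier="my-aurora-ml-cluster",
    RoleArn="arn:aws:iam::123456789012:role/AuroraMLAccessRole",
)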

Using Comprehend from Amazon Aurora
I connect to the database using a MySQL client. To run my tests, I create a table storing comments for a blogging platform and insert a few sample records:

CREATE TABLE IF NOT EXISTS comments (
       comment_id INT AUTO_INCREMENT PRIMARY KEY,
       comment_text VARCHAR(255) NOT NULL
);

INSERT INTO comments (comment_text)
VALUES ("This is very useful, thank you for writing it!");
INSERT INTO comments (comment_text)
VALUES ("Awesome, I was waiting for this feature.");
INSERT INTO comments (comment_text)
VALUES ("An interesting write up, please add more details.");
INSERT INTO comments (comment_text)
VALUES ("I don’t like how this was implemented.");

To detect the sentiment of the comments in my table, I can use the aws_comprehend_detect_sentiment and aws_comprehend_detect_sentiment_confidence SQL functions:

SELECT comment_text,
       aws_comprehend_detect_sentiment(comment_text, 'en') AS sentiment,
       aws_comprehend_detect_sentiment_confidence(comment_text, 'en') AS confidence
  FROM comments;

The aws_comprehend_detect_sentiment function returns the most probable sentiment for the input text: POSITIVE, NEGATIVE, NEUTRAL, or MIXED. The aws_comprehend_detect_sentiment_confidence function returns the confidence of the sentiment detection, between 0 (not confident at all) and 1 (fully confident).

Using SageMaker Endpoints from Amazon Aurora
Similarly to what I did with Comprehend, I can access a SageMaker endpoint to enrich the information stored in my database. To see a practical use case, let’s implement the customer churn example mentioned at the beginning of this post.

Mobile phone operators have historical records on which customers ultimately ended up churning and which continued using the service. We can use this historical information to construct a machine learning model. As input for the model, we're looking at the current subscription plan, how much the customer speaks on the phone at different times of day, and how often they have called customer service.

Here’s the structure of my customer table:

SHOW COLUMNS FROM customers;

To be able to identify customers at risk of churn, I train a model following this sample SageMaker notebook using the XGBoost algorithm. When the model has been created, it’s deployed to a hosted endpoint.

When the SageMaker endpoint is in service, I go back to the Manage IAM roles section of the console to give the Aurora database permissions to access the endpoint ARN.

Now, I create a new will_churn SQL function that passes the parameters required by the model as input to the endpoint:

CREATE FUNCTION will_churn (
       state varchar(2048), acc_length bigint(20),
       area_code bigint(20), int_plan varchar(2048),
       vmail_plan varchar(2048), vmail_msg bigint(20),
       day_mins double, day_calls bigint(20),
       eve_mins double, eve_calls bigint(20),
       night_mins double, night_calls bigint(20),
       int_mins double, int_calls bigint(20),
       cust_service_calls bigint(20))
RETURNS varchar(2048) CHARSET latin1
       alias aws_sagemaker_invoke_endpoint
       endpoint name 'estimate_customer_churn_endpoint_version_123';

As you can see, the model looks at the customer’s phone subscription details and service usage patterns to identify the risk of churn. Using the will_churn SQL function, I run a query over my customers table to flag customers based on my machine learning model. To store the result of the query, I create a new customers_churn table:

CREATE TABLE customers_churn AS
SELECT *, will_churn(state, acc_length, area_code, int_plan,
       vmail_plan, vmail_msg, day_mins, day_calls,
       eve_mins, eve_calls, night_mins, night_calls,
       int_mins, int_calls, cust_service_calls) will_churn
  FROM customers;

Let’s see a few records from the customers_churn table:

SELECT * FROM customers_churn LIMIT 7;

Luckily, the first seven customers are apparently not going to churn. But what happens overall? Since I stored the results of the will_churn function, I can run a SELECT statement with GROUP BY on the customers_churn table.

SELECT will_churn, COUNT(*) FROM customers_churn GROUP BY will_churn;

Starting from there, I can dive deep to understand what brings my customers to churn.

If I create a new version of my machine learning model, with a new endpoint ARN, I can recreate the will_churn function without changing my SQL statements.

Available Now
The new machine learning integration is available today for Aurora MySQL 5.7, with the SageMaker integration generally available and the Comprehend integration in preview. You can learn more in the documentation. We are working on other engines and versions: Aurora MySQL 5.6 and Aurora PostgreSQL 10 and 11 are coming soon.

The Aurora machine learning integration is available in all regions where the underlying services are available. For example, if both Aurora MySQL 5.7 and SageMaker are available in a given region, then you can use the SageMaker integration there. For a complete list of service availability, please see the AWS Regions Table.

There's no additional cost for using the integration; you just pay for the underlying services at your normal rates. Pay attention to the size of your queries when using Comprehend. For example, if you run sentiment analysis on the feedback left on your customer service web page in order to contact those who made particularly positive or negative comments, and people are making 10,000 comments a day, you'd pay $3/day. To optimize your costs, remember to store the results.
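If you want to see where that estimate comes from, here's the back-of-the-envelope arithmetic as a tiny Python sketch. The per-unit price and the 3-unit minimum per request are assumptions based on the Comprehend pricing model, so always check the current pricing page before relying on these numbers.

# Assumed pricing (verify on the Amazon Comprehend pricing page):
# sentiment analysis is billed per unit of 100 characters,
# with a 3-unit minimum per request.
price_per_unit = 0.0001      # USD, assumed
units_per_comment = 3        # short comments hit the per-request minimum
comments_per_day = 10_000

daily_cost = comments_per_day * units_per_comment * price_per_unit
print(f"Estimated daily cost: ${daily_cost:.2f}")  # $3.00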

It’s never been easier to apply machine learning models to data stored in your relational databases. Let me know what you are going to build with this!

Danilo

Introducing Batch Mode Processing for Amazon Comprehend Medical

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/introducing-batch-mode-processing-for-amazon-comprehend-medical/

Launched at AWS re:Invent 2018, Amazon Comprehend Medical is a HIPAA-eligible natural language processing service that makes it easy to use machine learning to extract relevant medical information from unstructured text.

For example, customers like Roche Diagnostics and The Fred Hutchinson Cancer Research Center can quickly and accurately extract information, such as medical condition, medication, dosage, strength, and frequency from a variety of sources like doctors’ notes, clinical trial reports, and patient health records. They can also identify protected health information (PHI) present in these documents in order to anonymize it before data exchange.

In a previous blog post, I showed you how to use the Amazon Comprehend Medical API to extract entities and detect PHI in a single document. Today, we're happy to announce that Amazon Comprehend Medical can now process batches of documents stored in an Amazon Simple Storage Service (S3) bucket. Let's do a demo!

Introducing the Batch Mode API
First, we need to grab some data to test batch mode: MT Samples is a great collection of real-life, anonymized medical transcripts that are free to use and distribute. I picked a few transcripts and converted them to the simple JSON format that Amazon Comprehend Medical expects. In a production workflow, converting documents to this format could easily be done by your application code, or by one of our analytics services such as AWS Glue.

{"Text": " VITAL SIGNS: The patient was afebrile. He is slightly tachycardic, 105,
but stable blood pressure and respiratory rate.GENERAL: The patient is in no distress.
Sitting quietly on the gurney. HEENT: Unremarkable. His oral mucosa is moist and well
hydrated. Lips and tongue look normal. Posterior pharynx is clear. NECK: Supple. His
trachea is midline.There is no stridor. LUNGS: Very clear with good breath sounds in
all fields. There is no wheezing. Good air movement in all lung fields.
CARDIAC: Without murmur. Slight tachycardia. ABDOMEN: Soft, nontender.
SKIN: Notable for a confluence erythematous, blanching rash on the torso as well
as more of a blotchy papular, macular rash on the upper arms. He noted some on his
buttocks as well. Remaining of the exam is unremarkable."}

Then, I simply upload the samples to an Amazon S3 bucket located in the same region as the service… and yes, ‘esophagogastroduodenoscopy’ is a word.
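For reference, here's a minimal sketch of what that conversion and upload could look like in application code, using boto3. The bucket name matches the one in the job summary further down, while the local file names are hypothetical.

import json
import boto3

s3 = boto3.client("s3")
bucket = "jsimon-comprehend-medical-uswest2"

# Hypothetical local transcript files to convert and upload
for filename in ["rash.txt", "chest_pain.txt"]:
    with open(filename) as f:
        transcript = f.read()
    # Wrap the raw transcript in the simple JSON format shown above
    body = json.dumps({"Text": transcript})
    key = "input/" + filename.replace(".txt", ".json")
    s3.put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))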

Now let's head to the AWS console and create an entity detection job. The rest of the process would be identical for PHI detection.


Samples are stored under the ‘input/’ prefix, and I’m expecting results under the ‘output/’ prefix. Of course, you could use different buckets if you were so inclined. Optionally, you could also use AWS Key Management Service (KMS) to encrypt output results. For the sake of brevity, I won’t set up KMS here, but you’d certainly want to consider it for production workflows.

I also need to provide a data access role in AWS Identity and Access Management (IAM), allowing Amazon Comprehend Medical to access the relevant S3 bucket(s). You can use a role that you previously set up in IAM, or use the wizard in the Amazon Comprehend Medical console. For detailed information on permissions, please refer to the documentation.

Then, I create the batch job, and wait for it to complete. After a few minutes, the job is done.
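The same job can also be launched and monitored programmatically, which is what you would do in an automated pipeline. Here's a minimal sketch using boto3; the job name and data access role ARN are placeholders, while the bucket and prefixes match the ones above.

import time
import boto3

cm = boto3.client("comprehendmedical", region_name="us-west-2")

# Launch an asynchronous entity detection job over the 'input/' prefix
job = cm.start_entities_detection_v2_job(
    JobName="mtsamples-entities",  # placeholder name
    LanguageCode="en",
    InputDataConfig={"S3Bucket": "jsimon-comprehend-medical-uswest2", "S3Key": "input/"},
    OutputDataConfig={"S3Bucket": "jsimon-comprehend-medical-uswest2", "S3Key": "output/"},
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendMedicalDataAccess",
)
job_id = job["JobId"]

# Poll until the job reaches a terminal state
while True:
    properties = cm.describe_entities_detection_v2_job(JobId=job_id)["ComprehendMedicalAsyncJobProperties"]
    if properties["JobStatus"] in ("COMPLETED", "PARTIAL_SUCCESS", "FAILED", "STOPPED"):
        break
    time.sleep(30)

print(properties["JobStatus"])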

Results are available at the output location: one output for each input, containing a JSON-formatted description of entities and their relationships.

A manifest also includes global information: number of processed documents, total amount of data, etc. Paths are edited out for clarity.

{
"Summary" : {
    "Status" : "COMPLETED",
    "JobType" : "EntitiesDetection",
    "InputDataConfiguration" : {
        "Bucket" : "jsimon-comprehend-medical-uswest2",
        "Path" : "input/"
    },
    "OutputDataConfiguration" : {
        "Bucket" : "jsimon-comprehend-medical-uswest2",
        "Path" : ...
    },
    "InputFileCount" : 4,
    "TotalMeteredCharacters" : 3636,
    "UnprocessedFilesCount" : 0,
    "SuccessfulFilesCount" : 4,
    "TotalDurationSeconds" : 366,
    "SuccessfulFilesListLocation" : ... ,
    "UnprocessedFilesListLocation" : ...
}
}

After retrieving the ‘rash.json.out‘ object from S3, I can use a JSON editor to view its contents. Here are some of the entities that have been detected.

Of course, this data is not meant to be read by humans. In a production workflow, it would be processed automatically using the Amazon Comprehend Medical APIs. Results would then be stored in an AWS backend and made available to healthcare professionals through a business application.
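As a small sketch of that kind of automated processing, the snippet below downloads one output object and lists the detected entities. It assumes the per-document output follows the same entity structure as the synchronous API response shown earlier in this archive (an Entities array with Text, Category, Type, and Score fields), and the object key is abbreviated because the actual output path is generated by the batch job.

import json
import boto3

s3 = boto3.client("s3")
bucket = "jsimon-comprehend-medical-uswest2"
key = "output/.../rash.json.out"  # the actual path is generated by the batch job

obj = s3.get_object(Bucket=bucket, Key=key)
result = json.loads(obj["Body"].read())

# Assumed structure: the same entity fields as the synchronous API response
for entity in result.get("Entities", []):
    print(f"{entity['Text']}: {entity['Category']} / {entity['Type']} (score={entity['Score']:.2f})")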

Now Available!
As you can see, it’s extremely easy to use Amazon Comprehend Medical in batch mode, even at very large scale. Zero machine learning work and zero infrastructure work required!

The service is available today in the following AWS regions:

  • US East (N. Virginia), US East (Ohio), US West (Oregon),
  • Canada (Central),
  • EU (Ireland), EU (London),
  • Asia Pacific (Sydney).

The free tier covers 25,000 units of text (2.5 million characters) for the first three months when you start using the service, either with entity extraction or with PHI detection.

As always, we’d love to hear your feedback: please post it to the AWS forum for Amazon Comprehend, or send it through your usual AWS contacts.

Julien

Intuit: Serving Millions of Global Customers with Amazon Connect

Post Syndicated from Annik Stahl original https://aws.amazon.com/blogs/architecture/intuit-serving-millions-of-global-customers-with-amazon-connect/

Recently, Bill Schuller, Intuit Contact Center Domain Architect, met with AWS's Simon Elisha to discuss how Intuit manages its customer contact centers with Amazon Connect.

As a 35-year-old company with an international customer base, Intuit is widely known as the maker of QuickBooks and TurboTax, among other software products. Its 50 million customers can access its global contact centers not just for password resets and feature explanations, but for detailed tax interpretation and advice. As you can imagine, this presents a challenge of scale.

Using Amazon Connect, a self-service, cloud-based contact center service, Intuit has been able to provide a seamless call-in experience to customers around the globe. When a customer calls in to Amazon Connect, Intuit does a “data dip” through AWS Lambda out to the company’s CRM system (in this case, Salesforce) to get more information about the customer. At this point, Intuit can leverage other services like Amazon Lex for natural language feedback and then route the customer to the right person who can help. When the call is over, instead of having that important recording locked up in a proprietary system, the audio is moved into an S3 bucket, where Intuit can do post-call processing. The recording can also be sent out to third parties for analysis, or Intuit can use Amazon Transcribe or Amazon Comprehend to get a transcription or sentiment analysis and understand more about what happened during that particular call.
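That post-call step maps to a couple of straightforward API calls. Here's a minimal sketch in Python (boto3) to illustrate the flow, not Intuit's actual implementation; the bucket, key, and job name are hypothetical, and retrieving the finished transcript from the Transcribe output is left out for brevity.

import boto3

transcribe = boto3.client("transcribe")
comprehend = boto3.client("comprehend")

# Start an asynchronous transcription of a call recording stored in S3
# (hypothetical bucket, key, and job name)
transcribe.start_transcription_job(
    TranscriptionJobName="call-12345",
    LanguageCode="en-US",
    MediaFormat="wav",
    Media={"MediaFileUri": "s3://my-call-recordings/call-12345.wav"},
)

# Once the transcript text has been retrieved from the job output...
transcript_text = "I had a great experience, the agent solved my issue quickly."
sentiment = comprehend.detect_sentiment(Text=transcript_text, LanguageCode="en")
print(sentiment["Sentiment"], sentiment["SentimentScore"])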

Watch the video below to understand the reasons why Intuit decided on this set of AWS services (hint: it has to do with the ability to experiment with speed and scale but without the cost overhead).

Check out more videos from the This Is My Architecture series.

About the author

Annik Stahl is a Senior Program Manager at AWS, specializing in blog and magazine content as well as customer ratings and satisfaction. Having been the face of Microsoft Office for 10 years as the Crabby Office Lady columnist, she loves getting to know her customers and wants to hear from you.