All posts by Julien Simon

Amazon Transcribe Now Supports Automatic Language Identification

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/amazon-transcribe-now-supports-automatic-language-identification/

In 2017, we launched Amazon Transcribe, an automatic speech recognition service that makes it easy for developers to add a speech-to-text capability to their applications. Since then, we have added support for more languages, enabling customers globally to transcribe audio recordings in 31 languages, including 6 in real time.

A popular use case for Amazon Transcribe is transcribing customer calls. This allows companies to analyze the transcribed text using natural language processing techniques to detect sentiment or to identify the most common call causes. If you operate in a country with multiple official languages or across multiple regions, your audio files can contain different languages. Thus, files have to be tagged manually with the appropriate language before transcription can take place. This typically involves setting up teams of multi-lingual speakers, which creates additional costs and delays in processing audio files.

The media and entertainment industry often uses Amazon Transcribe to convert media content into accessible and searchable text files. Use cases include generating subtitles or transcripts, moderating content, and more. Amazon Transcribe is also used by operations teams for quality control, for example checking that audio and video are in sync thanks to the timestamps present in the extracted text. However, other problems couldn’t be easily solved, such as verifying that the main spoken language in your videos is correctly labeled to avoid streaming video in the wrong language.

Today, I’m extremely happy to announce that Amazon Transcribe can now automatically identify the dominant language in an audio recording. This feature will help customers build more efficient transcription workflows by getting rid of manual tagging. In addition to the examples mentioned above, you can now also easily use Amazon Transcribe to automatically recognize and transcribe voicemails, meetings, and any form of recorded communication.

Introducing Automatic Language Identification
With a minimum of 30 seconds of audio, Amazon Transcribe can efficiently generate transcripts in the spoken language without wasting time and resources on manual tagging. Automatic identification of the dominant language is available in batch transcription mode for all 31 languages. Thanks to sampling techniques, language identification happens much faster than the transcription itself, in a matter of seconds.

If you’re already using Amazon Transcribe for speech recognition, you just need to enable the feature in the StartTranscriptionJob API. Even before your transcription job is complete, the response of the GetTranscriptionJob API will tell you the dominant language of the audio recording, along with its confidence score between 0 and 1. The transcript lists the top five languages and their respective confidence scores.

Of course, if you want to use Amazon Transcribe exclusively for automatic language identification, you can simply process the API response and ignore the transcript. In this case, you should stick to short 30-45 second audio recordings to minimize costs.

You can also restrict languages that Amazon Transcribe tries to identify, by passing a list of languages to the StartTranscriptionJob API. For example, if your company call center only receives calls in English, Spanish and French, then restricting identifiable languages to this list will increase language identification accuracy.
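As a rough sketch of how these options fit together in Python, the helper below assembles the arguments for the start_transcription_job call of the boto3 Transcribe client. The helper name build_transcription_request is hypothetical, and the job name and S3 URI are the ones used later in this post. Nothing is sent to AWS: the code only builds a dictionary.

```python
def build_transcription_request(job_name, media_uri, language_options=None):
    """Assemble keyword arguments for the boto3 start_transcription_job call.

    Hypothetical helper: it only builds a dictionary and does not call AWS.
    """
    request = {
        "TranscriptionJobName": job_name,
        "Media": {"MediaFileUri": media_uri},
        # Ask Amazon Transcribe to identify the dominant language
        "IdentifyLanguage": True,
    }
    if language_options:
        # Restrict identification to the languages we expect to hear
        request["LanguageOptions"] = list(language_options)
    return request


request = build_transcription_request(
    "speech-8k-test",
    "s3://jsimon-transcribe-uswest2/speech-8k.wav",
    language_options=["fr-FR", "es-ES", "de-DE", "en-US", "en-GB"],
)
print(request)
```

With credentials configured, the dictionary could then be passed to boto3.client('transcribe').start_transcription_job(**request).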

Now, I’d like to show you how easy it is to use this new feature!

Detecting the Dominant Language With Amazon Transcribe
First, let’s try a high quality sample. I’ll use the audio track from one of my breakout sessions at AWS Summit Paris 2019. I can easily download it using the youtube-dl tool.

$ youtube-dl -f bestaudio https://www.youtube.com/watch?v=AFN5jaTurfA
$ mv AWS\ \&\ EarthCube\ _\ Deep\ learning\ démarrer\ avec\ MXNet\ et\ Tensorflow\ en\ 10\ minutes-AFN5jaTurfA.m4a video.m4a

Using ffmpeg, I shorten the audio clip to 1 minute.

$ ffmpeg -i video.m4a -ss 00:00:00.00 -t 00:01:00.00 video-1mn.m4a

Then, I upload the clip to an Amazon Simple Storage Service (S3) bucket.

$ aws s3 cp video-1mn.m4a s3://jsimon-transcribe-uswest2/

Next, I use the AWS CLI to run a transcription job on this audio clip, with language identification enabled.

$ aws transcribe start-transcription-job --transcription-job-name video-test --identify-language --media MediaFileUri=s3://jsimon-transcribe-uswest2/video-1mn.m4a

After waiting only a few seconds, I check the status of the job. I could also use an Amazon CloudWatch event to be notified that language identification is complete.

$ aws transcribe get-transcription-job --transcription-job-name video-test
{
    "TranscriptionJob": {
        "TranscriptionJobName": "video-test",
        "TranscriptionJobStatus": "IN_PROGRESS",
        "LanguageCode": "fr-FR",
        "MediaSampleRateHertz": 44100,
        "MediaFormat": "mp4",
        "Media": {
            "MediaFileUri": "s3://jsimon-transcribe-uswest2/video-1mn.m4a"
        },
        "Transcript": {},
        "StartTime": 1593704323.312,
        "CreationTime": 1593704323.287,
        "Settings": {
            "ChannelIdentification": false,
            "ShowAlternatives": false
        },
        "IdentifyLanguage": true,
        "IdentifiedLanguageScore": 0.915885329246521
    }
}

As highlighted in the output, the dominant language has been correctly detected in seconds, with a high confidence score of 91.59%. A few more seconds later, the transcription job is complete. Running the same CLI call, I can retrieve a link to the transcription, which also includes the top 5 languages for the audio clip, sorted by decreasing score.

"language_identification":[{"score":"0.9159","code":"fr-FR"},{"score":"0.0839","code":"fr-CA"},{"score":"0.0001","code":"en-GB"},{"score":"0.0001","code":"pt-PT"},{"score":"0.0001","code":"de-CH"}]

Adding up French and Canadian French, we pretty much get a score of 100%, so there’s no doubt that this clip is in French. In some cases, you may not care for that level of detail, and you’ll see in the next example how to restrict the list of detected languages.
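If you only care about the base language, the addition above is easy to automate. Here is a minimal sketch that groups the entries of the language_identification list by the part of the code before the hyphen. The function name scores_by_base_language is hypothetical, and the sample data is copied from the transcript above.

```python
from collections import defaultdict

# Top five languages copied from the transcript shown above
language_identification = [
    {"score": "0.9159", "code": "fr-FR"},
    {"score": "0.0839", "code": "fr-CA"},
    {"score": "0.0001", "code": "en-GB"},
    {"score": "0.0001", "code": "pt-PT"},
    {"score": "0.0001", "code": "de-CH"},
]


def scores_by_base_language(entries):
    """Sum scores per base language, e.g. fr-FR and fr-CA both count as fr."""
    totals = defaultdict(float)
    for entry in entries:
        base = entry["code"].split("-")[0]
        # Scores are stored as strings in the transcript
        totals[base] += float(entry["score"])
    return dict(totals)


totals = scores_by_base_language(language_identification)
best = max(totals, key=totals.get)
print(best, round(totals[best], 4))
```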

Restricting the List of Detected Languages
As customer call transcription is a popular use case for Amazon Transcribe, here is a 40-second audio clip (WAV, 8 kHz, 16-bit resolution), where I’m reading a paragraph from the French version of the Amazon Transcribe page. As you can hear, quality is pretty awful, and I added background music (Bach-ground, actually) for good measure.

Again, I upload the clip to an S3 bucket, and I use the AWS CLI to transcribe it. This time, I restrict the list of languages to French, Spanish, German, US English, and British English.

$ aws s3 cp speech-8k.wav s3://jsimon-transcribe-uswest2/
$ aws transcribe start-transcription-job --transcription-job-name speech-8k-test --identify-language --media MediaFileUri=s3://jsimon-transcribe-uswest2/speech-8k.wav --language-options fr-FR es-ES de-DE en-US en-GB

A few seconds later, I check the status of the job.

$ aws transcribe get-transcription-job --transcription-job-name speech-8k-test
{
    "TranscriptionJob": {
        "TranscriptionJobName": "speech-8k-test",
        "TranscriptionJobStatus": "IN_PROGRESS",
        "LanguageCode": "fr-FR",
        "MediaSampleRateHertz": 8000,
        "MediaFormat": "wav",
        "Media": {
            "MediaFileUri": "s3://jsimon-transcribe-uswest2/speech-8k.wav"
        },
        "Transcript": {},
        "StartTime": 1593705151.446,
        "CreationTime": 1593705151.423,
        "Settings": {
            "ChannelIdentification": false,
            "ShowAlternatives": false
        },
        "IdentifyLanguage": true,
        "LanguageOptions": [
            "fr-FR", "es-ES", "de-DE", "en-US", "en-GB"
        ],
        "IdentifiedLanguageScore": 0.9995
    }
}

As highlighted in the output, the dominant language has been correctly detected with a very high confidence score in spite of the terrible audio quality. Restricting the list of languages certainly helps, and you should use it whenever possible.
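If you process these responses in code, you may want to accept the identified language only above a certain confidence. The sketch below reads the two relevant fields from a GetTranscriptionJob response. The function name identified_language and the 0.5 default threshold are assumptions made for illustration, and the sample response is a trimmed copy of the output above.

```python
def identified_language(response, min_score=0.5):
    """Return (language_code, score) once a language has been identified.

    Returns None when the response has no score yet or the score is below
    min_score. The 0.5 default is an arbitrary value chosen for illustration.
    """
    job = response["TranscriptionJob"]
    score = job.get("IdentifiedLanguageScore")
    if score is None or score < min_score:
        return None
    return job["LanguageCode"], score


# Trimmed copy of the response shown above
response = {
    "TranscriptionJob": {
        "TranscriptionJobName": "speech-8k-test",
        "TranscriptionJobStatus": "IN_PROGRESS",
        "LanguageCode": "fr-FR",
        "IdentifyLanguage": True,
        "IdentifiedLanguageScore": 0.9995,
    }
}
print(identified_language(response))
```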

Getting Started
Automatic Language Identification is available today in these regions:

  • US East (N. Virginia), US East (Ohio), US West (N. California), US West (Oregon), AWS GovCloud (US-West).
  • Canada (Central).
  • South America (São Paulo).
  • Europe (Ireland), Europe (London), Europe (Paris), Europe (Frankfurt).
  • Middle East (Bahrain).
  • Asia Pacific (Hong Kong), Asia Pacific (Mumbai), Asia Pacific (Tokyo), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney).

There is no additional charge on top of the existing pricing. Give it a try, and please send us feedback either through your usual AWS Support contacts, or on the AWS Forum for Amazon Transcribe.

– Julien

Amazon ECS Now Supports EC2 Inf1 Instances

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/amazon-ecs-now-supports-ec2-inf1-instances/

As machine learning and deep learning models become more sophisticated, hardware acceleration is increasingly required to deliver fast predictions at high throughput. Today, we’re very happy to announce that AWS customers can now use the Amazon EC2 Inf1 instances on Amazon ECS, for high performance and the lowest prediction cost in the cloud. For a few weeks now, these instances have also been available on Amazon Elastic Kubernetes Service.

A primer on EC2 Inf1 instances
Inf1 instances were launched at AWS re:Invent 2019. They are powered by AWS Inferentia, a custom chip built from the ground up by AWS to accelerate machine learning inference workloads.

Inf1 instances are available in multiple sizes, with 1, 4, or 16 AWS Inferentia chips, with up to 100 Gbps network bandwidth and up to 19 Gbps EBS bandwidth. An AWS Inferentia chip contains four NeuronCores. Each one implements a high-performance systolic array matrix multiply engine, which massively speeds up typical deep learning operations such as convolution and transformers. NeuronCores are also equipped with a large on-chip cache, which helps cut down on external memory accesses, saving I/O time in the process. When several AWS Inferentia chips are available on an Inf1 instance, you can partition a model across them and store it entirely in cache memory. Alternatively, to serve multi-model predictions from a single Inf1 instance, you can partition the NeuronCores of an AWS Inferentia chip across several models.

Compiling Models for EC2 Inf1 Instances
To run machine learning models on Inf1 instances, you need to compile them to a hardware-optimized representation using the AWS Neuron SDK. All tools are readily available on the AWS Deep Learning AMI, and you can also install them on your own instances. You’ll find instructions in the Deep Learning AMI documentation, as well as tutorials for TensorFlow, PyTorch, and Apache MXNet in the AWS Neuron SDK repository.

In the demo below, I will show you how to deploy a Neuron-optimized model on an ECS cluster of Inf1 instances, and how to serve predictions with TensorFlow Serving. The model in question is BERT, a state-of-the-art model for natural language processing tasks. This is a huge model with hundreds of millions of parameters, making it a great candidate for hardware acceleration.

Creating an Amazon ECS Cluster
Creating a cluster is the simplest thing: all it takes is a call to the CreateCluster API.

$ aws ecs create-cluster --cluster-name ecs-inf1-demo

Immediately, I see the new cluster in the console.

New cluster

Several prerequisites are required before we can add instances to this cluster:

  • An AWS Identity and Access Management (IAM) role for ECS instances: if you don’t have one already, you can find instructions in the documentation. Here, my role is named ecsInstanceRole.
  • An Amazon Machine Image (AMI) containing the ECS agent and supporting Inf1 instances. You could build your own, or use the ECS-optimized AMI for Inferentia. In the us-east-1 region, its id is ami-04450f16e0cd20356.
  • A Security Group, opening network ports for TensorFlow Serving (8500 for gRPC, 8501 for HTTP). The identifier for mine is sg-0994f5c7ebbb48270.
  • If you’d like to have SSH access, your Security Group should also open port 22, and you should pass the name of an SSH key pair. Mine is called admin.

We also need to create a small user data file in order to let instances join our cluster. This is achieved by storing the name of the cluster in an environment variable, itself written to the configuration file of the ECS agent.

#!/bin/bash
echo ECS_CLUSTER=ecs-inf1-demo >> /etc/ecs/ecs.config

We’re all set. Let’s add a couple of Inf1 instances with the RunInstances API. To minimize cost, we’ll request Spot Instances.

$ aws ec2 run-instances \
--image-id ami-04450f16e0cd20356 \
--count 2 \
--instance-type inf1.xlarge \
--instance-market-options '{"MarketType":"spot"}' \
--tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=ecs-inf1-demo}]' \
--key-name admin \
--security-group-ids sg-0994f5c7ebbb48270 \
--iam-instance-profile Name=ecsInstanceRole \
--user-data file://user-data.txt
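For readers who prefer Python, here is a sketch of the same request expressed as arguments for the boto3 EC2 run_instances call. The helper name build_run_instances_request is hypothetical, and the identifiers are the ones from my account listed above. The code only builds a dictionary, so it can be inspected without launching anything.

```python
def build_run_instances_request(cluster_name, image_id, security_group_id,
                                key_name, instance_profile, count=2):
    """Assemble keyword arguments for the boto3 EC2 run_instances call."""
    # User data that registers the instance with the ECS cluster at boot
    user_data = "#!/bin/bash\necho ECS_CLUSTER={} >> /etc/ecs/ecs.config\n".format(cluster_name)
    return {
        "ImageId": image_id,
        "MinCount": count,
        "MaxCount": count,
        "InstanceType": "inf1.xlarge",
        # Request Spot Instances to minimize cost
        "InstanceMarketOptions": {"MarketType": "spot"},
        "TagSpecifications": [{
            "ResourceType": "instance",
            "Tags": [{"Key": "Name", "Value": cluster_name}],
        }],
        "KeyName": key_name,
        "SecurityGroupIds": [security_group_id],
        "IamInstanceProfile": {"Name": instance_profile},
        "UserData": user_data,
    }


request = build_run_instances_request(
    "ecs-inf1-demo", "ami-04450f16e0cd20356", "sg-0994f5c7ebbb48270",
    "admin", "ecsInstanceRole",
)
print(request["InstanceType"], request["MinCount"])
```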

Both instances appear right away in the EC2 console.

Inf1 instances

A couple of minutes later, they’re ready to run tasks on the cluster.

Inf1 instances

Our infrastructure is ready. Now, let’s build a container storing our BERT model.

Building a Container for Inf1 Instances
The Dockerfile is pretty straightforward:

  • Starting from an Amazon Linux 2 image, we open ports 8500 and 8501 for TensorFlow Serving.
  • Then, we add the Neuron SDK repository to the list of repositories, and we install a version of TensorFlow Serving that supports AWS Inferentia.
  • Finally, we copy our BERT model inside the container, and we load it at startup.

Here is the complete file.

FROM amazonlinux:2
EXPOSE 8500 8501
RUN echo $'[neuron] \n\
name=Neuron YUM Repository \n\
baseurl=https://yum.repos.neuron.amazonaws.com \n\
enabled=1' > /etc/yum.repos.d/neuron.repo
RUN rpm --import https://yum.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB
RUN yum install -y tensorflow-model-server-neuron
COPY bert /bert
CMD ["/bin/sh", "-c", "/usr/local/bin/tensorflow_model_server_neuron --port=8500 --rest_api_port=8501 --model_name=bert --model_base_path=/bert/"]

Then, I build and push the container to a repository hosted in Amazon Elastic Container Registry. Business as usual.

$ docker build -t neuron-tensorflow-inference .

$ aws ecr create-repository --repository-name ecs-inf1-demo

$ aws ecr get-login-password | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com

$ docker tag neuron-tensorflow-inference 123456789012.dkr.ecr.us-east-1.amazonaws.com/ecs-inf1-demo:latest

$ docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/ecs-inf1-demo:latest

Now, we need to create a task definition in order to run this container on our cluster.

Creating a Task Definition for Inf1 Instances
If you don’t have one already, you should first create an execution role, i.e. a role allowing the ECS agent to perform API calls on your behalf. You can find more information in the documentation. Mine is called ecsTaskExecutionRole.

The full task definition is visible below. As you can see, it holds two containers:

  • The BERT container that I built,
  • A sidecar container called neuron-rtd, that allows the BERT container to access NeuronCores present on the Inf1 instance. The AWS_NEURON_VISIBLE_DEVICES environment variable lets you control which ones may be used by the container. You could use it to pin a container on one or several specific NeuronCores.

{
  "family": "ecs-neuron",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "containerDefinitions": [
    {
      "entryPoint": [
        "sh",
        "-c"
      ],
      "portMappings": [
        {
          "hostPort": 8500,
          "protocol": "tcp",
          "containerPort": 8500
        },
        {
          "hostPort": 8501,
          "protocol": "tcp",
          "containerPort": 8501
        },
        {
          "hostPort": 0,
          "protocol": "tcp",
          "containerPort": 80
        }
      ],
      "command": [
        "tensorflow_model_server_neuron --port=8500 --rest_api_port=8501 --model_name=bert --model_base_path=/bert"
      ],
      "cpu": 0,
      "environment": [
        {
          "name": "NEURON_RTD_ADDRESS",
          "value": "unix:/sock/neuron-rtd.sock"
        }
      ],
      "mountPoints": [
        {
          "containerPath": "/sock",
          "sourceVolume": "sock"
        }
      ],
      "memoryReservation": 1000,
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/ecs-inf1-demo:latest",
      "essential": true,
      "name": "bert"
    },
    {
      "entryPoint": [
        "sh",
        "-c"
      ],
      "portMappings": [],
      "command": [
        "neuron-rtd -g unix:/sock/neuron-rtd.sock"
      ],
      "cpu": 0,
      "environment": [
        {
          "name": "AWS_NEURON_VISIBLE_DEVICES",
          "value": "ALL"
        }
      ],
      "mountPoints": [
        {
          "containerPath": "/sock",
          "sourceVolume": "sock"
        }
      ],
      "memoryReservation": 1000,
      "image": "790709498068.dkr.ecr.us-east-1.amazonaws.com/neuron-rtd:latest",
      "essential": true,
      "linuxParameters": { "capabilities": { "add": ["SYS_ADMIN", "IPC_LOCK"] } },
      "name": "neuron-rtd"
    }
  ],
  "volumes": [
    {
      "name": "sock",
      "host": {
        "sourcePath": "/tmp/sock"
      }
    }
  ]
}
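The two containers talk to each other over the Unix socket stored in the shared sock volume, so both of them must mount it. A quick sanity check could look like the sketch below. The function name shared_volumes is hypothetical, and the dictionary is a trimmed copy of the task definition above.

```python
def shared_volumes(task_definition):
    """Return the names of volumes mounted by every container in the task."""
    mounted = [
        {mount["sourceVolume"] for mount in container.get("mountPoints", [])}
        for container in task_definition["containerDefinitions"]
    ]
    return set.intersection(*mounted) if mounted else set()


# Trimmed copy of the task definition above
task_definition = {
    "family": "ecs-neuron",
    "containerDefinitions": [
        {"name": "bert",
         "mountPoints": [{"containerPath": "/sock", "sourceVolume": "sock"}]},
        {"name": "neuron-rtd",
         "mountPoints": [{"containerPath": "/sock", "sourceVolume": "sock"}]},
    ],
    "volumes": [{"name": "sock", "host": {"sourcePath": "/tmp/sock"}}],
}
print(shared_volumes(task_definition))
```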

Finally, I call the RegisterTaskDefinition API to let the ECS backend know about it.

$ aws ecs register-task-definition --cli-input-json file://inf1-task-definition.json

We’re now ready to run our container, and predict with it.

Running a Container on Inf1 Instances
As this is a prediction service, I want to make sure that it’s always available on the cluster. Instead of simply running a task, I create an ECS Service that will make sure the required number of container copies is running, relaunching them should any failure happen.

$ aws ecs create-service --cluster ecs-inf1-demo \
--service-name bert-inf1 \
--task-definition ecs-neuron:1 \
--desired-count 1

A minute later, I see that both task containers are running on the cluster.

Running containers

Predicting with BERT on ECS and Inf1
The inner workings of BERT are beyond the scope of this post. This particular model expects a sequence of 128 tokens, encoding the words of two sentences we’d like to compare for semantic equivalence.

Here, I’m only interested in measuring prediction latency, so dummy data is fine. I build 100 prediction requests storing a sequence of 128 zeros. Using the IP address of the BERT container, I send them to the TensorFlow Serving endpoint via gRPC, and I compute the average prediction time.

Here is the full code.

import time

import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc

if __name__ == '__main__':
    # Connect to the gRPC port of TensorFlow Serving on the BERT container
    channel = grpc.insecure_channel('18.234.61.31:8500')
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

    # Build a single request holding a dummy sequence of 128 zeros
    request = predict_pb2.PredictRequest()
    request.model_spec.name = 'bert'
    zeros = np.zeros([1, 128], dtype=np.int32)
    for name in ('input_ids', 'input_mask', 'segment_ids'):
        request.inputs[name].CopyFrom(tf.make_tensor_proto(zeros, shape=zeros.shape))

    # Send the request 100 times and record the latency of each call
    latencies = []
    for n in range(100):
        start = time.time()
        result = stub.Predict(request)
        latencies.append(time.time() - start)
        print("Inference successful: {}".format(n))
    print("Ran {} inferences successfully. Latency average = {}".format(len(latencies), np.average(latencies)))

For convenience, I’m running this code on an EC2 instance based on the Deep Learning AMI. It comes pre-installed with a Conda environment for TensorFlow and TensorFlow Serving, saving me from installing any dependencies.

$ source activate tensorflow_p36
$ python predict.py

On average, prediction took 56.5ms. As far as BERT goes, this is pretty good!

Ran 100 inferences successfully. Latency average = 0.05647835493087769
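An average can hide a slow first request or an occasional outlier, so you may want to look at a few more statistics on the recorded latencies. This is only a sketch: the function name summarize_latencies is hypothetical, and the sample values are made up for illustration. They are not measurements from this demo.

```python
import statistics


def summarize_latencies(latencies):
    """Return mean, median and worst-case latency in milliseconds."""
    ordered = sorted(latencies)
    return {
        "mean_ms": 1000 * statistics.mean(ordered),
        "median_ms": 1000 * statistics.median(ordered),
        "max_ms": 1000 * ordered[-1],
    }


# Made-up sample values in seconds, for illustration only
sample = [0.055, 0.056, 0.057, 0.058, 0.080]
print(summarize_latencies(sample))
```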

Getting Started
You can deploy Amazon Elastic Compute Cloud (EC2) Inf1 instances on Amazon ECS today in the US East (N. Virginia) and US West (Oregon) regions. As Inf1 deployment progresses, you’ll be able to use them with Amazon ECS in more regions.

Give this a try, and please send us feedback either through your usual AWS Support contacts, on the AWS Forum for Amazon ECS, or on the container roadmap on GitHub.

– Julien

Amazon Translate now supports Office documents

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/amazon-translate-now-supports-office-documents/

Whether your organization is a multinational enterprise present in many countries, or a small startup hungry for global success, translating your content to local languages may be an enduring challenge. Indeed, text data often comes in many formats, and processing them may require several different tools. Also, as all these tools may not support the same language pairs, you may have to convert certain documents to intermediate formats, or even resort to manual translation. All these issues add extra cost, and create unnecessary complexity in building consistent and automated translation workflows.

Amazon Translate aims to solve these problems in a simple and cost-effective fashion. Using either the AWS console or a single API call, Amazon Translate makes it easy for AWS customers to quickly and accurately translate text in 55 different languages and variants.

Earlier this year, Amazon Translate introduced batch translation for plain text and HTML documents. Today, I’m very happy to announce that batch translation now also supports Office documents, namely .docx, .xlsx and .pptx files as defined by the Office Open XML standard.

Introducing Amazon Translate for Office Documents
The process is extremely simple. As you would expect, source documents have to be stored in an Amazon Simple Storage Service (S3) bucket. Please note that no document may be larger than 20 Megabytes, or have more than 1 million characters.

Each batch translation job processes a single file type and a single source language. Thus, we recommend that you organize your documents in a logical fashion in S3, storing each file type and each language under its own prefix.
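One possible way to apply this recommendation is to derive the S3 key of each document from its extension and its source language. The sketch below is only an illustration: the function name input_key and the file type/language/file name layout are hypothetical conventions, not something Translate requires. The dictionary lists the standard MIME types of the three Office Open XML formats.

```python
import os

# MIME types for the Office Open XML formats
CONTENT_TYPES = {
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}


def input_key(filename, source_language):
    """Build an S3 key of the form <file type>/<language>/<file name>.

    The layout is a hypothetical convention, not something Translate requires.
    """
    extension = os.path.splitext(filename)[1].lower()
    if extension not in CONTENT_TYPES:
        raise ValueError("Unsupported file type: {}".format(filename))
    return "{}/{}/{}".format(extension.lstrip("."), source_language,
                             os.path.basename(filename))


print(input_key("blog_post.docx", "en"))
```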

Then, using either the AWS console or the StartTextTranslationJob API in one of the AWS language SDKs, you can launch a translation job, passing:

  • the input and output location in S3,
  • the file type,
  • the source and target languages.

Once the job is complete, you can collect translated files at the output location.
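To make the parameters listed above more concrete, here is a sketch that assembles the arguments for the start_text_translation_job call of the boto3 Translate client. The helper name build_translation_job, the bucket name, and the role ARN are placeholders that I made up for illustration. The code only builds a dictionary and does not call AWS.

```python
# MIME type of .docx files
DOCX = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"


def build_translation_job(job_name, input_uri, output_uri, role_arn,
                          source_language, target_language, content_type=DOCX):
    """Assemble keyword arguments for the boto3 start_text_translation_job call."""
    return {
        "JobName": job_name,
        # Input location in S3 and file type
        "InputDataConfig": {"S3Uri": input_uri, "ContentType": content_type},
        # Output location in S3
        "OutputDataConfig": {"S3Uri": output_uri},
        # IAM role that lets Translate read and write the S3 locations
        "DataAccessRoleArn": role_arn,
        "SourceLanguageCode": source_language,
        "TargetLanguageCodes": [target_language],
    }


# Placeholder bucket and role, for illustration only
job = build_translation_job(
    "docx-en-to-fr",
    "s3://my-bucket/docx/en/",
    "s3://my-bucket/output/",
    "arn:aws:iam::123456789012:role/my-translate-role",
    "en",
    "fr",
)
print(job["InputDataConfig"])
```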

Let’s do a quick demo!

Translating Office Documents
Using the S3 console, I first upload a few .docx documents to one of my buckets.

S3 files

Then, moving to the Translate console, I create a new batch translation job, giving it a name, and selecting both the source and target languages.

Creating a batch job

Then, I define the location of my documents in S3, and their format, .docx in this case. Optionally, I could apply a custom terminology, to make sure specific words are translated exactly the way that I want.

Likewise, I define the output location for translated files. Please make sure that this path exists, as Translate will not create it for you.

Creating a batch job

Finally, I set the AWS Identity and Access Management (IAM) role, giving my Translate job the appropriate permissions to access S3. Here, I use an existing role that I created previously, and you can also let Translate create one for you. Then, I click on ‘Create job’ to launch the batch job.

Creating a batch job

The job starts immediately.

Batch job running

A little while later, the job is complete. All three documents have been translated successfully.

Viewing a completed job

Translated files are available at the output location, as visible in the S3 console.

Viewing translated files

Downloading one of the translated files, I can open it and compare it to the original version.

Comparing files

For small scale use, it’s extremely easy to use the AWS console to translate Office files. Of course, you can also use the Translate API to build automated workflows.

Automating Batch Translation
In a previous post, we showed you how to automate batch translation with an AWS Lambda function. You could expand on this example, and add language detection with Amazon Comprehend. For instance, here’s how you could combine the DetectDominantLanguage API with the python-docx open source library to detect the language of .docx files.

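import boto3
from docx import Document

# Read the first paragraph of the document as a text sample
document = Document('blog_post.docx')
text = document.paragraphs[0].text

# Ask Amazon Comprehend for the dominant language of the sample
comprehend = boto3.client('comprehend')
response = comprehend.detect_dominant_language(Text=text)

# Print the code and score of the first language in the response
top_language = response['Languages'][0]
code = top_language['LanguageCode']
score = top_language['Score']
print("%s, %f" % (code, score))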

Pretty simple! You could also detect the type of each file based on its extension, and move it to the proper input location in S3. Then, you could schedule a Lambda function with CloudWatch Events to periodically translate files, and send a notification by email. Of course, you could use AWS Step Functions to build more elaborate workflows. Your imagination is the limit!

Getting Started
You can start translating Office documents today in the following regions: US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Ireland), Europe (London), Europe (Frankfurt), and Asia Pacific (Seoul).

If you’ve never tried Amazon Translate, did you know that the free tier offers 2 million characters per month for the first 12 months, starting from your first translation request?

Give it a try, and let us know what you think. We’re looking forward to your feedback: please post it to the AWS Forum for Amazon Translate, or send it to your usual AWS support contacts.

– Julien

New – Label Videos with Amazon SageMaker Ground Truth

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/new-label-videos-with-amazon-sagemaker-ground-truth/

Launched at AWS re:Invent 2018, Amazon SageMaker Ground Truth is a capability of Amazon SageMaker that makes it easy to annotate machine learning datasets. Customers can efficiently and accurately label image, text and 3D point cloud data with built-in workflows, or any other type of data with custom workflows. Data samples are automatically distributed to a workforce (private, 3rd party or MTurk), and annotations are stored in Amazon Simple Storage Service (S3). Optionally, automated data labeling may also be enabled, reducing both the amount of time required to label the dataset, and the associated costs.

As models become more sophisticated, AWS customers are increasingly applying machine learning prediction to video content. Autonomous driving is perhaps the most well-known use case, as safety demands that road conditions and moving objects be correctly detected and tracked in real time. Video prediction is also a popular application in sports, tracking players or racing vehicles to compute all kinds of statistics that fans are so fond of. Healthcare organizations also use video prediction to identify and track anatomical objects in medical videos. Manufacturing companies do the same to track objects on the assembly line, parcels for logistics, and more. The list goes on, and amazing applications keep popping up in many different industries.

Of course, this requires building and labeling video datasets, where objects of interest need to be labeled manually. At 30 frames per second, one minute of video translates to 1,800 individual images, so the amount of work can quickly become overwhelming. In addition, specific tools have to be built to label images, manage workflows, and so on. All this work takes valuable time and resources away from an organization’s core business.

AWS customers have asked us for a better solution, and today I’m very happy to announce that Amazon SageMaker Ground Truth now supports video labeling.

Customer use case: the National Football League
The National Football League (NFL) has already put this new feature to work. Says Jennifer Langton, SVP of Player Health and Innovation, NFL: “At the National Football League (NFL), we continue to look for new ways to use machine learning (ML) to help our fans, broadcasters, coaches, and teams benefit from deeper insights. Building these capabilities requires large amounts of accurately labeled training data. Amazon SageMaker Ground Truth was truly a force multiplier in accelerating our project timelines. We leveraged the new video object tracking workflow in addition to other existing computer vision (CV) labeling workflows to develop labels for training a computer vision system that tracks all 22 players as they move on the field during plays. Amazon SageMaker Ground Truth reduced the timeline for developing a high quality labeling dataset by more than 80%”.

Courtesy of the NFL, here are a couple of predicted frames, showing helmet detection in a Seattle Seahawks video. This particular video has 353 frames. This first picture is frame #100.

Object tracking

This second picture is frame #110.

Object tracking

Introducing Video Labeling
With the addition of video task types, customers can now use Amazon SageMaker Ground Truth for:

  • Video clip classification
  • Video multi-frame object detection
  • Video multi-frame object tracking

The multi-frame task types support multiple labels, so that you may label different object classes present in the video frames. You can create labeling jobs to annotate frames from scratch, as well as adjustment jobs to review and fine-tune frames that have already been labeled. These jobs may be distributed either to a private workforce, or to a vendor workforce you picked on AWS Marketplace.

Using the built-in GUI, workers can then easily label and track objects across frames. Once they’ve annotated a frame, they can use an assistive labeling feature to predict the location of bounding boxes in the next frame, as you will see in the demo below. This significantly simplifies labeling work, saves time, and improves the quality of annotations. Last but not least, work is saved automatically.

Preparing Input Data for Video Object Detection and Tracking
As you would expect, input data must be located in S3. You may bring either video files, or sequences of video frames.

The first option is the simplest, as Amazon SageMaker Ground Truth includes a tool that automatically extracts frames from your video files. Optionally, you can sample frames (1 in ‘n’), in order to reduce the amount of labeling work. The extraction tool also builds a manifest file describing sequences and frames. You can learn more about it in the documentation.
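To get a feel for how much labeling work sampling can save, here is a small back-of-the-envelope sketch. The function name sampled_frames is hypothetical, and it simply keeps every n-th frame starting from the first one. The frames actually selected by the Ground Truth extraction tool may differ.

```python
def sampled_frames(total_frames, n):
    """Return the 1-based numbers of the frames kept when sampling 1 in n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return list(range(1, total_frames + 1, n))


frames_per_minute = 30 * 60  # 1,800 frames at 30 frames per second
kept = sampled_frames(frames_per_minute, 10)
print(len(kept))
```

Sampling one frame in ten brings one minute of video down from 1,800 frames to 180.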

The second option requires two steps: extracting frames, and building the manifest file. Extracting frames can easily be performed with the popular ffmpeg open source tool. Here’s how you could convert the first 60 seconds of a video to a frame sequence.

$ ffmpeg -ss 00:00:00.00 -t 00:01:00.00 -i basketball.mp4 frame%04d.jpg

Each frame sequence should be uploaded to S3 under a different prefix, for example s3://my-bucket/my-videos/sequence1, s3://my-bucket/my-videos/sequence2, and so on, as explained in the documentation.

Once you have uploaded your frame sequences, you may then either bring your own JSON files to describe them, or let Ground Truth crawl your sequences and build the JSON files and the manifest file for you automatically. Please note that a video sequence cannot be longer than 2,000 frames, which corresponds to about a minute of video at 30 frames per second.
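If your video is longer than that, you could split its frames into several sequences before uploading them. Here is a minimal sketch. The function name split_into_sequences is hypothetical, and the frame names follow the pattern of the ffmpeg command above.

```python
MAX_FRAMES_PER_SEQUENCE = 2000  # limit quoted above


def split_into_sequences(frame_names, max_frames=MAX_FRAMES_PER_SEQUENCE):
    """Split an ordered list of frame file names into chunks of at most max_frames."""
    return [frame_names[start:start + max_frames]
            for start in range(0, len(frame_names), max_frames)]


# Three minutes of video at 30 frames per second
frames = ["frame{:04d}.jpg".format(i) for i in range(1, 5401)]
sequences = split_into_sequences(frames)
print([len(sequence) for sequence in sequences])
```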

Each sequence should be described by a simple sequence file:

  • A sequence number, an S3 prefix, and a number of frames.
  • A list of frames: number, file name, and creation timestamp.

Here’s an example of a sequence file.

{"version": "2020-06-01",
"seq-no": 1, "prefix": "s3://jsimon-smgt/videos/basketball", "number-of-frames": 1800, 
	"frames": [
		{"frame-no": 1, "frame": "frame0001.jpg", "unix-timestamp": 1594111541.71155},
		{"frame-no": 2, "frame": "frame0002.jpg", "unix-timestamp": 1594111541.711552},
		{"frame-no": 3, "frame": "frame0003.jpg", "unix-timestamp": 1594111541.711553},
		{"frame-no": 4, "frame": "frame0004.jpg", "unix-timestamp": 1594111541.711555},
. . .

Finally, the manifest file should point at the sequence files you’d like to include in the labeling job. Here’s an example.

{"source-ref": "s3://jsimon-smgt/videos/seq1.json"}
{"source-ref": "s3://jsimon-smgt/videos/seq2.json"}
. . .
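
If you prefer to build these files yourself, the two formats above are easy to generate. The sketch below is only an illustration: the field names are copied from the examples above, while the function names, bucket name, and timestamps are made up. Please refer to the documentation for the authoritative schema.

```python
import json
from typing import Dict, List


def build_sequence_file(seq_no: int, prefix: str, frame_names: List[str],
                        timestamps: List[float]) -> Dict:
    """Describe one frame sequence, mirroring the fields of the example above."""
    if len(frame_names) != len(timestamps):
        raise ValueError("one timestamp is required per frame")
    return {
        "version": "2020-06-01",
        "seq-no": seq_no,
        "prefix": prefix,
        "number-of-frames": len(frame_names),
        "frames": [
            {"frame-no": n, "frame": name, "unix-timestamp": ts}
            for n, (name, ts) in enumerate(zip(frame_names, timestamps), start=1)
        ],
    }


def build_manifest(sequence_file_uris: List[str]) -> str:
    """One JSON object per line, each pointing at a sequence file."""
    return "\n".join(json.dumps({"source-ref": uri}) for uri in sequence_file_uris)


# Hypothetical bucket and timestamps, for illustration only.
seq = build_sequence_file(
    1, "s3://my-bucket/my-videos/sequence1",
    ["frame0001.jpg", "frame0002.jpg"],
    [1594111541.0, 1594111541.033],
)
manifest = build_manifest([
    "s3://my-bucket/my-videos/seq1.json",
    "s3://my-bucket/my-videos/seq2.json",
])
print(json.dumps(seq, indent=2))
print(manifest)
```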

Just like for other task types, the augmented manifest is available in S3 once labeling is complete. It contains annotations and labels, which you can then feed to your machine learning training job.

Labeling Videos with Amazon SageMaker Ground Truth
Here’s a sample video where I label the first ten frames of a sequence. You can see a screenshot below.

I first use the Ground Truth GUI to carefully label the first frame, drawing bounding boxes for basketballs and basketball players. Then, I use the “Predict next” assistive labeling tool to predict the location of the boxes in the next nine frames, applying only minor adjustments to some boxes. Although this was my first try, I found the process easy and intuitive. With a little practice, I could certainly go much faster!

Getting Started
Now, it’s your turn. You can start labeling videos with Amazon SageMaker Ground Truth today in the following regions:

  • US East (N. Virginia), US East (Ohio), US West (Oregon),
  • Canada (Central),
  • Europe (Ireland), Europe (London), Europe (Frankfurt),
  • Asia Pacific (Mumbai), Asia Pacific (Singapore), Asia Pacific (Seoul), Asia Pacific (Sydney), Asia Pacific (Tokyo).

We’re looking forward to reading your feedback. You can send it through your usual support contacts, or in the AWS Forum for Amazon SageMaker.

– Julien

Amazon EKS Now Supports EC2 Inf1 Instances

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/amazon-eks-now-supports-ec2-inf1-instances/

Amazon Elastic Kubernetes Service (EKS) has quickly become a leading choice for machine learning workloads. It combines the developer agility and the scalability of Kubernetes, with the wide selection of Amazon Elastic Compute Cloud (EC2) instance types available on AWS, such as the C5, P3, and G4 families.

As models become more sophisticated, hardware acceleration is increasingly required to deliver fast predictions at high throughput. Today, we’re very happy to announce that AWS customers can now use the Amazon EC2 Inf1 instances on Amazon Elastic Kubernetes Service, for high performance and the lowest prediction cost in the cloud.

A primer on EC2 Inf1 instances
Inf1 instances were launched at AWS re:Invent 2019. They are powered by AWS Inferentia, a custom chip built from the ground up by AWS to accelerate machine learning inference workloads.

Inf1 instances are available in multiple sizes, with 1, 4, or 16 AWS Inferentia chips, with up to 100 Gbps network bandwidth and up to 19 Gbps EBS bandwidth. An AWS Inferentia chip contains four NeuronCores. Each one implements a high-performance systolic array matrix multiply engine, which massively speeds up typical deep learning operations such as convolution and transformers. NeuronCores are also equipped with a large on-chip cache, which helps cut down on external memory accesses, saving I/O time in the process. When several AWS Inferentia chips are available on an Inf1 instance, you can partition a model across them and store it entirely in cache memory. Alternatively, to serve multi-model predictions from a single Inf1 instance, you can partition the NeuronCores of an AWS Inferentia chip across several models.

Compiling Models for EC2 Inf1 Instances
To run machine learning models on Inf1 instances, you need to compile them to a hardware-optimized representation using the AWS Neuron SDK. All tools are readily available on the AWS Deep Learning AMI, and you can also install them on your own instances. You’ll find instructions in the Deep Learning AMI documentation, as well as tutorials for TensorFlow, PyTorch, and Apache MXNet in the AWS Neuron SDK repository.

In the demo below, I will show you how to deploy a Neuron-optimized model on an EKS cluster of Inf1 instances, and how to serve predictions with TensorFlow Serving. The model in question is BERT, a state-of-the-art model for natural language processing tasks. This is a huge model with hundreds of millions of parameters, making it a great candidate for hardware acceleration.

Building an EKS Cluster of EC2 Inf1 Instances
First of all, let’s build a cluster with two inf1.2xlarge instances. I can easily do this with eksctl, the command-line tool to provision and manage EKS clusters. You can find installation instructions in the EKS documentation.

Here is the configuration file for my cluster. Eksctl detects that I’m launching a node group with an Inf1 instance type, and starts the worker nodes using the EKS-optimized Accelerated AMI.

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: cluster-inf1
  region: us-west-2
nodeGroups:
  - name: ng1-public
    instanceType: inf1.2xlarge
    minSize: 0
    maxSize: 3
    desiredCapacity: 2
    ssh:
      allow: true

Then, I use eksctl to create the cluster. This process will take approximately 10 minutes.

$ eksctl create cluster -f inf1-cluster.yaml

Eksctl automatically installs the Neuron device plugin in the cluster. This plugin advertises Neuron devices to the Kubernetes scheduler, so that containers can request them in a deployment spec. I can check with kubectl that the device plugin container is running fine on both Inf1 instances.

$ kubectl get pods -n kube-system
NAME                                  READY STATUS  RESTARTS AGE
aws-node-tl5xv                        1/1   Running 0        14h
aws-node-wk6qm                        1/1   Running 0        14h
coredns-86d5cbb4bd-4fxrh              1/1   Running 0        14h
coredns-86d5cbb4bd-sts7g              1/1   Running 0        14h
kube-proxy-7px8d                      1/1   Running 0        14h
kube-proxy-zqvtc                      1/1   Running 0        14h
neuron-device-plugin-daemonset-888j4  1/1   Running 0        14h
neuron-device-plugin-daemonset-tq9kc  1/1   Running 0        14h

Next, I define AWS credentials in a Kubernetes secret. They will allow me to grab my BERT model stored in S3. Please note that both keys need to be base64-encoded.

apiVersion: v1 
kind: Secret 
metadata: 
  name: aws-s3-secret 
type: Opaque 
data: 
  AWS_ACCESS_KEY_ID: <base64-encoded value> 
  AWS_SECRET_ACCESS_KEY: <base64-encoded value>
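
If you’re unsure how to produce the base64-encoded values, here is a small Python sketch. The helper name and the key shown are placeholders, not real credentials. Encoding the string directly also avoids accidentally including a trailing newline, which can happen when piping the output of a plain echo into a base64 tool.

```python
import base64


def to_secret_value(raw: str) -> str:
    """Return the base64 encoding expected in the 'data' section of a Kubernetes secret."""
    return base64.b64encode(raw.encode("utf-8")).decode("ascii")


# Placeholder value for illustration only, not a real access key.
access_key_id = "AKIAEXAMPLEKEYID"
print(to_secret_value(access_key_id))
```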

Finally, I store these credentials on the cluster.

$ kubectl apply -f secret.yaml

The cluster is correctly set up. Now, let’s build an application container storing a Neuron-enabled version of TensorFlow Serving.

Building an Application Container for TensorFlow Serving
The Dockerfile is very simple. We start from an Amazon Linux 2 base image. Then, we install the AWS CLI, and the TensorFlow Serving package available in the Neuron repository.

FROM amazonlinux:2
RUN yum install -y awscli
RUN echo $'[neuron] \n\
name=Neuron YUM Repository \n\
baseurl=https://yum.repos.neuron.amazonaws.com \n\
enabled=1' > /etc/yum.repos.d/neuron.repo
RUN rpm --import https://yum.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB
RUN yum install -y tensorflow-model-server-neuron

I build the image, create an Amazon Elastic Container Registry repository, and push the image to it.

$ docker build . -f Dockerfile -t tensorflow-model-server-neuron
$ docker tag tensorflow-model-server-neuron 123456789012.dkr.ecr.us-west-2.amazonaws.com/inf1-demo
$ aws ecr create-repository --repository-name inf1-demo
$ docker push 123456789012.dkr.ecr.us-west-2.amazonaws.com/inf1-demo

Our application container is ready. Now, let’s define a Kubernetes service that will use this container to serve BERT predictions. I’m using a model that has already been compiled with the Neuron SDK. You can compile your own using the instructions available in the Neuron SDK repository.

Deploying BERT as a Kubernetes Service
The deployment manages two containers: the Neuron runtime container, and my application container. The Neuron runtime runs as a sidecar container, and is used to interact with the AWS Inferentia chips. At startup, the application container configures the AWS CLI with the appropriate security credentials. Then, it fetches the BERT model from S3. Finally, it launches TensorFlow Serving, loading the BERT model and waiting for prediction requests. For this purpose, the HTTP and gRPC ports are open. Here is the full manifest.

kind: Service
apiVersion: v1
metadata:
  name: eks-neuron-test
  labels:
    app: eks-neuron-test
spec:
  ports:
  - name: http-tf-serving
    port: 8500
    targetPort: 8500
  - name: grpc-tf-serving
    port: 9000
    targetPort: 9000
  selector:
    app: eks-neuron-test
    role: master
  type: ClusterIP
---
kind: Deployment
apiVersion: apps/v1
metadata:
  name: eks-neuron-test
  labels:
    app: eks-neuron-test
    role: master
spec:
  replicas: 2
  selector:
    matchLabels:
      app: eks-neuron-test
      role: master
  template:
    metadata:
      labels:
        app: eks-neuron-test
        role: master
    spec:
      volumes:
        - name: sock
          emptyDir: {}
      containers:
      - name: eks-neuron-test
        image: 123456789012.dkr.ecr.us-west-2.amazonaws.com/inf1-demo:latest
        command: ["/bin/sh","-c"]
        args:
          - "mkdir ~/.aws/ && \
           echo '[eks-test-profile]' > ~/.aws/credentials && \
           echo AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID >> ~/.aws/credentials && \
           echo AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY >> ~/.aws/credentials; \
           /usr/bin/aws --profile eks-test-profile s3 sync s3://jsimon-inf1-demo/bert /tmp/bert && \
           /usr/local/bin/tensorflow_model_server_neuron --port=9000 --rest_api_port=8500 --model_name=bert_mrpc_hc_gelus_b4_l24_0926_02 --model_base_path=/tmp/bert/"
        ports:
        - containerPort: 8500
        - containerPort: 9000
        imagePullPolicy: Always
        env:
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              key: AWS_ACCESS_KEY_ID
              name: aws-s3-secret
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              key: AWS_SECRET_ACCESS_KEY
              name: aws-s3-secret
        - name: NEURON_RTD_ADDRESS
          value: unix:/sock/neuron.sock

        resources:
          limits:
            cpu: 4
            memory: 4Gi
          requests:
            cpu: "1"
            memory: 1Gi
        volumeMounts:
          - name: sock
            mountPath: /sock

      - name: neuron-rtd
        image: 790709498068.dkr.ecr.us-west-2.amazonaws.com/neuron-rtd:1.0.6905.0
        securityContext:
          capabilities:
            add:
            - SYS_ADMIN
            - IPC_LOCK

        volumeMounts:
          - name: sock
            mountPath: /sock
        resources:
          limits:
            hugepages-2Mi: 256Mi
            aws.amazon.com/neuron: 1
          requests:
            memory: 1024Mi

I use kubectl to create the service.

$ kubectl create -f bert_service.yml

A few seconds later, the pods are up and running.

$ kubectl get pods
NAME                              READY STATUS  RESTARTS AGE
eks-neuron-test-5d59b55986-7kdml  2/2   Running 0        14h
eks-neuron-test-5d59b55986-gljlq  2/2   Running 0        14h

Finally, I redirect service port 9000 to local port 9000, to let my prediction client connect locally.

$ kubectl port-forward svc/eks-neuron-test 9000:9000 &

Now, everything is ready for prediction, so let’s invoke the model.

Predicting with BERT on EKS and Inf1
The inner workings of BERT are beyond the scope of this post. This particular model expects a sequence of 128 tokens, encoding the words of two sentences we’d like to compare for semantic equivalence.

Here, I’m only interested in measuring prediction latency, so dummy data is fine. I build a prediction request storing a sequence of 128 zeros, send it 100 times to the TensorFlow Serving endpoint via gRPC, and compute the average prediction time.

import time

import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc

if __name__ == '__main__':
    # Connect to the TensorFlow Serving gRPC port forwarded by kubectl
    channel = grpc.insecure_channel('localhost:9000')
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

    # Build one request storing a dummy sequence of 128 zeros for each model input
    request = predict_pb2.PredictRequest()
    request.model_spec.name = 'bert_mrpc_hc_gelus_b4_l24_0926_02'
    dummy = np.zeros([1, 128], dtype=np.int32)
    for name in ('input_ids', 'input_mask', 'segment_ids'):
        request.inputs[name].CopyFrom(tf.make_tensor_proto(dummy, shape=dummy.shape))

    # Send the request 100 times, recording each round-trip time in seconds
    latencies = []
    for i in range(100):
        start = time.time()
        result = stub.Predict(request)
        latencies.append(time.time() - start)
        print("Inference successful: {}".format(i))
    print("Ran {} inferences successfully. Latency average = {}".format(len(latencies), np.average(latencies)))

On average, prediction took 59.2ms. As far as BERT goes, this is pretty good!

Ran 100 inferences successfully. Latency average = 0.05920819044113159
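
The average printed above is expressed in seconds, since it’s computed from time.time() differences. When reporting latency, milliseconds and percentiles are usually easier to read. Here is a minimal sketch of such a summary. The helper name is hypothetical, and the nearest-rank percentile is one simple choice among several.

```python
import math
import statistics
from typing import Dict, List


def summarize_latencies(latencies_s: List[float]) -> Dict[str, float]:
    """Summarize latencies measured in seconds as milliseconds."""
    ordered = sorted(latencies_s)
    # Nearest-rank 99th percentile: smallest value with at least 99% of samples at or below it
    rank = max(1, math.ceil(0.99 * len(ordered)))
    return {
        "mean_ms": statistics.mean(ordered) * 1000.0,
        "median_ms": statistics.median(ordered) * 1000.0,
        "p99_ms": ordered[rank - 1] * 1000.0,
    }


# The average reported by the script above, converted to milliseconds.
reported_average_s = 0.05920819044113159
print(round(reported_average_s * 1000.0, 1))  # prints 59.2
```

In the script above, you could call summarize_latencies(latencies) right after the loop.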

In real life, we would certainly be batching prediction requests in order to increase throughput. If needed, we could also scale to larger Inf1 instances supporting several Inferentia chips, and deliver even more prediction performance at low cost.

Getting Started
Kubernetes users can deploy Amazon Elastic Compute Cloud (EC2) Inf1 instances on Amazon Elastic Kubernetes Service today in the US East (N. Virginia) and US West (Oregon) regions. As Inf1 deployment progresses, you’ll be able to use them with Amazon Elastic Kubernetes Service in more regions.

Give this a try, and please send us feedback either through your usual AWS Support contacts, on the AWS Forum for Amazon Elastic Kubernetes Service, or on the container roadmap on GitHub.

– Julien

New – Label 3D Point Clouds with Amazon SageMaker Ground Truth

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/new-label-3d-point-clouds-with-amazon-sagemaker-ground-truth/

Launched at AWS re:Invent 2018, Amazon SageMaker Ground Truth is a capability of Amazon SageMaker that makes it easy to annotate machine learning datasets. Customers can efficiently and accurately label image and text data with built-in workflows, or any other type of data with custom workflows. Data samples are automatically distributed to a workforce (private, third-party or MTurk), and annotations are stored in Amazon Simple Storage Service (S3). Optionally, automated data labeling may also be enabled, reducing both the amount of time required to label the dataset, and the associated costs.

About a year ago, I met with automotive customers who expressed interest in labeling 3-dimensional (3D) datasets for autonomous driving. Captured by LIDAR sensors, these datasets are particularly large and complex. Data is stored in frames that typically contain 50,000 to 5 million points, and can weigh up to hundreds of megabytes each. Frames are either stored individually, or in sequences that make it easier to track moving objects.

As you can imagine, labeling these datasets is extremely time-consuming, as workers need to navigate complex 3D scenes and annotate many different object classes. This often requires building and managing very complex tools. Always looking to help customers build simpler and more efficient workflows, the Ground Truth team gathered more feedback, and got to work.

Today, I’m extremely happy to announce that you can use Amazon SageMaker Ground Truth to label 3D point clouds using a built-in editor, and state-of-the-art assistive labeling features.

Introducing 3D Point Cloud Labeling
Just like for other Ground Truth task types, input data for 3D point clouds has to be stored in an S3 bucket. It also needs to be described by a manifest file, a JSON file containing both the location of frames in S3 and their attributes. A dataset may contain either single-frame data, or multi-frame sequences.

Optionally, the dataset may also include image data captured by on-board cameras. Using a feature called “sensor fusion”, Ground Truth can synchronize a 3D point cloud with up to 8 cameras. Thanks to this, workers get a real-life view of the scene, and they can also interchangeably apply labels to 2D images and 3D point clouds.

Once the manifest file is ready, Ground Truth lets you create the following task types:

  • Object Detection: identify objects of interest within a 3D point cloud frame.
  • Object Tracking: track objects of interest across a sequence of 3D point cloud frames.
  • Semantic Segmentation: segment the points of a 3D point cloud frame into predefined categories.

These can either be labeling jobs where workers annotate new frames, or adjustment jobs where they review and fine-tune existing annotations. Jobs may be distributed either to a private workforce or to a vendor workforce you picked on AWS Marketplace.

Using the built-in graphical user interface (GUI) and its shortcuts for navigation and labeling, workers can quickly and accurately apply labels, boxes and categories to 3D objects (“car”, “pedestrian”, and so on). They can also add user-defined attributes, such as the color of a car, or whether an object is fully or partially visible.

The GUI includes many assistive labeling features that significantly simplify labeling work, save time, and improve the quality of annotations. Here are a few examples:

  • Snapping: Ground Truth infers a tight-fitting box around the object.
  • Interpolation: the labeler annotates an object in the first and last frames of a sequence. Ground Truth automatically annotates it in the middle frames.
  • Ground detection and removal: Ground Truth can automatically detect and remove 3D points belonging to the ground from object boxes.
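
To give an intuition for interpolation, here is a minimal Python sketch that linearly interpolates the center of a box between two annotated frames. This is only an illustration: the post doesn’t describe how Ground Truth computes the middle frames, so the linear model and the function name are assumptions.

```python
from typing import Dict, Tuple

Center = Tuple[float, ...]


def interpolate_centers(first_frame: int, first_center: Center,
                        last_frame: int, last_center: Center) -> Dict[int, Center]:
    """Linearly interpolate a box center for the frames strictly between two keyframes."""
    if last_frame <= first_frame:
        raise ValueError("last_frame must come after first_frame")
    span = last_frame - first_frame
    centers = {}
    for frame in range(first_frame + 1, last_frame):
        t = (frame - first_frame) / span  # 0 at the first keyframe, 1 at the last
        centers[frame] = tuple(a + t * (b - a)
                               for a, b in zip(first_center, last_center))
    return centers


# A box annotated in frames 1 and 10, moving 9 meters along the x axis.
middle = interpolate_centers(1, (0.0, 0.0, 0.0), 10, (9.0, 0.0, 0.0))
print(len(middle), "middle frames:", sorted(middle))
```

A real tool would also interpolate the dimensions and the orientation of the box, which this sketch leaves out.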

Even with assistive labeling, it may take a while to annotate complex frames and sequences, so work is saved periodically to avoid any data loss.

Preparing 3D Point Cloud Datasets
As previously mentioned, you have to provide a manifest file describing your 3D dataset. The format of this file is defined in the Ground Truth documentation. Of course, the steps required to build it will vary from one dataset to the next. For example, the Audi A2D2 dataset contains almost 400,000 frames, with 360-degree 3D LIDAR data and 2D images. KITTI, another popular choice for autonomous driving research, includes a 3D dataset with 15,000 images and their corresponding point clouds, for a total of 80,256 labeled objects. This notebook shows you how to convert KITTI data to the Ground Truth format.

When datasets contain both 3D LIDAR data and 2D camera images, one challenge is to synchronize them. This allows us to project 3D points to 2D coordinates, map them on the pictures captured by on-board cameras, and vice versa. Another challenge is that data captured by a given device uses coordinates local to this device. Fortunately, we know where the device is located on the car, and where it’s pointed to. All of this can be solved by building a global coordinate system, also known as a World Coordinate System (WCS). Using matrix operations (which I’ll spare you), we can compute the coordinates of all data points inside the WCS.
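
For the curious, here is a deliberately simplified Python sketch of such a transformation. It assumes the sensor is only rotated around the vertical axis (yaw) and shifted by its mounting position. A real pipeline would use a full 3D rotation and the pose of the vehicle, typically expressed as 4x4 matrices. The function name and the numbers are hypothetical.

```python
import math
from typing import Tuple

Point = Tuple[float, float, float]


def sensor_to_world(point: Point, yaw_rad: float, sensor_position: Point) -> Point:
    """Rotate a sensor-local point around the z axis, then translate it by the sensor position."""
    x, y, z = point
    cos_yaw, sin_yaw = math.cos(yaw_rad), math.sin(yaw_rad)
    tx, ty, tz = sensor_position
    return (cos_yaw * x - sin_yaw * y + tx,
            sin_yaw * x + cos_yaw * y + ty,
            z + tz)


# A point one meter in front of a sensor that is rotated by 90 degrees
# and located at (10, 5, 0) in the world coordinate system.
world = sensor_to_world((1.0, 0.0, 0.0), math.radians(90), (10.0, 5.0, 0.0))
print(tuple(round(c, 6) for c in world))
```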

Once frames have been processed, their information is saved in the manifest file: the position of the vehicle, the location of LIDAR data in S3, the location of associated pictures in S3, and so on. For large datasets, the whole process is a significant workload, and you could run it on a managed service such as Amazon SageMaker Processing, Amazon EMR or AWS Glue.

Labeling 3D Point Clouds with Amazon SageMaker Ground Truth
Let’s do a quick demo, based on this notebook. Starting from pre-processed sample frames, it streamlines the process of creating a 3D point cloud labeling job for each of the six task types (Object Detection, Object Tracking, Semantic Segmentation, and the associated adjustment task types). You can easily make yourself a private worker, and start labeling frames with the worker GUI and its labeling tools.

A picture is worth a thousand words, and a video even more! In this first video, I annotate a couple of cars using two assistive labeling features. First, I fit the box to the ground, which helps me capture object points that are close to the ground without actually capturing the ground itself. Second, I fit the box to the object, which ensures a tight fit without any blank space.

In this second video, I annotate a third car using the same technique. It’s quite a bit harder to “see” than the previous ones, but I still manage to fit a tight box around it. Playing the next nine frames, I see that this car is actually moving. Jumping directly to the tenth frame, I adjust the bounding box to the new location of the car. Ground Truth automatically labels the eight middle frames, another assistive labeling feature called interpolation.

I’ve barely scratched the surface, and there’s plenty more to learn. Now it’s your turn!

Getting Started
You can start labeling 3D point clouds with Amazon SageMaker Ground Truth today in the following regions:

  • US East (N. Virginia), US East (Ohio), US West (Oregon),
  • Canada (Central),
  • Europe (Ireland), Europe (London), Europe (Frankfurt),
  • Asia Pacific (Mumbai), Asia Pacific (Singapore), Asia Pacific (Seoul), Asia Pacific (Sydney), Asia Pacific (Tokyo).

We’re looking forward to reading your feedback. You can send it through your usual support contacts, or in the AWS Forum for Amazon SageMaker.

– Julien

Reinventing Enterprise Search – Amazon Kendra is Now Generally Available

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/reinventing-enterprise-search-amazon-kendra-is-now-generally-available/

At the end of 2019, we launched a preview version of Amazon Kendra, a highly accurate and easy to use enterprise search service powered by machine learning. Today, I’m very happy to announce that Amazon Kendra is now generally available.

For all its amazing achievements in past decades, Information Technology had yet to solve a problem that all of us face every day: quickly and easily finding the information we need. Whether we’re looking for the latest version of the company travel policy, or asking a more technical question like “what’s the tensile strength of epoxy adhesives?”, we never seem to be able to get the correct answer right away. Sometimes, we never get it at all!

Not only are these issues frustrating for users, they’re also responsible for major productivity losses. According to an IDC study, the cost of inefficient search is $5,700 per employee per year: for a 1,000-employee company, $5.7 million evaporates every year, not counting the liability and compliance risks imposed by low-accuracy search.

This problem has several causes. First, most enterprise data is unstructured, making it difficult to pinpoint the information you need. Second, data is often spread across organizational silos, and stored in heterogeneous backends: network shares, relational databases, third-party applications, and more. Lastly, keyword-based search systems require figuring out the right combination of keywords, and usually return a large number of hits, most of them irrelevant to our query.

Taking note of these pain points, we decided to help our customers build the search capabilities that they deserve. The result of this effort is Amazon Kendra.

Introducing Amazon Kendra
With just a few clicks, Amazon Kendra enables organizations to index structured and unstructured data stored in different backends, such as file systems, applications, Intranet, and relational databases. As you would expect, all data is encrypted in flight using HTTPS, and can be encrypted at rest with AWS Key Management Service (KMS).

Amazon Kendra is optimized to understand complex language from domains like IT (e.g. “How do I set up my VPN?”), healthcare and life sciences (e.g. “What is the genetic marker for ALS?”), and many other domains. This multi-domain expertise allows Kendra to find more accurate answers. In addition, developers can explicitly tune the relevance of results, using criteria such as authoritative data sources or document freshness.

Kendra search can be quickly deployed to any application (search page, chat apps, chatbots, etc.) via the code samples available in the AWS console, or via APIs. Customers can be up and running with state-of-the-art semantic search from Kendra in minutes.

Many organizations already use Amazon Kendra today. For example, the Allen Institute is committed to solving some of the biggest mysteries of bioscience, researching the unknown of human biology, in the brain, the human cell and the immune system. Says Dr. Oren Etzioni, Chief Executive Officer of the Allen Institute for AI: “One of the most impactful things AI like Amazon Kendra can do right now is help scientists, academics, and technologists quickly find the right information in a sea of scientific literature and move important research faster. The Semantic Scholar team at Allen Institute for AI, along with our partners, is proud to provide CORD-19 and to support the AI resources the community is building to leverage this resource to tackle this crucial problem”.

Introducing New Features in Amazon Kendra
Based on customer feedback collected during the preview phase, we added the following features to Amazon Kendra.

  • New scaling options for the Enterprise Edition, as well as a newly-introduced Developer Edition (see details below).
  • 3 new Cloud Connectors: OneDrive, Salesforce, and ServiceNow (in addition to S3, RDS, and SharePoint Online).
  • Expertise on 8 new domains: Automotive, Health, HR, Legal, Media and Entertainment, News, Telecom, Travel and Leisure (in addition to Chemical, Energy, Finance, Insurance, IT, and Pharmaceuticals).
  • Faster indexing, and improved accuracy.

Indexing Data with Amazon Kendra
For the purpose of this demo, I downloaded a small subset of Wikipedia (about 50,000 web pages). I uploaded the individual files in HTML format to an Amazon Simple Storage Service (S3) bucket.

Heading out to the Kendra console, I start by creating a new index, giving it a name and a description. One click is all it takes to enable encryption with AWS Key Management Service (KMS).

After 30 minutes or so, the index is in service. I can now add data sources to it.

Adding my S3 bucket is extremely easy. I first enter a name for the data source.

Then, I define the name of the S3 bucket. I also need to specify the name of the IAM role used by Kendra, either selecting an existing role or creating a new one.

I’m given the choice to schedule synchronization at periodic intervals, in order to refresh the index with new data added to the data source. I go for a daily refresh running at midnight.

The next screen lets me review all parameters, and create the data source. Once it’s active, I launch the initial synchronization by clicking on the “Sync now” button.

After a little while, synchronization is complete. Moving to the test window, I can now start running queries on the index.

Querying Data with Amazon Kendra
While working on one of my posts the other day, I listened to a Jazz song that I really liked, played by a musician named Thad Jones. Knowing absolutely nothing about Jazz players, I’m curious whether Kendra can help me learn more.

Unsurprisingly, this query matches a large number of documents. However, Kendra comes up with a suggested answer, a high confidence answer to my query. It points at a specific paragraph in one of the indexed pages. Relevant content is highlighted for more convenience, and I can immediately see that this is the right answer to my query. No need to look any further! Accordingly, I give it a thumbs up so that Amazon Kendra knows that this is indeed a good answer.

Looking to learn more about Thad Jones, I ask a second question.

Once again, I get a suggested answer. This time, Kendra went one step further by returning the exact answer from the document, instead of just returning the document itself. This shows how Kendra is able to understand context and extract relationships, in this case the link between an individual and their city of birth.

Still curious, I ask a third question.

I get another suggested answer, and it’s once again right on target. The information I’m looking for is in the first sentence: Thad Jones has played with Count Basie. As you can see, the paragraph above doesn’t even include the word “play”. Yet, Amazon Kendra interpreted my question correctly. Thad Jones is a musician: if I’m asking about him playing with someone else, it’s very likely that I’m looking for other musicians, not for sport partners! This ability to understand natural language queries and to extract deep domain knowledge is what makes Amazon Kendra so accurate.
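
The same kind of query can also be sent programmatically. Here is a minimal sketch using the Query API in boto3. The index identifier is a placeholder, the helper names are hypothetical, and the sample response is a hand-written stand-in that only mimics the fields used here. I’m also assuming that the suggested answers shown in the console correspond to result items of type ANSWER.

```python
from typing import Dict, List


def suggested_answers(response: Dict) -> List[str]:
    """Extract the excerpt text of every result item of type ANSWER."""
    return [item["DocumentExcerpt"]["Text"]
            for item in response.get("ResultItems", [])
            if item.get("Type") == "ANSWER"]


def query_index(index_id: str, question: str) -> Dict:
    """Send a natural language question to a Kendra index (requires AWS credentials)."""
    import boto3  # imported here so that the parsing helper works without the SDK
    kendra = boto3.client("kendra")
    return kendra.query(IndexId=index_id, QueryText=question)


# Hand-written stand-in for a response, with placeholder text.
sample_response = {
    "ResultItems": [
        {"Type": "ANSWER", "DocumentExcerpt": {"Text": "placeholder answer excerpt"}},
        {"Type": "DOCUMENT", "DocumentExcerpt": {"Text": "placeholder document excerpt"}},
    ]
}
print(suggested_answers(sample_response))
```

With a real index, you would call suggested_answers(query_index("<your index id>", "<your question>")) instead.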

Getting Started
Amazon Kendra is available today in US East (N. Virginia), US West (Oregon), and Europe (Ireland).

You can pick one of two editions.

The Enterprise Edition lets you search up to 500,000 documents, and run up to 40,000 queries per day for $7 per hour. You will also be charged $0.000001 per document scanned, and $0.35 per hour per connector when syncing. If you need more indexing or querying capacity, you can now scale each independently: $3.50 per hour for an additional 40,000 queries, and $3.50 per hour for an additional 500,000 searchable documents.

The Developer Edition has the same features as the Enterprise Edition. However, it’s limited to 4,000 queries per day, on up to 10,000 searchable documents across 5 data sources. No scaling options are available. Please note that the Developer Edition runs on a single availability zone, which is why it shouldn’t be used for production purposes.
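
To get a feel for the Enterprise Edition prices quoted above, here is a small back-of-the-envelope calculation in Python. It only covers the hourly charges and leaves out document scanning and connector syncing. The 30-day month and the function names are assumptions made for illustration.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: a 30-day month

BASE_HOURLY_USD = 7.0                  # Enterprise Edition base price
EXTRA_QUERY_UNIT_HOURLY_USD = 3.5      # each additional 40,000 queries
EXTRA_DOCUMENT_UNIT_HOURLY_USD = 3.5   # each additional 500,000 searchable documents


def enterprise_hourly_cost(extra_query_units: int = 0,
                           extra_document_units: int = 0) -> float:
    """Hourly cost of an Enterprise Edition index with optional extra capacity."""
    return (BASE_HOURLY_USD
            + extra_query_units * EXTRA_QUERY_UNIT_HOURLY_USD
            + extra_document_units * EXTRA_DOCUMENT_UNIT_HOURLY_USD)


def enterprise_monthly_cost(extra_query_units: int = 0,
                            extra_document_units: int = 0) -> float:
    """Monthly cost, excluding document scanning and connector syncing charges."""
    return enterprise_hourly_cost(extra_query_units, extra_document_units) * HOURS_PER_MONTH


print(enterprise_monthly_cost())      # base capacity only
print(enterprise_monthly_cost(1, 2))  # one extra query unit, two extra document units
```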

Please give Amazon Kendra a try! We’d love to get your feedback, either through your usual AWS Support contacts, or on the AWS Forum for Kendra.

– Julien

Announcing TorchServe, An Open Source Model Server for PyTorch

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/announcing-torchserve-an-open-source-model-server-for-pytorch/

PyTorch is one of the most popular open source libraries for deep learning. Developers and researchers particularly enjoy the flexibility it gives them in building and training models. Yet, this is only half the story, and deploying and managing models in production is often the most difficult part of the machine learning process: building bespoke prediction APIs, scaling them, securing them, etc.

One way to simplify the model deployment process is to use a model server, i.e. an off-the-shelf web application specially designed to serve machine learning predictions in production. Model servers make it easy to load one or several models, automatically creating a prediction API backed by a scalable web server. They’re also able to run preprocessing and postprocessing code on prediction requests. Last but not least, model servers also provide production-critical features like logging, monitoring, and security. Popular model servers include TensorFlow Serving and the Multi Model Server.

Today, I’m extremely happy to announce TorchServe, a PyTorch model serving library that makes it easy to deploy trained PyTorch models at scale without having to write custom code.

Introducing TorchServe
TorchServe is a collaboration between AWS and Facebook, and it’s available as part of the PyTorch open source project. If you’re interested in how the project was initiated, you can read the initial RFC on GitHub.

With TorchServe, PyTorch users can now bring their models to production quicker, without having to write custom code: on top of providing a low latency prediction API, TorchServe embeds default handlers for the most common applications such as object detection and text classification. In addition, TorchServe includes multi-model serving, model versioning for A/B testing, monitoring metrics, and RESTful endpoints for application integration. As you would expect, TorchServe supports any machine learning environment, including Amazon SageMaker, container services, and Amazon Elastic Compute Cloud (EC2).

Several customers are already enjoying the benefits of TorchServe.

Toyota Research Institute Advanced Development, Inc. (TRI-AD) is developing software for automated driving at Toyota Motor Corporation. Says Yusuke Yachide, Lead of ML Tools at TRI-AD: “we continuously optimize and improve our computer vision models, which are critical to TRI-AD’s mission of achieving safe mobility for all with autonomous driving. Our models are trained with PyTorch on AWS, but until now PyTorch lacked a model serving framework. As a result, we spent significant engineering effort in creating and maintaining software for deploying PyTorch models to our fleet of vehicles and cloud servers. With TorchServe, we now have a performant and lightweight model server that is officially supported and maintained by AWS and the PyTorch community”.

Matroid is a maker of computer vision software that detects objects and events in video footage. Says Reza Zadeh, Founder and CEO at Matroid Inc.: “we develop a rapidly growing number of machine learning models using PyTorch on AWS and on-premise environments. The models are deployed using a custom model server that requires converting the models to a different format, which is time-consuming and burdensome. TorchServe allows us to simplify model deployment using a single servable file that also serves as the single source of truth, and is easy to share and manage”.

Now, I’d like to show you how to install TorchServe, and load a pretrained model on Amazon Elastic Compute Cloud (EC2). You can try other environments by following the documentation.

Installing TorchServe
First, I fire up a CPU-based Amazon Elastic Compute Cloud (EC2) instance running the Deep Learning AMI (Ubuntu edition). This AMI comes preinstalled with several dependencies that I’ll need, which will speed up setup. Of course you could use any AMI instead.

TorchServe is implemented in Java, and I need the latest OpenJDK to run it.

sudo apt install openjdk-11-jdk

Next, I create and activate a new Conda environment for TorchServe. This will keep my Python packages nice and tidy (virtualenv works too, of course).

conda create -n torchserve

source activate torchserve

Next, I install dependencies for TorchServe.

pip install sentencepiece       # not available as a Conda package

conda install psutil pytorch torchvision torchtext -c pytorch

If you’re using a GPU instance, you’ll need an extra package.

conda install cudatoolkit=10.1

Now that dependencies are installed, I can clone the TorchServe repository, and install TorchServe.

git clone https://github.com/pytorch/serve.git

cd serve

pip install .

cd model-archiver

pip install .

Setup is complete. Let’s deploy a model!

Deploying a Model
For the sake of this demo, I’ll simply download a pretrained model from the PyTorch model zoo. In real life, you would probably use your own model.

wget https://download.pytorch.org/models/densenet161-8d451a50.pth

Next, I need to package the model into a model archive. A model archive is a ZIP file storing all model artefacts, i.e. the model itself (densenet161-8d451a50.pth), a Python script to load the state dictionary (matching tensors to layers), and any extra file you may need. Here, I include a file named index_to_name.json, which maps class identifiers to class names. This will be used by the built-in image_classifier handler, which is in charge of the prediction logic. Other built-in handlers are available (object_detector, text_classifier, image_segmenter), and you can implement your own.

torch-model-archiver --model-name densenet161 --version 1.0 \
--model-file examples/image_classifier/densenet_161/model.py \
--serialized-file densenet161-8d451a50.pth \
--extra-files examples/image_classifier/index_to_name.json \
--handler image_classifier

Next, I create a directory to store model archives, and I move the one I just created there.

mkdir model_store

mv densenet161.mar model_store/
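
Since a model archive is a ZIP file, you can peek inside it with Python’s standard zipfile module. This is just a quick sanity check of my own, not a TorchServe tool, and the helper name is hypothetical.

```python
import zipfile


def list_model_archive(path):
    """Return the sorted names of the files stored in a model archive."""
    with zipfile.ZipFile(path) as archive:
        return sorted(archive.namelist())


# Example usage, assuming the archive created above:
# for name in list_model_archive("model_store/densenet161.mar"):
#     print(name)
```

I would expect to find the serialized weights, the model script and index_to_name.json in the listing, possibly next to metadata added by the archiver.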

Now, I can start TorchServe, pointing it at the model store and at the model I want to load. Of course, I could load several models if needed.

torchserve --start --model-store model_store --models densenet161=densenet161.mar

Still on the same machine, I grab an image and easily send it to TorchServe for local serving using an HTTP POST request. Note the format of the URL, which includes the name of the model I want to use.

curl -O https://s3.amazonaws.com/model-server/inputs/kitten.jpg

curl -X POST http://127.0.0.1:8080/predictions/densenet161 -T kitten.jpg

The result appears immediately. Note that class names are visible, thanks to the built-in handler.

[
{"tiger_cat": 0.4693356156349182},
{"tabby": 0.46338796615600586},
{"Egyptian_cat": 0.06456131488084793},
{"lynx": 0.0012828155886381865},
{"plastic_bag": 0.00023323005007114261}
]
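
The response is a JSON list of single-key objects, one per class. If you only need the most likely class in your application, a small helper like the one below would do; it’s a sketch that assumes the response keeps the shape shown above, and the function name is mine.

```python
import json


def top_prediction(response_body):
    """Return (class_name, probability) for the most likely class."""
    predictions = json.loads(response_body)
    # Flatten [{"tiger_cat": 0.46}, {"tabby": 0.46}, ...] into (name, score) pairs.
    pairs = [(name, score) for entry in predictions for name, score in entry.items()]
    return max(pairs, key=lambda pair: pair[1])
```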

I then stop TorchServe with the ‘--stop’ option.

torchserve --stop

As you can see, it’s easy to get started with TorchServe using the default configuration. Now let me show you how to set it up for remote serving.

Configuring TorchServe for Remote Serving
Let’s create a configuration file for TorchServe, named config.properties (the default name). This file defines which model to load, and sets up remote serving. Here, I’m binding the server to all network interfaces (0.0.0.0), but you can restrict it to a specific address if you want to. As this is running on an EC2 instance, I need to make sure that ports 8080 and 8081 are open in the Security Group.

model_store=model_store
load_models=densenet161.mar
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081

Now I can start TorchServe in the same directory, without having to pass any command line arguments.

torchserve --start

Moving back to my local machine, I can now invoke TorchServe remotely, and get the same result.

curl -X POST http://ec2-54-85-61-250.compute-1.amazonaws.com:8080/predictions/densenet161 -T kitten.jpg
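
If you’d rather call the endpoint from application code, here is a minimal sketch using only the Python standard library. The helper names are hypothetical, and setting the content type to application/octet-stream is my assumption for sending the raw image bytes, so adjust it if your handler expects something else.

```python
import urllib.request


def build_prediction_request(host, model_name, payload, port=8080, scheme="http"):
    """Build a POST request for the /predictions/<model_name> URL."""
    url = f"{scheme}://{host}:{port}/predictions/{model_name}"
    headers = {"Content-Type": "application/octet-stream"}  # assumption: raw bytes
    return urllib.request.Request(url, data=payload, headers=headers, method="POST")


def predict(host, model_name, image_path, **kwargs):
    """Send an image file to the model server and return the response body."""
    with open(image_path, "rb") as image:
        request = build_prediction_request(host, model_name, image.read(), **kwargs)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8")


# Example usage (hostname taken from the curl command above):
# print(predict("ec2-54-85-61-250.compute-1.amazonaws.com", "densenet161", "kitten.jpg"))
```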

You probably noticed that I used HTTP. I’m guessing a lot of you will require HTTPS in production, so let me show you how to set it up.

Configuring TorchServe for HTTPS
TorchServe can use either the Java keystore or a certificate. I’ll go with the latter.

First, I create a certificate and a private key with openssl.

openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout mykey.key -out mycert.pem

Then, I update the configuration file to define the location of the certificate and key, and I bind TorchServe to its default secure ports (don’t forget to update the Security Group).

model_store=model_store
load_models=densenet161.mar
inference_address=https://0.0.0.0:8443
management_address=https://0.0.0.0:8444
private_key_file=mykey.key
certificate_file=mycert.pem

I restart TorchServe, and I can now invoke it with HTTPS. As I use a self-signed certificate, I need to pass the ‘--insecure’ flag to curl.

curl --insecure -X POST https://ec2-54-85-61-250.compute-1.amazonaws.com:8443/predictions/densenet161 -T kitten.jpg
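
For a Python client, the equivalent of the curl flag is an SSL context with verification turned off. The sketch below shows that option, plus a slightly safer variant that trusts only the certificate file created earlier; both function names are mine, and neither is a substitute for a properly signed certificate in production.

```python
import ssl


def insecure_context():
    """Rough equivalent of 'curl --insecure': skip all certificate checks."""
    context = ssl.create_default_context()
    context.check_hostname = False  # must be disabled before relaxing verify_mode
    context.verify_mode = ssl.CERT_NONE
    return context


def pinned_context(cert_path):
    """Trust only the given self-signed certificate, e.g. mycert.pem."""
    context = ssl.create_default_context(cafile=cert_path)
    # Assumption: the certificate does not list the EC2 hostname,
    # so hostname matching is skipped while the certificate itself is verified.
    context.check_hostname = False
    return context


# Example usage with urllib:
# urllib.request.urlopen(request, context=insecure_context())
```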

There’s a lot more to TorchServe configuration, and I encourage you to read its documentation!

Getting Started
TorchServe is available now at https://github.com/pytorch/serve.

Give it a try, and please send us feedback on GitHub.

– Julien

AWS DeepComposer – Now Generally Available With New Features

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/aws-deepcomposer-now-generally-available-with-new-features/

AWS DeepComposer, a creative way to get started with machine learning, was launched in preview at AWS re:Invent 2019. Today, I’m extremely happy to announce that DeepComposer is now available to all AWS customers, and that it has been expanded with new features.

A primer on AWS DeepComposer
If you’re new to AWS DeepComposer, here’s how to get started.

  • Log into the AWS DeepComposer console.
  • Learn about the service and how it uses generative AI.
  • Record a short musical tune, using either the virtual keyboard in the console, or a physical keyboard available for order on Amazon.com.
  • Select a pretrained model for your favorite genre.
  • Use this model to generate a new polyphonic composition based on your tune.
  • Play the composition in the console.
  • Export the composition, or share it on SoundCloud.

Now let’s look at the new features, which make it even easier to get started with generative AI.

Learning Capsules
DeepComposer is powered by Generative Adversarial Networks (aka GANs, research paper), a neural network architecture built specifically to generate new samples from an existing data set. A GAN pits two different neural networks against each other to produce original digital works based on sample inputs: with DeepComposer, you can train and optimize GAN models to create original music.

Until now, developers interested in building skills with GANs haven’t had an easy way to get started. In order to help them regardless of their background in ML or music, we are building a collection of easy learning capsules that introduce key concepts, and how to train and evaluate GANs. This includes a hands-on lab with step-by-step instructions and code to build a GAN model.

Once you’re familiar with GANs, you’ll be ready to move on to training your own model!

In-console Training
You now have the ability to train your own generative model right in the DeepComposer console, without having to write a single line of machine learning code.

First, let’s select a GAN architecture:

  • MuseGAN, by Hao-Wen Dong, Wen-Yi Hsiao, Li-Chia Yang and Yi-Hsuan Yang (research paper, GitHub): MuseGAN has been specifically designed for generating music. The generator in MuseGAN is composed of a shared network to learn a high level representation of the song, and a series of private networks to learn how to generate individual music tracks.
  • U-Net, by Olaf Ronneberger, Philipp Fischer and Thomas Brox (research paper, project page): U-Net has been extremely successful in the image translation domain (e.g. converting winter images to summer images), and it can also be used for music generation. It’s a simpler architecture than MuseGAN, and therefore easier for beginners to understand. If you’re curious what’s happening under the hood, you can learn more about the U-Net architecture in this Jupyter notebook.

Let’s go with MuseGAN, and give the new model a name.

Next, I just have to pick the dataset I want to train my model on.

Optionally, I can also set hyperparameters (i.e. training parameters), but I’ll go with default settings this time. Finally, I click on ‘Start training’, and AWS DeepComposer fires up a training job, taking care of all the infrastructure and machine learning setup for me.

About 8 hours later, the model has been trained, and I can use it to generate compositions. Here, I can add the new ‘rhythm assist’ feature, which helps correct the timing of musical notes in your input, and makes sure notes are in time with the beat.
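
To give you an intuition of what correcting note timing means, here is a toy quantization sketch that snaps note onsets to the nearest subdivision of the beat. This is only an illustration of the general idea: I’m not claiming it is how rhythm assist is implemented, and the function name and parameters are hypothetical.

```python
def quantize_onsets(onsets, bpm, subdivisions=4):
    """Snap note onset times (in seconds) to the nearest grid position.

    The grid divides each beat into `subdivisions` equal steps,
    e.g. sixteenth notes when subdivisions=4.
    """
    grid = 60.0 / bpm / subdivisions
    return [round(onset / grid) * grid for onset in onsets]
```

At 120 beats per minute, the grid step is 0.125 seconds, so a note played slightly early at 0.49 seconds would land exactly on the beat at 0.5 seconds.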

Getting started
AWS DeepComposer is available today in the US East (N. Virginia) region.

The service includes a 12-month Free Tier for all AWS customers, so you can generate 500 compositions using our sample models at no cost.

In addition to the Free Tier, ordering the keyboard from Amazon.com in the US, and linking it to the DeepComposer console will get you another 3 months of free trial!

Give AWS DeepComposer a try, and let us know what you think! You can send your feedback through your usual AWS Support contacts, or on the AWS Forum for DeepComposer.

– Julien

Now Available: Amazon ElastiCache Global Datastore for Redis

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/now-available-amazon-elasticache-global-datastore-for-redis/

In-memory data stores are widely used for application scalability, and developers have long appreciated their benefits for storing frequently accessed data, whether volatile or persistent. Systems like Redis help decouple databases and backends from incoming traffic, shedding most of the traffic that would have otherwise reached them, and reducing application latency for users.

Obviously, managing these servers is a critical task, and great care must be taken to keep them up and running no matter what. In a previous job, my team had to move a cluster of physical cache servers across hosting suites: one by one, they connected them to external batteries, unplugged external power, unracked them, and used an office trolley (!) to roll them to the other suite where they racked them again! It happened without any service interruption, but we all breathed a sigh of relief once this was done… Lose cache data on a high-traffic platform, and things get ugly. Fast. Fortunately, cloud infrastructure is more flexible! To help minimize service disruption should an incident occur, we have added many high-availability features to Amazon ElastiCache, our managed in-memory data store for Memcached and Redis: cluster mode, multi-AZ with automatic failover, etc.

As Redis is often used to serve low latency traffic to global users, customers have told us that they’d love to be able to replicate Amazon ElastiCache clusters across AWS regions. We listened to them, got to work, and today, we’re very happy to announce that this replication capability is now available for Redis clusters.

Introducing Amazon ElastiCache Global Datastore For Redis
In a nutshell, Amazon ElastiCache Global Datastore for Redis lets you replicate a cluster in one region to clusters in up to two other regions. Customers typically do this in order to:

  • Bring cached data closer to their users, in order to reduce network latency and improve application responsiveness.
  • Build disaster recovery capabilities, should a region be partially or totally unavailable.

Setting up a global datastore is extremely easy. First, you pick a cluster to be the primary cluster receiving writes from applications: this can either be a new cluster, or an existing cluster provided that it runs Redis 5.0.6 or above. Then, you add up to two secondary clusters in other regions which will receive updates from the primary.

This setup is available for all Redis configurations except single node clusters: of course, you can convert a single node cluster to a replication group cluster, and then use it as a primary cluster.

Last but not least, clusters that are part of a global datastore can be modified and resized as usual (adding or removing nodes, changing node type, adding or removing shards, adding or removing replica nodes).

Let’s do a quick demo.

Replicating a Redis Cluster Across Regions
Let me show you how to build from scratch a three-cluster global datastore: the primary cluster will be located in the us-east-1 region, and the two secondary clusters will be located in the us-west-1 and us-west-2 regions. For the sake of simplicity, I’ll use the same default configuration for all clusters: three cache.r5.large nodes, multi-AZ, one shard.

Heading out to the AWS Console, I click on ‘Global Datastore’, and then on ‘Create’ to create my global datastore. I’m asked if I’d like to create a new cluster supporting the datastore, or if I’d rather use an existing cluster. I go for the former, and create a cluster named global-ds-1-useast1.

I click on ‘Next’, and fill in details for a secondary cluster hosted in the us-west-1 region. I unimaginatively name it global-ds-1-us-west1.

Then, I add another secondary cluster in the us-west-2 region, named global-ds-1-uswest2: I go to ‘Global Datastore’, click on ‘Add Region’, and fill in cluster details.

A little while later, all three clusters are up, and have been associated with the global datastore.

Using the redis-cli client running on an Amazon Elastic Compute Cloud (EC2) instance hosted in the us-east-1 region, I can quickly connect to the cluster endpoint and check that it’s indeed operational.

[us-east-1-instance] $ redis-cli -h $US_EAST_1_CLUSTER_READWRITE
> ping
PONG
> set paris france
OK
> set berlin germany
OK
> set london uk
OK
> keys *
1) "london"
2) "berlin"
3) "paris"
> get paris
"france"

This looks fine. Using an EC2 instance hosted in the us-west-1 region, let’s now check that the data we stored in the primary cluster has been replicated to the us-west-1 secondary cluster.

[us-west-1-instance] $ redis-cli -h $US_WEST_1_CLUSTER_READONLY
> keys *
1) "london"
2) "berlin"
3) "paris"
> get paris
"france"

Nice. Now let’s add some more data on the primary cluster…

> hset Parsifal composer "Richard Wagner" date 1882 acts 3 language "German"
> hset DonGiovanni composer "W.A. Mozart" date 1787 acts 2 language "Italian"
> hset Tosca composer "Giacomo Puccini" date 1900 acts 3 language "Italian"

…and check as quickly as possible on the secondary cluster.

> keys *
1) "DonGiovanni"
2) "london"
3) "berlin"
4) "Parsifal"
5) "Tosca"
6) "paris"
> hget Parsifal composer
"Richard Wagner"

That was fast: by the time I switched to the other terminal and ran the command, the new data was already there. That’s not really surprising since the typical network latency for cross region traffic ranges from 60 milliseconds to 200 milliseconds depending on regions.

Now, what would happen if something went wrong with our primary cluster hosted in us-east-1? Well, we could easily promote one of the secondary clusters to full read/write capabilities.
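
I’m using the console here, but the same promotion could be scripted. Below is a minimal sketch built around the FailoverGlobalReplicationGroup API as exposed by the AWS SDK for Python (boto3); the helper names are mine, the identifiers in the usage comment are placeholders (check the console for the actual global datastore ID), and which region the client should target is something you should confirm in the documentation.

```python
def failover_parameters(global_datastore_id, new_primary_region, new_primary_cluster_id):
    """Build the arguments for promoting a secondary cluster to primary."""
    return {
        "GlobalReplicationGroupId": global_datastore_id,
        "PrimaryRegion": new_primary_region,
        "PrimaryReplicationGroupId": new_primary_cluster_id,
    }


def promote_secondary(client, global_datastore_id, new_primary_region, new_primary_cluster_id):
    """Ask ElastiCache to make the given secondary cluster the new primary."""
    parameters = failover_parameters(
        global_datastore_id, new_primary_region, new_primary_cluster_id
    )
    return client.failover_global_replication_group(**parameters)


# Example usage (requires boto3 and AWS credentials; IDs are placeholders):
# import boto3
# client = boto3.client("elasticache", region_name="us-west-1")
# promote_secondary(client, "<global-datastore-id>", "us-west-1", "global-ds-1-us-west1")
```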

For good measure, I also remove the us-east-1 cluster from the global datastore. Once this is complete, the global datastore looks like this.

Now, using my EC2 instance in the us-west-1 region, and connecting to the read/write endpoint of my cluster, I add more data…

[us-west-1-instance] $ redis-cli -h $US_WEST_1_CLUSTER_READWRITE
> hset Lohengrin composer "Richard Wagner" date 1850 acts 3 language "German"

… and check that it’s been replicated to the us-west-2 cluster.

[us-west-2-instance] $ redis-cli -h $US_WEST_2_CLUSTER_READONLY
> hgetall Lohengrin
1) "composer"
2) "Richard Wagner"
3) "date"
4) "1850"
5) "acts"
6) "3"
7) "language"
8) "German"

It’s all there. Global datastores make it really easy to replicate Amazon ElastiCache data across regions!

Now Available!
This new global datastore feature is available today in US East (N. Virginia), US East (Ohio), US West (N. California), US West (Oregon), Asia Pacific (Seoul), Asia Pacific (Sydney), Asia Pacific (Singapore), Asia Pacific (Tokyo), Europe (Frankfurt), Europe (Ireland), Europe (London). Please give it a try and send us feedback, either on the AWS forum for Amazon ElastiCache, or through your usual AWS support contacts.

– Julien

Now available in Amazon Transcribe: Automatic Redaction of Personally Identifiable Information

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/now-available-in-amazon-transcribe-automatic-redaction-of-personally-identifiable-information/

Launched at AWS re:Invent 2017, Amazon Transcribe is an automatic speech recognition (ASR) service that makes it easy for AWS customers to add speech-to-text capabilities to their applications. At the time of writing, Transcribe supports 31 languages, 6 of which can be transcribed in real-time.

A popular use case for Transcribe is the automatic transcription of customer calls (call centers, telemarketing, etc.), in order to build data sets for downstream analytics and natural language processing tasks, such as sentiment analysis. Because these calls often contain personal details, any Personally Identifiable Information (PII) should be removed to protect privacy, and to comply with local laws and regulations.

As you can imagine, doing this manually is quite tedious, time-consuming, and error-prone, which is why we’re extremely happy to announce that Amazon Transcribe now supports automatic redaction of PII.

Introducing Content Redaction in Amazon Transcribe
If instructed to do so, Transcribe will automatically identify the following pieces of PII:

  • Social Security Number,
  • Credit card/Debit card number,
  • Credit card/Debit card expiration date,
  • Credit card/Debit card CVV code,
  • Bank account number,
  • Bank routing number,
  • Debit/Credit card PIN,
  • Name,
  • Email address,
  • Phone number (10 digits),
  • Mailing address.

They will be replaced with a ‘[PII]’ tag in the transcribed text. You also get a redaction confidence score (instead of the usual ASR score), as well as start and end timestamps. These timestamps will help you locate PII in your audio files for secure storage and sharing, or for additional audio processing to redact it at the source.

This feature is extremely easy to use. Let’s do a quick demo.

Redacting Personal Information with Amazon Transcribe
First, I’ve recorded a short sound file full of personal information (of course, it’s all fake). I’m using the mp3 format here, but we recommend that you use lossless formats like FLAC or WAV for maximum accuracy.

Then, I upload this file to an S3 bucket using the AWS CLI.

$ aws s3 cp julien.mp3 s3://jsimon-transcribe-us-east-1

The next step is to transcribe this sound file using the StartTranscriptionJob API: why not use the AWS SDK for PHP this time?

<?php
require 'aws.phar';

use Aws\TranscribeService\TranscribeServiceClient;

$client = new TranscribeServiceClient([
    'profile' => 'default',
    'region' => 'us-east-1',
    'version' => '2017-10-26'
]);

$result = $client->startTranscriptionJob([
    'LanguageCode' => 'en-US',
    'Media' => [
        'MediaFileUri' => 's3://jsimon-transcribe-us-east-1/julien.mp3',
    ],
    'MediaFormat' => 'mp3',
    'OutputBucketName' => 'jsimon-transcribe-us-east-1',
    'ContentRedaction' => [
        'RedactionType' => 'PII',
        'RedactionOutput' => 'redacted'
    ],
    'TranscriptionJobName' => 'redactiontest'
]);
?>

A single API call is really all it takes. The RedactionOutput parameter lets me control whether I want both the full and the redacted output, or just the redacted output. I go for the latter. Now, let’s run this script.

$ php transcribe.php

Immediately, I can see the job running in the Transcribe console.

I could also use the GetTranscriptionJob and ListTranscriptionJobs APIs to check that content redaction has been applied. Once the job is complete, I simply fetch the transcription from my S3 bucket.

$ aws s3 cp s3://jsimon-transcribe-us-east-1/redacted-redactiontest.json .

The transcription is a JSON document containing detailed information about each word. Here, I’m only interested in the full transcript, so I use a nice open source tool called jq to filter the document.

$ cat redacted-redactiontest.json | jq '.results.transcripts'
[{
"transcript": "Good morning, everybody. My name is [PII], and today I feel like sharing a whole lot of personal information with you. Let's start with my Social Security number [PII]. My credit card number is [PII] And my C V V code is [PII] My bank account number is [PII] My email address is [PII], and my phone number is [PII]. Well, I think that's it. You know a whole lot about me. And I hope that Amazon transcribe is doing a good job at redacting that personal information away. Let's check."
}]
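
The timestamps mentioned earlier live in the same JSON document. Here is a minimal sketch that extracts the full transcript and the start and end times of each redacted span. It assumes the usual Transcribe output layout (results.transcripts and results.items, with times stored as strings) and that each redacted span shows up as an item whose content is ‘[PII]’, so please check it against your own output file; the function names are mine.

```python
import json


def full_transcript(document):
    """Return the transcript text, like the jq filter above."""
    return document["results"]["transcripts"][0]["transcript"]


def redacted_spans(document):
    """Return (start, end) times in seconds for each item redacted as [PII]."""
    spans = []
    for item in document["results"]["items"]:
        content = item["alternatives"][0]["content"]
        # Punctuation items carry no timestamps, hence the extra check.
        if content == "[PII]" and "start_time" in item:
            spans.append((float(item["start_time"]), float(item["end_time"])))
    return spans


# Example usage:
# with open("redacted-redactiontest.json") as f:
#     document = json.load(f)
# print(redacted_spans(document))
```

These spans could then drive an audio editing step that mutes the matching sections of the recording.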

Well done, Amazon Transcribe. My privacy is safe.

Now available!
The content redaction feature is available for US English in the following regions:

  • US East (N. Virginia), US East (Ohio), US West (N. California), US West (Oregon), AWS GovCloud (US-West),
  • Canada (Central), South America (São Paulo),
  • Europe (Ireland), Europe (London), Europe (Paris), Europe (Frankfurt),
  • Middle East (Bahrain),
  • Asia Pacific (Mumbai), Asia Pacific (Hong Kong), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo).

Take a look at the pricing page, give the feature a try, and please send us feedback either in the AWS forum for Amazon Transcribe or through your usual AWS support contacts.

– Julien

EC2 Price Reduction in the São Paulo Region (R5 and I3)

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/ec2-price-reduction-in-the-sao-paulo-region-r5-and-i3/

I’ve got good news for AWS customers using our South America (São Paulo) Region!

Effective February 1, 2020, we are reducing prices for On-Demand, Reserved and Dedicated Instances as follows:

  • All R5 families (R5, R5a, R5d, R5ad) – Up to 25%.
  • All I3 families (I3, I3en) – 13%.

The pricing pages have been updated.

Questions?
If you need assistance or have feedback, please reach out to your usual AWS support contacts, or post a message in the AWS Forum for Amazon EC2.

– Julien

Update on Amazon Linux AMI end-of-life

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/update-on-amazon-linux-ami-end-of-life/

Launched in September 2010, the Amazon Linux AMI has helped numerous customers build Linux-based applications on Amazon Elastic Compute Cloud (EC2). In order to bring them even more security, stability, and productivity, we introduced Amazon Linux 2 in 2017. Adding many modern features, Amazon Linux 2 is backed by long-term support, and we strongly encourage you to use it for your new applications.

As documented in the FAQ, the last version of the Amazon Linux AMI (2018.03) was scheduled to reach end-of-life on June 30, 2020. Based on customer feedback, we are extending the end-of-life date, and we’re also announcing a maintenance support period.

End-of-life Extension
The end-of-life for Amazon Linux AMI is now extended to December 31, 2020: until then, we will continue to provide security updates and refreshed versions of packages as needed.

Maintenance Support
Beyond December 31, 2020, the Amazon Linux AMI will enter a new maintenance support period that extends to June 30, 2023.

During this maintenance support period:

  • The Amazon Linux AMI will only receive critical and important security updates for a reduced set of packages.
  • It will no longer be guaranteed to support new EC2 platform capabilities, or new AWS features.

Supported packages will include:

  • The Linux kernel,
  • Low-level system libraries such as glibc and openssl,
  • Popular packages that are still in a supported state in their upstream sources, such as MySQL and PHP.

We will provide a detailed list of supported and unsupported packages in future posts.

Questions?
If you need assistance or have feedback, please reach out to your usual AWS support contacts, or post a message in the AWS Forum for Amazon Linux. Thank you for using Amazon Linux AMI!

– Julien

Amazon SageMaker Studio: The First Fully Integrated Development Environment For Machine Learning

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/amazon-sagemaker-studio-the-first-fully-integrated-development-environment-for-machine-learning/

Today, we’re extremely happy to launch Amazon SageMaker Studio, the first fully integrated development environment (IDE) for machine learning (ML).

We have come a long way since we launched Amazon SageMaker in 2017, as shown by the growing number of customers using the service. However, the ML development workflow is still very iterative, and is challenging for developers to manage due to the relative immaturity of ML tooling. Many of the tools which developers take for granted when building traditional software (debuggers, project management, collaboration, monitoring, and so forth) have yet to be invented for ML.

For example, when trying a new algorithm or tweaking hyperparameters, developers and data scientists typically run hundreds or even thousands of experiments on Amazon SageMaker, and they need to manage all this manually. Over time, it becomes much harder to track the best performing models, and to capitalize on lessons learned during the course of experimentation.

Amazon SageMaker Studio unifies at last all the tools needed for ML development. Developers can write code, track experiments, visualize data, and perform debugging and monitoring all within a single, integrated visual interface, which significantly boosts developer productivity.

In addition, since all these steps of the ML workflow are tracked within the environment, developers can quickly move back and forth between steps, and also clone, tweak, and replay them. This gives developers the ability to make changes quickly, observe outcomes, and iterate faster, reducing the time to market for high quality ML solutions.

Introducing Amazon SageMaker Studio
Amazon SageMaker Studio lets you manage your entire ML workflow through a single pane of glass. Let me give you the whirlwind tour!

With Amazon SageMaker Notebooks (currently in preview), you can enjoy an enhanced notebook experience that lets you easily create and share Jupyter notebooks. Without having to manage any infrastructure, you can also quickly switch from one hardware configuration to another.

With Amazon SageMaker Experiments, you can organize, track and compare thousands of ML jobs: these can be training jobs, or data processing and model evaluation jobs run with Amazon SageMaker Processing.

With Amazon SageMaker Debugger, you can debug and analyze complex training issues, and receive alerts. It automatically introspects your models, collects debugging data, and analyzes it to provide real-time alerts and advice on ways to optimize your training times, and improve model quality. All information is visible as your models are training.

With Amazon SageMaker Model Monitor, you can detect quality deviations for deployed models, and receive alerts. You can easily visualize issues like data drift that could be affecting your models. No code needed: all it takes is a few clicks.

With Amazon SageMaker Autopilot, you can build models automatically with full control and visibility. Algorithm selection, data preprocessing, and model tuning are taken care of automatically, as well as all infrastructure.

Thanks to these new capabilities, Amazon SageMaker now covers the complete ML workflow to build, train, and deploy machine learning models, quickly and at any scale.

The services mentioned above, except for Amazon SageMaker Notebooks, are covered in individual blog posts (see below) showing you how to quickly get started, so keep your eyes peeled and read on!

Now Available!
Amazon SageMaker Studio is available today in US East (Ohio).

Give it a try, and please send us feedback either in the AWS forum for Amazon SageMaker, or through your usual AWS support contacts.

– Julien

Amazon SageMaker Debugger – Debug Your Machine Learning Models

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/amazon-sagemaker-debugger-debug-your-machine-learning-models/

Today, we’re extremely happy to announce Amazon SageMaker Debugger, a new capability of Amazon SageMaker that automatically identifies complex issues developing in machine learning (ML) training jobs.

Building and training ML models is a mix of science and craft (some would even say witchcraft). From collecting and preparing data sets to experimenting with different algorithms to figuring out optimal training parameters (the dreaded hyperparameters), ML practitioners need to clear quite a few hurdles to deliver high-performance models. This is the very reason why we built Amazon SageMaker: a modular, fully managed service that simplifies and speeds up ML workflows.

As I keep finding out, ML seems to be one of Mr. Murphy’s favorite hangouts, and everything that may possibly go wrong often does! In particular, many obscure issues can happen during the training process, preventing your model from correctly extracting and learning patterns present in your data set. I’m not talking about software bugs in ML libraries (although they do happen too): most failed training jobs are caused by an inappropriate initialization of parameters, a poor combination of hyperparameters, a design issue in your own code, etc.

To make things worse, these issues are rarely visible immediately: they grow over time, slowly but surely ruining your training process, and yielding low accuracy models. Let’s face it, even if you’re a bona fide expert, it’s devilishly difficult and time-consuming to identify them and hunt them down, which is why we built Amazon SageMaker Debugger.

Let me tell you more.

Introducing Amazon SageMaker Debugger
In your existing training code for TensorFlow, Keras, Apache MXNet, PyTorch and XGBoost, you can use the new SageMaker Debugger SDK to save internal model state at periodic intervals; as you can guess, it will be stored in Amazon Simple Storage Service (S3).

This state is composed of:

  • The parameters being learned by the model, e.g. weights and biases for neural networks,
  • The changes applied to these parameters by the optimizer, aka gradients,
  • The optimization parameters themselves,
  • Scalar values, e.g. accuracies and losses,
  • The output of each layer,
  • Etc.

Each specific set of values – say, the sequence of gradients flowing over time through a specific neural network layer – is saved independently, and referred to as a tensor. Tensors are organized in collections (weights, gradients, etc.), and you can decide which ones you want to save during training. Then, using the SageMaker SDK and its estimators, you configure your training job as usual, passing additional parameters defining the rules you want SageMaker Debugger to apply.

A rule is a piece of Python code that analyses tensors for the model in training, looking for specific unwanted conditions. Pre-defined rules are available for common problems such as exploding/vanishing tensors (parameters reaching NaN or zero values), exploding/vanishing gradients, loss not changing, and more. Of course, you can also write your own rules.
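
To make the idea of a rule more concrete, here is a minimal sketch in plain Python of the kind of check an exploding-tensor rule performs. It is not the SageMaker Debugger rule API: the function name `first_exploding_step` and the `{step: values}` layout are assumptions made purely for illustration.

```python
import math


def first_exploding_step(tensor_by_step):
    """Return the first step whose values contain NaN or infinity, else None.

    tensor_by_step maps a step number to a flat list of floats
    (a hypothetical layout, chosen only for this illustration).
    """
    for step in sorted(tensor_by_step):
        if any(math.isnan(v) or math.isinf(v) for v in tensor_by_step[step]):
            return step
    return None


# Hypothetical gradient values saved at three consecutive steps
saved = {
    0: [1.15e23, 1.08e23],
    1: [1.03e23, 1.13e23],
    2: [float("nan"), float("nan")],
}
print(first_exploding_step(saved))
```

A real rule runs the same kind of scan continuously, as new tensors become available during training.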

Once the SageMaker estimator is configured, you can launch the training job. Immediately, it fires up a debug job for each rule that you configured, and they start inspecting available tensors. If a debug job detects a problem, it stops and logs additional information. A CloudWatch Events event is also sent, should you want to trigger additional automated steps.

So now you know that your deep learning job suffers from, say, vanishing gradients. With a little brainstorming and experience, you’ll know where to look: maybe the neural network is too deep? Maybe your learning rate is too small? As the internal state has been saved to S3, you can now use the SageMaker Debugger SDK to explore the evolution of tensors over time, confirm your hypothesis and fix the root cause.

Let’s see SageMaker Debugger in action with a quick demo.

Debugging Machine Learning Models with Amazon SageMaker Debugger
At the core of SageMaker Debugger is the ability to capture tensors during training. This requires a little bit of instrumentation in your training code, in order to select the tensor collections you want to save, the frequency at which you want to save them, and whether you want to save the values themselves or a reduction (mean, maximum, norm, etc.).
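
To illustrate what saving a reduction means, here is a small sketch in plain Python that summarizes a tensor with three scalars instead of keeping every value. The function `reduce_tensor` and its output keys are hypothetical names; this is not how the SageMaker Debugger SDK is implemented, only the idea behind it.

```python
def reduce_tensor(values):
    """Summarize a flat list of floats with three reductions."""
    return {
        "mean": sum(values) / len(values),        # plain average
        "abs_max": max(abs(v) for v in values),   # largest magnitude
        "l1_norm": sum(abs(v) for v in values),   # sum of magnitudes
    }


weights = [10.0, -4.0, 0.5, -0.5]
summary = reduce_tensor(weights)
print(summary)
```

Three numbers per step are much cheaper to store than a full tensor, at the cost of losing the individual values.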

For this purpose, the SageMaker Debugger SDK provides simple APIs for each framework that it supports. Let me show you how this works with a simple TensorFlow script, trying to fit a 2-dimensional linear regression model. Of course, you’ll find more examples in this GitHub repository.

Let’s take a look at the initial code:

import argparse
import numpy as np
import tensorflow as tf
import random

parser = argparse.ArgumentParser()
parser.add_argument('--model_dir', type=str, help="S3 path for the model")
parser.add_argument('--lr', type=float, help="Learning Rate", default=0.001)
parser.add_argument('--steps', type=int, help="Number of steps to run", default=100)
parser.add_argument('--scale', type=float, help="Scaling factor for inputs", default=1.0)

args = parser.parse_args()

with tf.name_scope('initialize'):
    # 2-dimensional input sample
    x = tf.placeholder(shape=(None, 2), dtype=tf.float32)
    # Initial weights: [10, 10]
    w = tf.Variable(initial_value=[[10.], [10.]], name='weight1')
    # True weights, i.e. the ones we're trying to learn
    w0 = [[1], [1.]]
with tf.name_scope('multiply'):
    # Compute true label
    y = tf.matmul(x, w0)
    # Compute "predicted" label
    y_hat = tf.matmul(x, w)
with tf.name_scope('loss'):
    # Compute loss
    loss = tf.reduce_mean((y_hat - y) ** 2, name="loss")

optimizer = tf.train.AdamOptimizer(args.lr)
optimizer_op = optimizer.minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(args.steps):
        x_ = np.random.random((10, 2)) * args.scale
        _loss, opt = sess.run([loss, optimizer_op], {x: x_})
        print (f'Step={i}, Loss={_loss}')

Let’s train this script using the TensorFlow Estimator. I’m using SageMaker local mode, which is a great way to quickly iterate on experimental code.

import sagemaker
from sagemaker.tensorflow import TensorFlow

bad_hyperparameters = {'steps': 10, 'lr': 100, 'scale': 100000000000}

estimator = TensorFlow(
    role=sagemaker.get_execution_role(),
    base_job_name='debugger-simple-demo',
    train_instance_count=1,
    train_instance_type='local',
    entry_point='script-v1.py',
    framework_version='1.13.1',
    py_version='py3',
    script_mode=True,
    hyperparameters=bad_hyperparameters)

Looking at the training log, things did not go well.

Step=0, Loss=7.883463958023267e+23
algo-1-hrvqg_1 | Step=1, Loss=9.502028841062608e+23
algo-1-hrvqg_1 | Step=2, Loss=nan
algo-1-hrvqg_1 | Step=3, Loss=nan
algo-1-hrvqg_1 | Step=4, Loss=nan
algo-1-hrvqg_1 | Step=5, Loss=nan
algo-1-hrvqg_1 | Step=6, Loss=nan
algo-1-hrvqg_1 | Step=7, Loss=nan
algo-1-hrvqg_1 | Step=8, Loss=nan
algo-1-hrvqg_1 | Step=9, Loss=nan

Loss does not decrease at all, and even turns into NaN after two steps… This looks like an exploding tensor problem, which is one of the built-in rules defined in SageMaker Debugger. Let’s get to work.

Using the Amazon SageMaker Debugger SDK
In order to capture tensors, I need to instrument the training script with:

  • A SaveConfig object specifying the frequency at which tensors should be saved,
  • A SessionHook object attached to the TensorFlow session, putting everything together and saving required tensors during training,
  • An (optional) ReductionConfig object, listing tensor reductions that should be saved instead of full tensors,
  • An (optional) optimizer wrapper to capture gradients.

Here’s the updated code, with extra command line arguments for SageMaker Debugger parameters.

import argparse
import numpy as np
import tensorflow as tf
import random
import smdebug.tensorflow as smd

parser = argparse.ArgumentParser()
parser.add_argument('--model_dir', type=str, help="S3 path for the model")
parser.add_argument('--lr', type=float, help="Learning Rate", default=0.001)
parser.add_argument('--steps', type=int, help="Number of steps to run", default=100)
parser.add_argument('--scale', type=float, help="Scaling factor for inputs", default=1.0)
parser.add_argument('--debug_path', type=str, default='/opt/ml/output/tensors')
parser.add_argument('--debug_frequency', type=int, help="How often to save tensor data", default=10)
feature_parser = parser.add_mutually_exclusive_group(required=False)
feature_parser.add_argument('--reductions', dest='reductions', action='store_true', help="save reductions of tensors instead of saving full tensors")
feature_parser.add_argument('--no_reductions', dest='reductions', action='store_false', help="save full tensors")
args = parser.parse_args()

reduc = smd.ReductionConfig(reductions=['mean'], abs_reductions=['max'], norms=['l1']) if args.reductions else None

hook = smd.SessionHook(out_dir=args.debug_path,
                       include_collections=['weights', 'gradients', 'losses'],
                       save_config=smd.SaveConfig(save_interval=args.debug_frequency),
                       reduction_config=reduc)

with tf.name_scope('initialize'):
    # 2-dimensional input sample
    x = tf.placeholder(shape=(None, 2), dtype=tf.float32)
    # Initial weights: [10, 10]
    w = tf.Variable(initial_value=[[10.], [10.]], name='weight1')
    # True weights, i.e. the ones we're trying to learn
    w0 = [[1], [1.]]
with tf.name_scope('multiply'):
    # Compute true label
    y = tf.matmul(x, w0)
    # Compute "predicted" label
    y_hat = tf.matmul(x, w)
with tf.name_scope('loss'):
    # Compute loss
    loss = tf.reduce_mean((y_hat - y) ** 2, name="loss")
    hook.add_to_collection('losses', loss)

optimizer = tf.train.AdamOptimizer(args.lr)
optimizer = hook.wrap_optimizer(optimizer)
optimizer_op = optimizer.minimize(loss)

hook.set_mode(smd.modes.TRAIN)

with tf.train.MonitoredSession(hooks=[hook]) as sess:
    for i in range(args.steps):
        x_ = np.random.random((10, 2)) * args.scale
        _loss, opt = sess.run([loss, optimizer_op], {x: x_})
        print (f'Step={i}, Loss={_loss}')

I also need to modify the TensorFlow Estimator, to use the SageMaker Debugger-enabled training container and to pass additional parameters.

bad_hyperparameters = {'steps': 10, 'lr': 100, 'scale': 100000000000, 'debug_frequency': 1}

from sagemaker.debugger import Rule, rule_configs
estimator = TensorFlow(
    role=sagemaker.get_execution_role(),
    base_job_name='debugger-simple-demo',
    train_instance_count=1,
    train_instance_type='ml.c5.2xlarge',
    image_name=cpu_docker_image_name,
    entry_point='script-v2.py',
    framework_version='1.15',
    py_version='py3',
    script_mode=True,
    hyperparameters=bad_hyperparameters,
    rules = [Rule.sagemaker(rule_configs.exploding_tensor())]
)

estimator.fit()

2019-11-27 10:42:02 Starting - Starting the training job...
2019-11-27 10:42:25 Starting - Launching requested ML instances
********* Debugger Rule Status *********
*
* ExplodingTensor: InProgress 
*
****************************************

Two jobs are running: the actual training job, and a debug job checking for the rule defined in the Estimator. Quickly, the debug job fails!

Describing the training job, I can get more information on what happened.

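job_name = estimator.latest_training_job.name
client = estimator.sagemaker_session.sagemaker_client
description = client.describe_training_job(TrainingJobName=job_name)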
print(description['DebugRuleEvaluationStatuses'][0]['RuleConfigurationName'])
print(description['DebugRuleEvaluationStatuses'][0]['RuleEvaluationStatus'])

ExplodingTensor
IssuesFound
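
With several rules configured, indexing the first entry is not enough. The following sketch filters a job description for rules that reported issues. It only relies on the two fields printed above; the helper name `rules_with_issues` and the sample dictionary (including its second rule entry) are illustrative assumptions, not a real API response.

```python
def rules_with_issues(description):
    """Return names of rules whose evaluation status is 'IssuesFound'."""
    statuses = description.get("DebugRuleEvaluationStatuses", [])
    return [
        s["RuleConfigurationName"]
        for s in statuses
        if s["RuleEvaluationStatus"] == "IssuesFound"
    ]


# Trimmed, made-up response containing only the fields used above
sample_description = {
    "DebugRuleEvaluationStatuses": [
        {"RuleConfigurationName": "ExplodingTensor",
         "RuleEvaluationStatus": "IssuesFound"},
        {"RuleConfigurationName": "LossNotDecreasing",
         "RuleEvaluationStatus": "InProgress"},
    ]
}
print(rules_with_issues(sample_description))
```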

Let’s take a look at the saved tensors.

Exploring Tensors
I can easily grab the tensors saved in S3 during the training process.

from smdebug.trials import create_trial

s3_output_path = description["DebugConfig"]["DebugHookConfig"]["S3OutputPath"]
trial = create_trial(s3_output_path)

Let’s list available tensors.

trial.tensors()

['loss/loss:0',
'gradients/multiply/MatMul_1_grad/tuple/control_dependency_1:0',
'initialize/weight1:0']

All values are numpy arrays, and I can easily iterate over them.

tensor = 'gradients/multiply/MatMul_1_grad/tuple/control_dependency_1:0'
for s in list(trial.tensor(tensor).steps()):
    print("Value: ", trial.tensor(tensor).step(s).value)

Value:  [[1.1508383e+23] [1.0809098e+23]]
Value:  [[1.0278440e+23] [1.1347468e+23]]
Value:  [[nan] [nan]]
Value:  [[nan] [nan]]
Value:  [[nan] [nan]]
Value:  [[nan] [nan]]
Value:  [[nan] [nan]]
Value:  [[nan] [nan]]
Value:  [[nan] [nan]]
Value:  [[nan] [nan]]

As tensor names include the TensorFlow scope defined in the training code, I can easily see that something is wrong with my matrix multiplication.

# Compute true label
y = tf.matmul(x, w0)
# Compute "predicted" label
y_hat = tf.matmul(x, w)

Digging a little deeper, the x input is modified by a scaling parameter, which I set to 100000000000 in the Estimator. The learning rate doesn’t look sane either. Bingo!

x_ = np.random.random((10, 2)) * args.scale

bad_hyperparameters = {'steps': 10, 'lr': 100, 'scale': 100000000000, 'debug_frequency': 1}

As you probably knew all along, setting these hyperparameters to more reasonable values will fix the training issue.
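
A rough back-of-the-envelope check shows why this scale is fatal. Inputs are drawn uniformly in [0, 1) and multiplied by the scale, and both weights start 9 away from their true value, so the expected initial loss is 81 × scale² × 7/6, where 7/6 is the expected value of (u1 + u2)² for two independent uniform samples. The sketch below is plain Python rather than TensorFlow, and the constant names are mine. It also checks that the square of a gradient of the magnitude seen above does not fit in a 32-bit float, which is a plausible (though unverified here) source of the NaN values, since Adam keeps a running average of squared gradients.

```python
import random

SCALE = 1e11                    # 'scale' hyperparameter from the demo
W_INIT, W_TRUE = 10.0, 1.0      # initial and true weight values
FLOAT32_MAX = 3.4028235e38      # largest finite 32-bit float

# error = (W_INIT - W_TRUE) * (x1 + x2), with x = SCALE * u and u ~ U(0, 1)
# E[(u1 + u2)^2] = Var(u1 + u2) + E[u1 + u2]^2 = 2/12 + 1 = 7/6
expected_loss = (W_INIT - W_TRUE) ** 2 * SCALE ** 2 * 7 / 6

# Monte Carlo check with one batch of 10 samples, as in the training loop
random.seed(0)
batch = [(random.random() * SCALE, random.random() * SCALE) for _ in range(10)]
batch_loss = sum(((W_INIT - W_TRUE) * (x1 + x2)) ** 2 for x1, x2 in batch) / len(batch)

# Gradient magnitude as saved during training, and its square
gradient = 1.15e23
squared_gradient = gradient ** 2

print(f"expected loss  ~ {expected_loss:.3e}")
print(f"one batch loss ~ {batch_loss:.3e}")
print(f"squared gradient exceeds float32 max: {squared_gradient > FLOAT32_MAX}")
```

The expected loss comes out around 9.45e23, the same order of magnitude as the first two values in the training log.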

Now Available!
We believe Amazon SageMaker Debugger will help you find and solve training issues quicker, so it’s now your turn to go bug hunting.

Amazon SageMaker Debugger is available today in all commercial regions where Amazon SageMaker is available. Give it a try and please send us feedback, either on the AWS forum for Amazon SageMaker, or through your usual AWS support contacts.

– Julien

Amazon SageMaker Model Monitor – Fully Managed Automatic Monitoring For Your Machine Learning Models

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/amazon-sagemaker-model-monitor-fully-managed-automatic-monitoring-for-your-machine-learning-models/

Today, we’re extremely happy to announce Amazon SageMaker Model Monitor, a new capability of Amazon SageMaker that automatically monitors machine learning (ML) models in production, and alerts you when data quality issues appear.

The first thing I learned when I started working with data is that there is no such thing as paying too much attention to data quality. Raise your hand if you’ve spent hours hunting down problems caused by unexpected NULL values or by exotic character encodings that somehow ended up in one of your databases.

As models are literally built from large amounts of data, it’s easy to see why ML practitioners spend so much time caring for their data sets. In particular, they make sure that data samples in the training set (used to train the model) and in the validation set (used to measure its accuracy) have the same statistical properties.

There be monsters! Although you have full control over your experimental data sets, the same can’t be said for real-life data that your models will receive. Of course, that data will be unclean, but a more worrisome problem is “data drift”, i.e. a gradual shift in the very statistical nature of the data you receive. Minimum and maximum values, mean, variance, and more: all these are key attributes that shape assumptions and decisions made during the training of a model. Intuitively, you can surely feel that any significant change in these values would impact the accuracy of predictions: imagine a loan application predicting higher amounts because input features are drifting or even missing!

Detecting these conditions is pretty difficult: you would need to capture data received by your models, run all kinds of statistical analysis to compare that data to the training set, define rules to detect drift, send alerts if it happens… and do it all over again each time you update your models. Expert ML practitioners certainly know how to build these complex tools, but at the great expense of time and resources. Undifferentiated heavy lifting strikes again…

To help all customers focus on creating value instead, we built Amazon SageMaker Model Monitor. Let me tell you more.

Introducing Amazon SageMaker Model Monitor
A typical monitoring session goes like this. You first start from a SageMaker endpoint to monitor, either an existing one, or a new one created specifically for monitoring purposes. You can use SageMaker Model Monitor on any endpoint, whether the model was trained with a built-in algorithm, a built-in framework, or your own container.

Using the SageMaker SDK, you can capture a configurable fraction of the data sent to the endpoint (you can also capture predictions if you’d like), and store it in one of your Amazon Simple Storage Service (S3) buckets. Captured data is enriched with metadata (content type, timestamp, etc.), and you can secure and access it just like any S3 object.

Then, you create a baseline from the data set that was used to train the model deployed on the endpoint (of course, you can reuse an existing baseline too). This will fire up an Amazon SageMaker Processing job where SageMaker Model Monitor will:

  • Infer a schema for the input data, i.e. type and completeness information for each feature. You should review it, and update it if needed.
  • For pre-built containers only, compute feature statistics using Deequ, an open source tool based on Apache Spark that is developed and used at Amazon (blog post and research paper). These statistics include KLL sketches, an advanced technique to compute accurate quantiles on streams of data, that we recently contributed to Deequ.

Using these artifacts, the next step is to launch a monitoring schedule, to let SageMaker Model Monitor inspect collected data and prediction quality. Whether you’re using a built-in or custom container, a number of built-in rules are applied, and reports are periodically pushed to S3. The reports contain statistics and schema information on the data received during the latest time frame, as well as any violation that was detected.

Last but not least, SageMaker Model Monitor emits per-feature metrics to Amazon CloudWatch, which you can use to set up dashboards and alerts. The summary metrics from CloudWatch are also visible in Amazon SageMaker Studio, and of course all statistics, monitoring results and data collected can be viewed and further analyzed in a notebook.

For more information and an example on how to use SageMaker Model Monitor using AWS CloudFormation, refer to the developer guide.

Now, let’s do a demo, using a churn prediction model trained with the built-in XGBoost algorithm.

Enabling Data Capture
The first step is to create an endpoint configuration to enable data capture. Here, I decide to capture 100% of incoming data, as well as model output (i.e. predictions). I’m also passing the content types for CSV and JSON data.

data_capture_configuration = {
    "EnableCapture": True,
    "InitialSamplingPercentage": 100,
    "DestinationS3Uri": s3_capture_upload_path,
    "CaptureOptions": [
        { "CaptureMode": "Output" },
        { "CaptureMode": "Input" }
    ],
    "CaptureContentTypeHeader": {
        "CsvContentTypes": ["text/csv"],
        "JsonContentTypes": ["application/json"]
    }
}

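Next, I create the endpoint configuration using the CreateEndpointConfig API, passing the data capture configuration. The endpoint itself is then created from it with the usual CreateEndpoint API.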

create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
    ProductionVariants=[{
        'InstanceType':'ml.m5.xlarge',
        'InitialInstanceCount':1,
        'InitialVariantWeight':1,
        'ModelName':model_name,
        'VariantName':'AllTrafficVariant'
    }],
    DataCaptureConfig = data_capture_configuration)

On an existing endpoint, I would have used the UpdateEndpoint API to seamlessly update the endpoint configuration.

After invoking the endpoint repeatedly, I can see some captured data in S3 (output was edited for clarity).

$ aws s3 ls --recursive s3://sagemaker-us-west-2-123456789012/sagemaker/DEMO-ModelMonitor/datacapture/DEMO-xgb-churn-pred-model-monitor-2019-11-22-07-59-33/
AllTrafficVariant/2019/11/22/08/24-40-519-9a9273ca-09c2-45d3-96ab-fc7be2402d43.jsonl
AllTrafficVariant/2019/11/22/08/25-42-243-3e1c653b-8809-4a6b-9d51-69ada40bc809.jsonl

Here’s a line from one of these files.

{
    "captureData":{
        "endpointInput":{
            "observedContentType":"text/csv",
            "mode":"INPUT",
            "data":"132,25,113.2,96,269.9,107,229.1,87,7.1,7,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,1",
            "encoding":"CSV"
        },
        "endpointOutput":{
            "observedContentType":"text/csv; charset=utf-8",
            "mode":"OUTPUT",
            "data":"0.01076381653547287",
            "encoding":"CSV"
        }
    },
    "eventMetadata":{
        "eventId":"6ece5c74-7497-43f1-a263-4833557ffd63",
        "inferenceTime":"2019-11-22T08:24:40Z"
    },
    "eventVersion":"0"
}
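
Since each captured file is in JSON Lines format, pulling features and predictions back out only takes the standard `json` module. This is a minimal sketch, assuming the input and output sections are nested under a `captureData` key and that both payloads are CSV-encoded; the sample record is shortened for readability, and `parse_capture_line` is a hypothetical helper name.

```python
import json

# Shortened, illustrative record; real lines carry the full feature vector
line = json.dumps({
    "captureData": {
        "endpointInput": {
            "observedContentType": "text/csv",
            "mode": "INPUT",
            "data": "132,25,113.2,96",
            "encoding": "CSV",
        },
        "endpointOutput": {
            "observedContentType": "text/csv; charset=utf-8",
            "mode": "OUTPUT",
            "data": "0.01076381653547287",
            "encoding": "CSV",
        },
    },
    "eventMetadata": {
        "eventId": "6ece5c74-7497-43f1-a263-4833557ffd63",
        "inferenceTime": "2019-11-22T08:24:40Z",
    },
    "eventVersion": "0",
})


def parse_capture_line(raw):
    """Return (features, prediction, inference_time) from one JSON Lines record."""
    record = json.loads(raw)
    capture = record["captureData"]
    features = [float(v) for v in capture["endpointInput"]["data"].split(",")]
    prediction = float(capture["endpointOutput"]["data"])
    return features, prediction, record["eventMetadata"]["inferenceTime"]


features, prediction, when = parse_capture_line(line)
print(len(features), prediction, when)
```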

Pretty much what I expected. Now, let’s create a baseline for this model.

Creating A Monitoring Baseline
This is a very simple step: pass the location of the baseline data set, and the location where results should be stored.

from processingjob_wrapper import ProcessingJob

processing_job = ProcessingJob(sm_client, role).create(
    job_name, baseline_data_uri, baseline_results_uri)

Once that job is complete, I can see two new objects in S3: one for statistics, and one for constraints.

$ aws s3 ls s3://sagemaker-us-west-2-123456789012/sagemaker/DEMO-ModelMonitor/baselining/results/
constraints.json
statistics.json

The constraints.json file tells me about the inferred schema for the training data set (don’t forget to check it’s accurate). Each feature is typed, and I also get information on whether a feature is always present or not (1.0 means 100% here). Here are the first few lines.

{
  "version" : 0.0,
  "features" : [ {
    "name" : "Churn",
    "inferred_type" : "Integral",
    "completeness" : 1.0
  }, {
    "name" : "Account Length",
    "inferred_type" : "Integral",
    "completeness" : 1.0
  }, {
    "name" : "VMail Message",
    "inferred_type" : "Integral",
    "completeness" : 1.0
  }, {
    "name" : "Day Mins",
    "inferred_type" : "Fractional",
    "completeness" : 1.0
  }, {
    "name" : "Day Calls",
    "inferred_type" : "Integral",
    "completeness" : 1.0

At the end of that file, I can see configuration information for CloudWatch monitoring: turn it on or off, set the drift threshold, etc.

"monitoring_config" : {
    "evaluate_constraints" : "Enabled",
    "emit_metrics" : "Enabled",
    "distribution_constraints" : {
      "enable_comparisons" : true,
      "min_domain_mass" : 1.0,
      "comparison_threshold" : 1.0
    }
  }

The statistics.json file shows different statistics for each feature (mean, standard deviation, quantiles, etc.), as well as unique values received by the endpoint. Here’s an example.

"name" : "Day Mins",
    "inferred_type" : "Fractional",
    "numerical_statistics" : {
      "common" : {
        "num_present" : 2333,
        "num_missing" : 0
      },
      "mean" : 180.22648949849963,
      "sum" : 420468.3999999996,
      "std_dev" : 53.987178959901556,
      "min" : 0.0,
      "max" : 350.8,
      "distribution" : {
        "kll" : {
          "buckets" : [ {
            "lower_bound" : 0.0,
            "upper_bound" : 35.08,
            "count" : 14.0
          }, {
            "lower_bound" : 35.08,
            "upper_bound" : 70.16,
            "count" : 48.0
          }, {
            "lower_bound" : 70.16,
            "upper_bound" : 105.24000000000001,
            "count" : 130.0
          }, {
            "lower_bound" : 105.24000000000001,
            "upper_bound" : 140.32,
            "count" : 318.0
          }, {
            "lower_bound" : 140.32,
            "upper_bound" : 175.4,
            "count" : 565.0
          }, {
            "lower_bound" : 175.4,
            "upper_bound" : 210.48000000000002,
            "count" : 587.0
          }, {
            "lower_bound" : 210.48000000000002,
            "upper_bound" : 245.56,
            "count" : 423.0
          }, {
            "lower_bound" : 245.56,
            "upper_bound" : 280.64,
            "count" : 180.0
          }, {
            "lower_bound" : 280.64,
            "upper_bound" : 315.72,
            "count" : 58.0
          }, {
            "lower_bound" : 315.72,
            "upper_bound" : 350.8,
            "count" : 10.0
          } ],
          "sketch" : {
            "parameters" : {
              "c" : 0.64,
              "k" : 2048.0
            },
            "data" : [ [ 178.1, 160.3, 197.1, 105.2, 283.1, 113.6, 232.1, 212.7, 73.3, 176.9, 161.9, 128.6, 190.5, 223.2, 157.9, 173.1, 273.5, 275.8, 119.2, 174.6, 133.3, 145.0, 150.6, 220.2, 109.7, 155.4, 172.0, 235.6, 218.5, 92.7, 90.7, 162.3, 146.5, 210.1, 214.4, 194.4, 237.3, 255.9, 197.9, 200.2, 120, ...

Now, let’s start monitoring our endpoint.

Monitoring An Endpoint
Again, one API call is all that it takes: I simply create a monitoring schedule for my endpoint, passing the constraints and statistics files for the baseline data set. Optionally, I could also pass preprocessing and postprocessing functions, should I want to tweak data and predictions.

ms = MonitoringSchedule(sm_client, role)
schedule = ms.create(
   mon_schedule_name, 
   endpoint_name, 
   s3_report_path, 
   # record_preprocessor_source_uri=s3_code_preprocessor_uri, 
   # post_analytics_source_uri=s3_code_postprocessor_uri,
   baseline_statistics_uri=baseline_results_uri + '/statistics.json',
   baseline_constraints_uri=baseline_results_uri+ '/constraints.json'
)

Then, I start sending bogus data to the endpoint, i.e. samples constructed from random values, and I wait for SageMaker Model Monitor to start generating reports. The suspense is killing me!

Inspecting Reports
Quickly, I see that reports are available in S3.

mon_executions = sm_client.list_monitoring_executions(MonitoringScheduleName=mon_schedule_name, MaxResults=3)
for execution_summary in mon_executions['MonitoringExecutionSummaries']:
    print("ProcessingJob: {}".format(execution_summary['ProcessingJobArn'].split('/')[1]))
    print('MonitoringExecutionStatus: {} \n'.format(execution_summary['MonitoringExecutionStatus']))

ProcessingJob: model-monitoring-201911221050-df2c7fc4
MonitoringExecutionStatus: Completed 

ProcessingJob: model-monitoring-201911221040-3a738dd7
MonitoringExecutionStatus: Completed 

ProcessingJob: model-monitoring-201911221030-83f15fb9
MonitoringExecutionStatus: Completed 
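
The `split('/')[1]` call above simply takes the resource name at the end of the ARN. Here is the same extraction as a standalone sketch; the ARN below is an illustrative value built from the placeholder region and account ID used in this post, and `job_name_from_arn` is a hypothetical helper name.

```python
# Illustrative ARN; the region and account ID are placeholders
arn = ("arn:aws:sagemaker:us-west-2:123456789012:"
       "processing-job/model-monitoring-201911221050-df2c7fc4")


def job_name_from_arn(processing_job_arn):
    """Return the part after the first '/', i.e. the processing job name."""
    return processing_job_arn.split("/", 1)[1]


print(job_name_from_arn(arn))
```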

Let’s find the reports for one of these monitoring jobs.

desc_analytics_job_result=sm_client.describe_processing_job(ProcessingJobName=job_name)
report_uri=desc_analytics_job_result['ProcessingOutputConfig']['Outputs'][0]['S3Output']['S3Uri']
print('Report Uri: {}'.format(report_uri))

Report Uri: s3://sagemaker-us-west-2-123456789012/sagemaker/DEMO-ModelMonitor/reports/2019112208-2019112209

Ok, so what do we have here?

$ aws s3 ls s3://sagemaker-us-west-2-123456789012/sagemaker/DEMO-ModelMonitor/reports/2019112208-2019112209/

constraint_violations.json
constraints.json
statistics.json

As you would expect, the constraints.json and statistics.json files contain schema and statistics information on the data samples processed by the monitoring job. Let’s directly open the third one, constraint_violations.json!

"violations" : [ {
    "feature_name" : "State_AL",
    "constraint_check_type" : "data_type_check",
    "description" : "Value: 0.8 does not meet the constraint requirement! "
  }, {
    "feature_name" : "Eve Mins",
    "constraint_check_type" : "baseline_drift_check",
    "description" : "Numerical distance: 0.2711598746081505 exceeds numerical threshold: 0"
  }, {
    "feature_name" : "CustServ Calls",
    "constraint_check_type" : "baseline_drift_check",
    "description" : "Numerical distance: 0.6470588235294117 exceeds numerical threshold: 0"
  }

Oops! It looks like I’ve been assigning floating point values to integer features: surely that’s not going to work too well!
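
To show the kind of test behind a data_type_check violation, here is a minimal sketch in plain Python that compares one record with the inferred schema. It is not Model Monitor's implementation: `check_types` is a hypothetical helper that only handles the Integral and Fractional types and the completeness field shown in the constraints.json excerpt.

```python
def check_types(record, constraints):
    """Compare one record (feature name -> value) with inferred types.

    Returns a list of (feature_name, problem) tuples.
    """
    problems = []
    for feature in constraints["features"]:
        name = feature["name"]
        if name not in record or record[name] is None:
            # A completeness of 1.0 means the feature was always present
            if feature["completeness"] == 1.0:
                problems.append((name, "missing value"))
            continue
        value = record[name]
        if feature["inferred_type"] == "Integral" and float(value) != int(value):
            problems.append((name, f"value {value} is not integral"))
    return problems


constraints = {"features": [
    {"name": "State_AL", "inferred_type": "Integral", "completeness": 1.0},
    {"name": "Day Mins", "inferred_type": "Fractional", "completeness": 1.0},
]}
print(check_types({"State_AL": 0.8, "Day Mins": 180.2}, constraints))
```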

Some features are also exhibiting drift, and that’s not good either. Maybe something is wrong with my data ingestion process, or maybe the distribution of data has actually changed, and I need to retrain the model. As all this information is available as CloudWatch metrics, I could define thresholds, set alarms and even trigger new training jobs automatically.
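
For intuition on what a numerical distance between distributions can look like, the sketch below computes the largest gap between two cumulative distributions defined over the same buckets. This is one common choice, and I am not claiming it is the exact metric Model Monitor uses. The baseline counts are the "Day Mins" buckets from the baseline statistics.json; the observed counts are made up, mimicking random inputs spread evenly over the range.

```python
from itertools import accumulate


def cdf(counts):
    """Turn bucket counts into a cumulative distribution (fractions of the total)."""
    total = sum(counts)
    return [c / total for c in accumulate(counts)]


def max_cdf_distance(baseline_counts, observed_counts):
    """Largest absolute gap between two CDFs defined over the same buckets."""
    pairs = zip(cdf(baseline_counts), cdf(observed_counts))
    return max(abs(b - o) for b, o in pairs)


# Baseline bucket counts for "Day Mins", taken from the baseline statistics
baseline = [14, 48, 130, 318, 565, 587, 423, 180, 58, 10]
# Hypothetical observed counts: random inputs spread evenly over the same buckets
observed = [100] * 10

distance = max_cdf_distance(baseline, observed)
print(round(distance, 3))
```

A distance of zero means both distributions fill the buckets in the same proportions; the further the observed data strays from the bell-shaped baseline, the larger the value.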

Now Available!
As you can see, Amazon SageMaker Model Monitor is easy to set up, and helps you quickly know about quality issues in your ML models.

Now it’s your turn: you can start using Amazon SageMaker Model Monitor today in all commercial regions where Amazon SageMaker is available. This capability is also integrated in Amazon SageMaker Studio, our workbench for ML projects. Last but not least, all information can be viewed and further analyzed in a notebook.

Give it a try and please send us feedback, either on the AWS forum for Amazon SageMaker, or through your usual AWS support contacts.

– Julien

Amazon SageMaker Processing – Fully Managed Data Processing and Model Evaluation

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/amazon-sagemaker-processing-fully-managed-data-processing-and-model-evaluation/

Today, we’re extremely happy to launch Amazon SageMaker Processing, a new capability of Amazon SageMaker that lets you easily run your preprocessing, postprocessing and model evaluation workloads on fully managed infrastructure.

Training an accurate machine learning (ML) model requires many different steps, but none is potentially more important than preprocessing your data set, e.g.:

  • Converting the data set to the input format expected by the ML algorithm you’re using,
  • Transforming existing features to a more expressive representation, such as one-hot encoding categorical features,
  • Rescaling or normalizing numerical features,
  • Engineering high level features, e.g. replacing mailing addresses with GPS coordinates,
  • Cleaning and tokenizing text for natural language processing applications,
  • And more!

These tasks involve running bespoke scripts on your data set (beneath a moonless sky, I’m told), and saving the processed version for later use by your training jobs. As you can guess, running them manually or having to build and scale automation tools is not an exciting prospect for ML teams. The same could be said about postprocessing jobs (filtering, collating, etc.) and model evaluation jobs (scoring models against different test sets).

Solving this problem is why we built Amazon SageMaker Processing. Let me tell you more.

Introducing Amazon SageMaker Processing
Amazon SageMaker Processing introduces a new Python SDK that lets data scientists and ML engineers easily run preprocessing, postprocessing and model evaluation workloads on Amazon SageMaker.

This SDK uses SageMaker’s built-in container for scikit-learn, possibly the most popular library for data set transformation.

If you need something else, you also have the ability to use your own Docker images without having to conform to any Docker image specification: this gives you maximum flexibility in running any code you want, whether on SageMaker Processing, on AWS container services like Amazon ECS and Amazon Elastic Kubernetes Service, or even on premises.

How about a quick demo with scikit-learn? Then, I’ll briefly discuss using your own container. Of course, you’ll find complete examples on GitHub.

Preprocessing Data With The Built-In Scikit-Learn Container
Here’s how to use the SageMaker Processing SDK to run your scikit-learn jobs.

First, let’s create an SKLearnProcessor object, passing the scikit-learn version we want to use, as well as our managed infrastructure requirements.

from sagemaker.sklearn.processing import SKLearnProcessor
sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
                                     role=role,
                                     instance_count=1,
                                     instance_type='ml.m5.xlarge')

Then, we can run our preprocessing script (more on this fellow in a minute) like so:

  • The data set (dataset.csv) is automatically copied inside the container under the destination directory (/opt/ml/processing/input). We could add additional inputs if needed.
  • This is where the Python script (preprocessing.py) reads it. Optionally, we could pass command line arguments to the script.
  • It preprocesses it, splits it three ways, and saves the files inside the container under /opt/ml/processing/output/train, /opt/ml/processing/output/validation, and /opt/ml/processing/output/test.
  • Once the job completes, all outputs are automatically copied to your default SageMaker bucket in S3.

from sagemaker.processing import ProcessingInput, ProcessingOutput
sklearn_processor.run(
    code='preprocessing.py',
    # arguments = ['arg1', 'arg2'],
    inputs=[ProcessingInput(
        source='dataset.csv',
        destination='/opt/ml/processing/input')],
    outputs=[ProcessingOutput(source='/opt/ml/processing/output/train'),
        ProcessingOutput(source='/opt/ml/processing/output/validation'),
        ProcessingOutput(source='/opt/ml/processing/output/test')]
)

That’s it! Let’s put everything together by looking at the skeleton of the preprocessing script.

import os
import pandas as pd
from sklearn.model_selection import train_test_split
# Read data locally
df = pd.read_csv('/opt/ml/processing/input/dataset.csv')
# Preprocess the data set (placeholder for your own preprocessing function)
downsampled = apply_mad_data_science_skills(df)
# Split data set into training, validation, and test
train, test = train_test_split(downsampled, test_size=0.2)
train, validation = train_test_split(train, test_size=0.2)
# Create local output directories (no error if they already exist)
for split in ('train', 'validation', 'test'):
    os.makedirs('/opt/ml/processing/output/' + split, exist_ok=True)
# Save data locally
train.to_csv("/opt/ml/processing/output/train/train.csv")
validation.to_csv("/opt/ml/processing/output/validation/validation.csv")
test.to_csv("/opt/ml/processing/output/test/test.csv")
print('Finished running processing job')
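
Because the second call to train_test_split works on what is left after the first one, the two test_size=0.2 values do not give an 80/10/10 split. The short sketch below works out the resulting proportions in plain Python; the split_fractions helper is hypothetical and not part of scikit-learn.

```python
def split_fractions(test_size, validation_size):
    """Return (train, validation, test) fractions of the full data set.

    test_size is carved out of the full data set first; validation_size
    is then carved out of the remaining training portion.
    """
    remaining = 1.0 - test_size
    validation = remaining * validation_size
    train = remaining - validation
    return train, validation, test_size


train, validation, test = split_fractions(0.2, 0.2)
print(f"train={train:.2f} validation={validation:.2f} test={test:.2f}")
```

With the values used in the script, roughly 64% of the rows end up in the training set, 16% in the validation set, and 20% in the test set.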

A quick look at the S3 bucket confirms that files have been successfully processed and saved. Now I could use them directly as input for a SageMaker training job.

$ aws s3 ls --recursive s3://sagemaker-us-west-2-123456789012/sagemaker-scikit-learn-2019-11-20-13-57-17-805/output
2019-11-20 15:03:22 19967 sagemaker-scikit-learn-2019-11-20-13-57-17-805/output/test.csv
2019-11-20 15:03:22 64998 sagemaker-scikit-learn-2019-11-20-13-57-17-805/output/train.csv
2019-11-20 15:03:22 18058 sagemaker-scikit-learn-2019-11-20-13-57-17-805/output/validation.csv

Now what about using your own container?

Processing Data With Your Own Container
Let’s say you’d like to preprocess text data with the popular spaCy library. Here’s how you could define a vanilla Docker container for it.

FROM python:3.7-slim-buster
# Install spaCy, pandas, and an English language model for spaCy.
RUN pip3 install spacy==2.2.2 && pip3 install pandas==0.25.3
RUN python3 -m spacy download en_core_web_md
# Make sure python doesn't buffer stdout so we get logs ASAP.
ENV PYTHONUNBUFFERED=TRUE
ENTRYPOINT ["python3"]

Then, you would build the Docker container, test it locally, and push it to Amazon Elastic Container Registry, our managed Docker registry service.

The next step would be to configure a processing job using the ScriptProcessor object, passing the name of the container you built and pushed.

from sagemaker.processing import ScriptProcessor
script_processor = ScriptProcessor(image_uri='123456789012.dkr.ecr.us-west-2.amazonaws.com/sagemaker-spacy-container:latest',
                role=role,
                instance_count=1,
                instance_type='ml.m5.xlarge')

Finally, you would run the job just like in the previous example.

script_processor.run(code='spacy_script.py',
    inputs=[ProcessingInput(
        source='dataset.csv',
        destination='/opt/ml/processing/input_data')],
    outputs=[ProcessingOutput(source='/opt/ml/processing/processed_data')],
    arguments=['tokenizer', 'lemmatizer', 'pos-tagger']
)

The rest of the process is exactly the same as above: copy the input(s) inside the container, copy the output(s) from the container to S3.
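
Since the container's entry point is python3, the strings passed in arguments reach spacy_script.py as ordinary command line arguments. The post does not show that script, so the following is only a sketch of how it might read them with argparse; the steps argument and the parse_steps helper are assumptions made for illustration.

```python
import argparse

# Directories taken from the ProcessingInput / ProcessingOutput definitions above.
INPUT_DIR = '/opt/ml/processing/input_data'
OUTPUT_DIR = '/opt/ml/processing/processed_data'


def parse_steps(argv=None):
    """Parse the positional arguments forwarded by the processing job.

    With argv=None, argparse reads sys.argv, which is what the real script would do.
    """
    parser = argparse.ArgumentParser(description='Hypothetical spaCy preprocessing script')
    parser.add_argument('steps', nargs='*',
                        help='names of the text processing steps to apply')
    return parser.parse_args(argv).steps


# Simulate the arguments used in the run() call above.
steps = parse_steps(['tokenizer', 'lemmatizer', 'pos-tagger'])
print('Reading from', INPUT_DIR)
print('Requested steps:', steps)
print('Writing to', OUTPUT_DIR)
```

Inside the container, the script would call parse_steps() without arguments and then apply whatever processing each step name stands for.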

Pretty simple, don’t you think? Again, I focused on preprocessing, but you can run similar jobs for postprocessing and model evaluation. Don’t forget to check out the examples on GitHub.

Now Available!
Amazon SageMaker Processing is available today in all commercial regions where Amazon SageMaker is available.

Give it a try and please send us feedback, either on the AWS forum for Amazon SageMaker, or through your usual AWS support contacts.

Julien

Amazon SageMaker Autopilot – Automatically Create High-Quality Machine Learning Models With Full Control And Visibility

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/amazon-sagemaker-autopilot-fully-managed-automatic-machine-learning/

Today, we’re extremely happy to launch Amazon SageMaker Autopilot to automatically create the best classification and regression machine learning models, while allowing full control and visibility.

In 1959, Arthur Samuel defined machine learning as the ability for computers to learn without being explicitly programmed. In practice, this means finding an algorithm that can extract patterns from an existing data set, and use these patterns to build a predictive model that will generalize well to new data. Since then, lots of machine learning algorithms have been invented, giving scientists and engineers plenty of options to choose from, and helping them build amazing applications.

However, this abundance of algorithms also creates a difficulty: which one should you pick? How can you reliably figure out which one will perform best on your specific business problem? In addition, machine learning algorithms usually have a long list of training parameters (also called hyperparameters) that need to be set “just right” if you want to squeeze every bit of extra accuracy from your models. To make things worse, algorithms also require data to be prepared and transformed in specific ways (aka feature engineering) for optimal learning… and you need to pick the best instance type.

If you think this sounds like a lot of experimental, trial and error work, you’re absolutely right. Machine learning is definitely a mix of hard science and cooking recipes, making it difficult for non-experts to get good results quickly.

What if you could rely on a fully managed service to solve that problem for you? Call an API and get the job done? Enter Amazon SageMaker Autopilot.

Introducing Amazon SageMaker Autopilot
Using a single API call, or a few clicks in Amazon SageMaker Studio, SageMaker Autopilot first inspects your data set, and runs a number of candidates to figure out the optimal combination of data preprocessing steps, machine learning algorithms and hyperparameters. Then, it uses this combination to train an Inference Pipeline, which you can easily deploy either on a real-time endpoint or for batch processing. As usual with Amazon SageMaker, all of this takes place on fully-managed infrastructure.

Last but not least, SageMaker Autopilot also generates Python code showing you exactly how data was preprocessed: not only can you understand what SageMaker Autopilot did, you can also reuse that code for further manual tuning if you’re so inclined.

As of today, SageMaker Autopilot supports:

  • Input data in tabular format, with automatic data cleaning and preprocessing,
  • Automatic algorithm selection for linear regression, binary classification, and multi-class classification,
  • Automatic hyperparameter optimization,
  • Distributed training,
  • Automatic instance and cluster size selection.

Let me show you how simple this is.

Using AutoML with Amazon SageMaker Autopilot
Let’s use this sample notebook as a starting point: it builds a binary classification model predicting if customers will accept or decline a marketing offer. Please take a few minutes to read it: as you will see, the business problem itself is easy to understand, and the data set is neither large nor complicated. Yet, several non-intuitive preprocessing steps are required, and there’s also the delicate matter of picking an algorithm and its parameters… SageMaker Autopilot to the rescue!

First, I grab a copy of the data set, and take a quick look at the first few lines.

Then, I upload it to Amazon Simple Storage Service (S3) without any preprocessing whatsoever.

sess.upload_data(path="automl-train.csv", key_prefix=prefix + "/input")

's3://sagemaker-us-west-2-123456789012/sagemaker/DEMO-automl-dm/input/automl-train.csv'

Now, let’s configure the AutoML job:

  • Set the location of the data set,
  • Select the target attribute that I want the model to predict: in this case, it’s the ‘y’ column showing if a customer accepted the offer or not,
  • Set the location of training artifacts.

input_data_config = [{
      'DataSource': {
        'S3DataSource': {
          'S3DataType': 'S3Prefix',
          'S3Uri': 's3://{}/{}/input'.format(bucket,prefix)
        }
      },
      'TargetAttributeName': 'y'
    }
  ]

output_data_config = {
    'S3OutputPath': 's3://{}/{}/output'.format(bucket,prefix)
  }

That’s it! Of course, SageMaker Autopilot has a number of options that will come in handy as you learn more about your data and your models, e.g.:

  • Set the type of problem you want to train on: linear regression, binary classification, or multi-class classification. If you’re not sure, SageMaker Autopilot will figure it out automatically by analyzing the values of the target attribute.
  • Use a specific metric for model evaluation.
  • Define completion criteria: maximum running time, etc.
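
The options above correspond to optional parameters of the create_auto_ml_job call. The sketch below only builds the dictionary of extra keyword arguments, so it runs without AWS credentials; the optional_automl_settings helper is hypothetical, and the values passed to it are illustrative rather than the settings used in this post.

```python
def optional_automl_settings(problem_type, metric_name, max_candidates, max_runtime_seconds):
    """Build the optional keyword arguments for create_auto_ml_job."""
    return {
        # Type of problem: 'Regression', 'BinaryClassification' or 'MulticlassClassification'.
        'ProblemType': problem_type,
        # Metric used to rank candidates.
        'AutoMLJobObjective': {'MetricName': metric_name},
        # Completion criteria for the whole job.
        'AutoMLJobConfig': {
            'CompletionCriteria': {
                'MaxCandidates': max_candidates,
                'MaxAutoMLJobRuntimeInSeconds': max_runtime_seconds,
            }
        },
    }


# Illustrative values only.
extra_settings = optional_automl_settings('BinaryClassification', 'Accuracy', 50, 3600)
print(extra_settings)
```

The resulting dictionary could then be unpacked into the API call with **extra_settings, next to the job name, input and output configuration, and role.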

One thing I don’t have to do is size the training cluster, as SageMaker Autopilot uses a heuristic based on data size and algorithm. Pretty cool!

With configuration out of the way, I can fire up the job with the CreateAutoMLJob API.

auto_ml_job_name = 'automl-dm-' + timestamp_suffix
print('AutoMLJobName: ' + auto_ml_job_name)

sm.create_auto_ml_job(AutoMLJobName=auto_ml_job_name,
                      InputDataConfig=input_data_config,
                      OutputDataConfig=output_data_config,
                      RoleArn=role)

AutoMLJobName: automl-dm-28-10-17-49

A job runs in four steps (you can use the DescribeAutoMLJob API to view them).

  1. Splitting the data set into train and validation sets,
  2. Analyzing data, in order to recommend pipelines that should be tried out on the data set,
  3. Feature engineering, where transformations are applied to the data set and to individual features,
  4. Pipeline selection and hyperparameter tuning, where the top performing pipeline is selected along with the optimal hyperparameters for the training algorithm.
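
One way to follow these steps is to poll the job description until the job leaves its in-progress state. The sketch below is a minimal version of that loop, reading the AutoMLJobStatus and AutoMLJobSecondaryStatus fields of the response. To keep it runnable offline, it uses a FakeSageMakerClient class, a stand-in invented here that replays canned responses; with a real account you would pass the boto3 SageMaker client instead.

```python
import time


def wait_for_automl_job(sm_client, job_name, poll_seconds=30, sleep=time.sleep):
    """Poll the job description until the job is no longer in progress.

    Returns the final status and the list of secondary statuses seen on the way.
    """
    seen = []
    while True:
        response = sm_client.describe_auto_ml_job(AutoMLJobName=job_name)
        status = response['AutoMLJobStatus']
        secondary = response['AutoMLJobSecondaryStatus']
        if not seen or seen[-1] != secondary:
            seen.append(secondary)
            print(status, '-', secondary)
        if status not in ('InProgress', 'Stopping'):
            return status, seen
        sleep(poll_seconds)


class FakeSageMakerClient:
    """Offline stand-in for the SageMaker client, for demonstration only."""

    def __init__(self):
        self._responses = iter([
            ('InProgress', 'AnalyzingData'),
            ('InProgress', 'FeatureEngineering'),
            ('InProgress', 'ModelTuning'),
            ('Completed', 'MaxCandidatesReached'),
        ])

    def describe_auto_ml_job(self, AutoMLJobName):
        status, secondary = next(self._responses)
        return {'AutoMLJobStatus': status, 'AutoMLJobSecondaryStatus': secondary}


# Hypothetical job name; sleeping is disabled so the demo returns immediately.
final_status, phases = wait_for_automl_job(
    FakeSageMakerClient(), 'automl-dm-example', sleep=lambda seconds: None)
```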

Once the maximum number of candidates – or one of the stopping conditions – has been reached, the job is complete. I can get detailed information on all candidates using the ListCandidatesForAutoMLJob API, and also view them in the AWS console.

candidates = sm.list_candidates_for_auto_ml_job(AutoMLJobName=auto_ml_job_name, SortBy='FinalObjectiveMetricValue')['Candidates']
index = 1
for candidate in candidates:
  print (str(index) + "  " + candidate['CandidateName'] + "  " + str(candidate['FinalAutoMLJobObjectiveMetric']['Value']))
  index += 1

1 automl-dm-28-tuning-job-1-fabb8-001-f3b6dead 0.9186699986457825
2 automl-dm-28-tuning-job-1-fabb8-004-03a1ff8a 0.918304979801178
3 automl-dm-28-tuning-job-1-fabb8-003-c443509a 0.9181839823722839
4 automl-dm-28-tuning-job-1-ed07c-006-96f31fde 0.9158779978752136
5 automl-dm-28-tuning-job-1-ed07c-004-da2d99af 0.9130859971046448
6 automl-dm-28-tuning-job-1-ed07c-005-1e90fd67 0.9130859971046448
7 automl-dm-28-tuning-job-1-ed07c-008-4350b4fa 0.9119930267333984
8 automl-dm-28-tuning-job-1-ed07c-007-dae75982 0.9119930267333984
9 automl-dm-28-tuning-job-1-ed07c-009-c512379e 0.9119930267333984
10 automl-dm-28-tuning-job-1-ed07c-010-d905669f 0.8873512744903564

For now, I’m only interested in the best trial: 91.87% validation accuracy. Let’s deploy it to a SageMaker endpoint, just like we would deploy any model:

model_arn = sm.create_model(Containers=best_candidate['InferenceContainers'],
                            ModelName=model_name,
                            ExecutionRoleArn=role)

ep_config = sm.create_endpoint_config(EndpointConfigName = epc_name,
                                      ProductionVariants=[{'InstanceType':'ml.m5.2xlarge',
                                                           'InitialInstanceCount':1,
                                                           'ModelName':model_name,
                                                           'VariantName':variant_name}])

create_endpoint_response = sm.create_endpoint(EndpointName=ep_name,
                                              EndpointConfigName=epc_name)

After a few minutes, the endpoint is live, and I can use it for prediction. SageMaker business as usual!
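
As a rough illustration of that prediction step, the sketch below sends one row of features to an endpoint as CSV through the invoke_endpoint call of the SageMaker runtime client. Several things here are assumptions: the text/csv content type, the predict_csv helper, the feature values (they are not the real columns of the data set), and the FakeRuntimeClient class, which stands in for the real client so that the example runs offline.

```python
import csv
import io


def predict_csv(runtime_client, endpoint_name, features):
    """Send one row of features as CSV and return the raw prediction text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator='\n').writerow(features)
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType='text/csv',
        Body=buffer.getvalue().encode('utf-8'))
    return response['Body'].read().decode('utf-8').strip()


class FakeRuntimeClient:
    """Offline stand-in for the SageMaker runtime client; always answers 'no'."""

    def __init__(self):
        self.last_body = None

    def invoke_endpoint(self, EndpointName, ContentType, Body):
        self.last_body = Body
        return {'Body': io.BytesIO(b'no\n')}


fake_runtime = FakeRuntimeClient()
# Placeholder endpoint name and feature values, for illustration only.
prediction = predict_csv(fake_runtime, 'example-endpoint', [35, 'technician', 'married'])
print(prediction)
```

With a real endpoint, the client would come from boto3.client('sagemaker-runtime') and the row would follow the column order of the training data, minus the target column.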

Now, I bet you’re curious about how the model was built, and what the other candidates are. Let me show you.

Full Visibility And Control with Amazon SageMaker Autopilot
SageMaker Autopilot stores training artifacts in S3, including two auto-generated notebooks!

job = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
job_data_notebook = job['AutoMLJobArtifacts']['DataExplorationNotebookLocation']
job_candidate_notebook = job['AutoMLJobArtifacts']['CandidateDefinitionNotebookLocation']

print(job_data_notebook)
print(job_candidate_notebook)

s3://<PREFIX_REMOVED>/notebooks/SageMakerAutopilotDataExplorationNotebook.ipynb
s3://<PREFIX_REMOVED>/notebooks/SageMakerAutopilotCandidateDefinitionNotebook.ipynb

The first one contains information about the data set.

The second one contains full details on the SageMaker Autopilot job: candidates, data preprocessing steps, etc. All code is available, as well as ‘knobs’ you can change for further experimentation.
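
To open these notebooks locally, the S3 locations first need to be split into a bucket and a key. Here is a small sketch using only the standard library; split_s3_uri is a hypothetical helper, and the bucket and prefix in the example are placeholders, since the real prefix is elided in the output above.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != 's3' or not parsed.netloc:
        raise ValueError('not an S3 URI: ' + uri)
    return parsed.netloc, parsed.path.lstrip('/')


# Placeholder location, not the one produced by the job.
bucket, key = split_s3_uri(
    's3://example-bucket/example-prefix/notebooks/SageMakerAutopilotDataExplorationNotebook.ipynb')
print(bucket)
print(key)
```

The pair can then be handed to the download_file method of the boto3 S3 client, along with a local file name.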

As you can see, you have full control and visibility on how models are built.

Now Available!
I’m very excited about Amazon SageMaker Autopilot, because it’s making machine learning simpler and more accessible than ever. Whether you’re just beginning with machine learning, or whether you’re a seasoned practitioner, SageMaker Autopilot will help you build better models quicker.

Now it’s your turn. You can start using SageMaker Autopilot today in the following regions:

  • US East (N. Virginia), US East (Ohio), US West (N. California), US West (Oregon),
  • Canada (Central), South America (São Paulo),
  • Europe (Ireland), Europe (London), Europe (Paris), Europe (Frankfurt),
  • Middle East (Bahrain),
  • Asia Pacific (Mumbai), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo).

Please send us feedback, either on the AWS forum for Amazon SageMaker, or through your usual AWS support contacts.

Julien

Amazon SageMaker Experiments – Organize, Track And Compare Your Machine Learning Trainings

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/amazon-sagemaker-experiments-organize-track-and-compare-your-machine-learning-trainings/

Today, we’re extremely happy to announce Amazon SageMaker Experiments, a new capability of Amazon SageMaker that lets you organize, track, compare and evaluate machine learning (ML) experiments and model versions.

ML is a highly iterative process. During the course of a single project, data scientists and ML engineers routinely train thousands of different models in search of maximum accuracy. Indeed, the number of combinations for algorithms, data sets, and training parameters (aka hyperparameters) is infinite… and therein lies the proverbial challenge of finding a needle in a haystack.

Tools like Automatic Model Tuning and Amazon SageMaker Autopilot help ML practitioners explore a large number of combinations automatically, and quickly zoom in on high-performance models. However, they further add to the explosive growth of training jobs. Over time, this creates a new difficulty for ML teams, as it becomes near-impossible to efficiently deal with hundreds of thousands of jobs: keeping track of metrics, grouping jobs by experiment, comparing jobs in the same experiment or across experiments, querying past jobs, etc.

Of course, this can be solved by building, managing and scaling bespoke tools: however, doing so diverts valuable time and resources away from actual ML work. In the spirit of helping customers focus on ML and nothing else, we couldn’t leave this problem unsolved.

Introducing Amazon SageMaker Experiments
First, let’s define core concepts:

  • A trial is a collection of training steps involved in a single training job. Training steps typically include preprocessing, training, model evaluation, etc. A trial is also enriched with metadata for inputs (e.g. algorithm, parameters, data sets) and outputs (e.g. models, checkpoints, metrics).
  • An experiment is simply a collection of trials, i.e. a group of related training jobs.

The goal of SageMaker Experiments is to make it as simple as possible to create experiments, populate them with trials, and run analytics across trials and experiments. For this purpose, we introduce a new Python SDK containing logging and analytics APIs.

When you run your training jobs on SageMaker or SageMaker Autopilot, all you have to do is pass an extra parameter to the Estimator, defining the name of the experiment that this trial should be attached to. All inputs and outputs will be logged automatically.

Once you’ve run your training jobs, the SageMaker Experiments SDK lets you load experiment and trial data in the popular pandas dataframe format. Pandas truly is the Swiss army knife of ML practitioners, and you’ll be able to perform any analysis that you may need. Go one step further by building cool visualizations with matplotlib, and you’ll be well on your way to taming that wild horde of training jobs!

As you would expect, SageMaker Experiments is nicely integrated in Amazon SageMaker Studio. You can run complex queries to quickly find the past trial you’re looking for. You can also visualize real-time model leaderboards and metric charts.

How about a quick demo?

Logging Training Information With Amazon SageMaker Experiments
Let’s start from a PyTorch script classifying images from the MNIST data set, using a simple two-layer convolutional neural network (CNN). If I wanted to run a single job on SageMaker, I could use the PyTorch estimator like so:

from sagemaker.pytorch import PyTorch
estimator = PyTorch(
    entry_point='mnist.py',
    role=role,
    sagemaker_session=sess,
    framework_version='1.1.0',
    train_instance_count=1,
    train_instance_type='ml.p3.2xlarge')

estimator.fit(inputs={'training': inputs})

Instead, let’s say that I want to run multiple versions of the same script, changing only one of the hyperparameters (the number of convolution filters used by the two convolution layers, aka number of hidden channels) to measure its impact on model accuracy. Of course, we could run these jobs, grab the training logs, extract metrics with fancy text filtering, etc. Or we could use SageMaker Experiments!

All I need to do is:

  • Set up an experiment,
  • Use a tracker to log experiment metadata,
  • Create a trial for each training job I want to run,
  • Run each training job, passing parameters for the experiment name and the trial name.

First things first, let’s take care of the experiment.

from smexperiments.experiment import Experiment
mnist_experiment = Experiment.create(
    experiment_name="mnist-hand-written-digits-classification", 
    description="Classification of mnist hand-written digits", 
    sagemaker_boto_client=sm)

Then, let’s add a few things that we want to keep track of, like the location of the data set and normalization values we applied to it.

from smexperiments.tracker import Tracker
with Tracker.create(display_name="Preprocessing", sagemaker_boto_client=sm) as tracker:
    tracker.log_input(name="mnist-dataset", media_type="s3/uri", value=inputs)
    tracker.log_parameters({
        "normalization_mean": 0.1307,
        "normalization_std": 0.3081,
    })

Now let’s run a few jobs. I simply loop over the different values that I want to try, creating a new trial for each training job and adding the tracker information to it.

import time
from smexperiments.trial import Trial
for i, num_hidden_channels in enumerate([2, 5, 10, 20, 32]):
    trial_name = f"cnn-training-job-{num_hidden_channels}-hidden-channels-{int(time.time())}"
    cnn_trial = Trial.create(
        trial_name=trial_name, 
        experiment_name=mnist_experiment.experiment_name,
        sagemaker_boto_client=sm,
    )
    cnn_trial.add_trial_component(tracker.trial_component)

Then, I configure the estimator, passing the value for the hyperparameter I’m interested in, and leaving the other ones as is. I’m also passing regular expressions to extract metrics from the training log. All of these will be stored in the trial: in fact, all parameters (passed or default) will be.

    estimator = PyTorch(
        entry_point='mnist.py',
        role=role,
        sagemaker_session=sess,
        framework_version='1.1.0',
        train_instance_count=1,
        train_instance_type='ml.p3.2xlarge',
        hyperparameters={
            'hidden_channels': num_hidden_channels
        },
        metric_definitions=[
            {'Name':'train:loss', 'Regex':'Train Loss: (.*?);'},
            {'Name':'test:loss', 'Regex':'Test Average loss: (.*?),'},
            {'Name':'test:accuracy', 'Regex':'Test Accuracy: (.*?)%;'}
        ]
    )
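
To see what these regular expressions capture, here is a small standalone sketch that applies them to a couple of log lines with Python's re module. The log lines are made up to match the format the expressions expect, since the output of mnist.py is not shown here, and extract_metrics is a hypothetical helper rather than anything SageMaker provides.

```python
import re

metric_definitions = [
    {'Name': 'train:loss', 'Regex': 'Train Loss: (.*?);'},
    {'Name': 'test:loss', 'Regex': 'Test Average loss: (.*?),'},
    {'Name': 'test:accuracy', 'Regex': 'Test Accuracy: (.*?)%;'},
]

# Made-up log lines in the format the regular expressions expect.
log_lines = [
    'Train Loss: 0.2345;',
    'Test Average loss: 0.0512, Test Accuracy: 98%;',
]


def extract_metrics(lines, definitions):
    """Return (metric name, value) pairs found in the log lines, in order."""
    found = []
    for line in lines:
        for definition in definitions:
            match = re.search(definition['Regex'], line)
            if match:
                found.append((definition['Name'], float(match.group(1))))
    return found


metrics = extract_metrics(log_lines, metric_definitions)
print(metrics)
```

Each expression has a single capture group, and the text matched by that group is what gets recorded as the metric value.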

Finally, I run the training job, associating it to the experiment and the trial.

    cnn_training_job_name = "cnn-training-job-{}".format(int(time.time()))
    
    estimator.fit(
        inputs={'training': inputs}, 
        job_name=cnn_training_job_name,
        experiment_config={
            "ExperimentName": mnist_experiment.experiment_name, 
            "TrialName": cnn_trial.trial_name,
            "TrialComponentDisplayName": "Training",
        }
    )
# end of loop

Once all jobs are complete, I can run analytics. Let’s find out how we did.

Analytics with Amazon SageMaker Experiments
All information on an experiment can be easily exported to a Pandas DataFrame.

from sagemaker.analytics import ExperimentAnalytics
trial_component_analytics = ExperimentAnalytics(
    sagemaker_session=sess, 
    experiment_name=mnist_experiment.experiment_name
)
analytic_table = trial_component_analytics.dataframe()

If I want to drill down, I can specify additional parameters, e.g.:

trial_component_analytics = ExperimentAnalytics(
    sagemaker_session=sess, 
    experiment_name=mnist_experiment.experiment_name,
    sort_by="metrics.test:accuracy.max",
    sort_order="Descending",
    metric_names=['test:accuracy'],
    parameter_names=['hidden_channels', 'epochs', 'dropout', 'optimizer']
)
analytic_table = trial_component_analytics.dataframe()

This builds a DataFrame where trials are sorted by decreasing test accuracy, and which shows only some of the hyperparameters for each trial.

for col in analytic_table.columns: 
    print(col) 

TrialComponentName
DisplayName
SourceArn
dropout
epochs
hidden_channels
optimizer
test:accuracy - Min
test:accuracy - Max
test:accuracy - Avg
test:accuracy - StdDev
test:accuracy - Last
test:accuracy - Count

From here on, your imagination is the limit. Pandas is the Swiss army knife of data analysis, and you’ll be able to compare trials and experiments in every possible way.

Last but not least, thanks to the integration with Amazon SageMaker Studio, you’ll be able to visualize all this information in real-time with predefined widgets. To learn more about Amazon SageMaker Studio, visit this blog post.

Now Available!
I just scratched the surface of what you can do with Amazon SageMaker Experiments, and I believe it will help you tame the wild horde of jobs that you have to deal with every day.

The service is available today in all commercial AWS Regions where Amazon SageMaker is available.

Give it a try and please send us feedback, either in the AWS forum for Amazon SageMaker, or through your usual AWS contacts.

– Julien

 

Now Available on Amazon SageMaker: The Deep Graph Library

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/now-available-on-amazon-sagemaker-the-deep-graph-library/

Today, we’re happy to announce that the Deep Graph Library, an open source library built for easy implementation of graph neural networks, is now available on Amazon SageMaker.

In recent years, deep learning has taken the world by storm thanks to its uncanny ability to extract elaborate patterns from complex data, such as free-form text, images, or videos. However, lots of datasets don’t fit these categories and are better expressed with graphs. Intuitively, we can feel that traditional neural network architectures like convolutional neural networks or recurrent neural networks are not a good fit for such datasets, and a new approach is required.

A Primer On Graph Neural Networks
Graph neural networks (GNN) are one of the most exciting developments in machine learning today, and these reference papers will get you started.

GNNs are used to train predictive models on datasets such as:

  • Social networks, where graphs show connections between related people,
  • Recommender systems, where graphs show interactions between customers and items,
  • Chemical analysis, where compounds are modeled as graphs of atoms and bonds,
  • Cybersecurity, where graphs describe connections between source and destination IP addresses,
  • And more!

Most of the time, these datasets are extremely large and only partially labeled. Consider a fraud detection scenario where we would try to predict the likelihood that an individual is a fraudulent actor by analyzing their connections to known fraudsters. This problem could be defined as a semi-supervised learning task, where only a fraction of graph nodes would be labeled (‘fraudster’ or ‘legitimate’). This should be a better solution than trying to build a large hand-labeled dataset, and “linearizing” it to apply traditional machine learning algorithms.

Working on these problems requires domain knowledge (retail, finance, chemistry, etc.), computer science knowledge (Python, deep learning, open source tools), and infrastructure knowledge (training, deploying, and scaling models). Very few people master all these skills, which is why tools like the Deep Graph Library and Amazon SageMaker are needed.

Introducing The Deep Graph Library
First released on GitHub in December 2018, the Deep Graph Library (DGL) is a Python open source library that helps researchers and scientists quickly build, train, and evaluate GNNs on their datasets.

DGL is built on top of popular deep learning frameworks like PyTorch and Apache MXNet. If you know either one of these, you’ll find yourself quite at home. No matter which framework you use, you can get started easily thanks to these beginner-friendly examples. I also found the slides and code for the GTC 2019 workshop very useful.

Once you’re done with toy examples, you can start exploring the collection of cutting edge models already implemented in DGL. For example, you can train a document classification model using a Graph Convolution Network (GCN) and the CORA dataset by simply running:

$ python3 train.py --dataset cora --gpu 0 --self-loop
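
To give a feel for what a graph convolution layer does, here is a deliberately simplified sketch in plain Python: each node replaces its feature with the average over itself and its neighbors. This is not DGL code and not the GCN trained by the command above, which also applies learned weights and a degree-based normalization; the mean_aggregate helper and the tiny graph are made up for illustration.

```python
def mean_aggregate(features, edges, self_loop=True):
    """One simplified propagation step: average each node with its neighbors."""
    neighbors = {node: set() for node in features}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    updated = {}
    for node, adjacent in neighbors.items():
        group = sorted(adjacent | {node}) if self_loop else sorted(adjacent)
        if not group:
            # Isolated node without a self-loop: keep its own feature.
            updated[node] = features[node]
            continue
        updated[node] = sum(features[n] for n in group) / len(group)
    return updated


features = {0: 1.0, 1: 0.0, 2: 4.0}   # one scalar feature per node
edges = [(0, 1), (1, 2)]                # a three-node path graph
smoothed = mean_aggregate(features, edges)
print(smoothed)
```

The self_loop switch plays a role similar in spirit to a self-loop option: with it, a node's own feature takes part in the average instead of being overwritten by its neighbors.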

The code for all models is available for inspection and tweaking. These implementations have been carefully validated by AWS teams, who verified performance claims and made sure results could be reproduced.

DGL also includes a collection of graph datasets that you can easily download and experiment with.

Of course, you can install and run DGL locally, but to make your life simpler, we added it to the Deep Learning Containers for PyTorch and Apache MXNet. This makes it easy to use DGL on Amazon SageMaker, in order to train and deploy models at any scale, without having to manage a single server. Let me show you how.

Using DGL On Amazon SageMaker
We added complete examples in the GitHub repository for SageMaker examples: one of them trains a simple GNN for molecular toxicity prediction using the Tox21 dataset.

The problem we’re trying to solve is figuring out the potential toxicity of new chemical compounds with respect to 12 different targets (receptors inside biological cells, etc.). As you can imagine, this type of analysis is crucial when designing new drugs, and being able to quickly predict results without having to run in vitro experiments helps researchers focus their efforts on the most promising drug candidates.

The dataset contains a little over 8,000 compounds: each one is modeled as a graph (atoms are vertices, atomic bonds are edges), and labeled 12 times (one label per target). Using a GNN, we’re going to build a multi-label binary classification model, allowing us to predict the potential toxicity of candidate molecules.
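
As a toy illustration of that representation, here is how a small molecule and its labels could be written down in plain Python. This is only a sketch: ethanol is used as a familiar example rather than taken from the dataset, the label values are made up, and DGL uses its own graph objects instead of the adjacency_list helper shown here.

```python
# Ethanol (CH3-CH2-OH), heavy atoms only: two carbons and one oxygen.
atoms = ['C', 'C', 'O']              # vertices
bonds = [(0, 1), (1, 2)]             # undirected edges between atom indices

# One binary label per target, 12 targets in total. Values are made up.
labels = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]


def adjacency_list(num_atoms, bonds):
    """Build an undirected adjacency list from a list of bonds."""
    neighbors = {i: [] for i in range(num_atoms)}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)
    return neighbors


graph = adjacency_list(len(atoms), bonds)
print(graph)
```

A multi-label binary classifier then outputs 12 probabilities per graph, one for each entry of the label vector.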

In the training script, we can easily download the dataset from the DGL collection.

from dgl.data.chem import Tox21
dataset = Tox21()

Similarly, we can easily build a GNN classifier using the DGL model zoo.

from dgl import model_zoo
model = model_zoo.chem.GCNClassifier(
    in_feats=args['n_input'],
    gcn_hidden_feats=[args['n_hidden'] for _ in range(args['n_layers'])],
    n_tasks=dataset.n_tasks,
    classifier_hidden_feats=args['n_hidden']).to(args['device'])

The rest of the code is mostly vanilla PyTorch, and you should be able to find your bearings if you’re familiar with this library.

When it comes to running this code on Amazon SageMaker, all we have to do is use a SageMaker Estimator, passing the full name of our DGL container, and the name of the training script as a hyperparameter.

estimator = sagemaker.estimator.Estimator(container,
    role,
    train_instance_count=1,
    train_instance_type='ml.p3.2xlarge',
    hyperparameters={'entrypoint': 'main.py'},
    sagemaker_session=sess)
code_location = sess.upload_data(CODE_PATH,
    bucket=bucket,
    key_prefix=custom_code_upload_location)
estimator.fit({'training-code': code_location})

<output removed>
epoch 23/100, batch 48/49, loss 0.4684
epoch 23/100, batch 49/49, loss 0.5389
epoch 23/100, training roc-auc 0.9451
EarlyStopping counter: 10 out of 10
epoch 23/100, validation roc-auc 0.8375, best validation roc-auc 0.8495
Best validation score 0.8495
Test score 0.8273
2019-11-21 14:11:03 Uploading - Uploading generated training model
2019-11-21 14:11:03 Completed - Training job completed
Training seconds: 209
Billable seconds: 209

Now, we could grab the trained model in S3, and use it to predict toxicity for a large number of compounds, without having to run actual experiments. Fascinating stuff!

Now Available!
You can start using DGL on Amazon SageMaker today.

Give it a try, and please send us feedback in the DGL forum, in the AWS forum for Amazon SageMaker, or through your usual AWS support contacts.

Julien