Tag Archives: machine learning

Split-Second Phantom Images Fool Autopilots

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2020/10/split-second-phantom-images-fool-autopilots.html

Researchers are tricking autopilots by inserting split-second images into roadside billboards.

Researchers at Israel’s Ben Gurion University of the Negev … previously revealed that they could use split-second light projections on roads to successfully trick Tesla’s driver-assistance systems into automatically stopping without warning when its camera sees spoofed images of road signs or pedestrians. In new research, they’ve found they can pull off the same trick with just a few frames of a road sign injected on a billboard’s video. And they warn that if hackers hijacked an internet-connected billboard to carry out the trick, it could be used to cause traffic jams or even road accidents while leaving little evidence behind.

[…]

In this latest set of experiments, the researchers injected frames of a phantom stop sign on digital billboards, simulating what they describe as a scenario in which someone hacked into a roadside billboard to alter its video. They also upgraded to Tesla’s most recent version of Autopilot known as HW3. They found that they could again trick a Tesla or cause the same Mobileye device to give the driver mistaken alerts with just a few frames of altered video.

The researchers found that an image that appeared for 0.42 seconds would reliably trick the Tesla, while one that appeared for just an eighth of a second would fool the Mobileye device. They also experimented with finding spots in a video frame that would attract the least notice from a human eye, going so far as to develop their own algorithm for identifying key blocks of pixels in an image so that a half-second phantom road sign could be slipped into the “uninteresting” portions.

The paper:

Abstract: In this paper, we investigate “split-second phantom attacks,” a scientific gap that causes two commercial advanced driver-assistance systems (ADASs), Tesla Model X (HW 2.5 and HW 3) and Mobileye 630, to treat a depthless object that appears for a few milliseconds as a real obstacle/object. We discuss the challenge that split-second phantom attacks create for ADASs. We demonstrate how attackers can apply split-second phantom attacks remotely by embedding phantom road signs into an advertisement presented on a digital billboard which causes Tesla’s autopilot to suddenly stop the car in the middle of a road and Mobileye 630 to issue false notifications. We also demonstrate how attackers can use a projector in order to cause Tesla’s autopilot to apply the brakes in response to a phantom of a pedestrian that was projected on the road and Mobileye 630 to issue false notifications in response to a projected road sign. To counter this threat, we propose a countermeasure which can determine whether a detected object is a phantom or real using just the camera sensor. The countermeasure (GhostBusters) uses a “committee of experts” approach and combines the results obtained from four lightweight deep convolutional neural networks that assess the authenticity of an object based on the object’s light, context, surface, and depth. We demonstrate our countermeasure’s effectiveness (it obtains a TPR of 0.994 with an FPR of zero) and test its robustness to adversarial machine learning attacks.

Pay as you go machine learning inference with AWS Lambda

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/pay-as-you-go-machine-learning-inference-with-aws-lambda/

This post is courtesy of Eitan Sela, Senior Startup Solutions Architect.

Many customers want to deploy machine learning models for real-time inference, and pay only for what they use. Using Amazon EC2 instances for real-time inference may not be cost effective to support sporadic inference requests throughout the day.

AWS Lambda is a serverless compute service with pay-per-use billing. However, ML frameworks like XGBoost are too large to fit into the 250 MB application artifact size limit, or the 512 MB /tmp space limit. While you can store the packages in Amazon S3 and download to Lambda (up to 3 GB), this can increase the cost.

To address this, Lambda functions can now mount an Amazon Elastic File System (EFS). This is a scalable and elastic NFS file system storing data within and across multiple Availability Zones (AZ) for high availability and durability.

With this new capability, it’s now easier to use Python packages in Lambda that require storage space to load models and other dependencies.

In this blog post, I walk through how to:

  • Create an EFS file system and an Access Point as an application-specific entry point.
  • Provision an EC2 instance, mount EFS using the Access Point, and train a breast cancer XGBoost ML model. XGBoost, Python packages, and the model are saved on the EFS file system.
  • Create a Lambda function that loads the Python packages and model from EFS, and performs the prediction based on a test event.

Create an Amazon EFS file system with an Access Point

Configuring EFS for Lambda is straightforward. I show how to do this using AWS CloudFormation, but you can also use the AWS CLI, AWS SDK, and AWS Serverless Application Model (AWS SAM).

EFS file systems are created within a customer VPC, so Lambda functions using the EFS file system must have access to the same VPC.

You can deploy the AWS CloudFormation stack located on this GitHub repository.

The stack includes the following:

  • A VPC with a public subnet
  • An EFS file system
  • An EFS Access Point
  • An EC2 instance in the VPC

It can take up to 10 minutes for the CloudFormation stack to create the resources. After the resource creation is complete, navigate to the EFS console to see the new file system.

EFS console

Navigate to the Access Points panel to see a new Access Point with the File system ID from the previous page.

Access Points panel

Note the Access Point ID and File System ID for the following sections.

Launch an Amazon EC2 instance to train a breast cancer model

In this section, you install Python packages on the EFS file system, after mounting it to EC2. You then train the breast cancer model, and save the model in the EFS file system used by the Lambda function.

The machine learning framework you use for this function is XGBoost. This is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. XGBoost is one of the most popular machine learning algorithms.

Navigate to the EC2 console to see the new EC2 instance created from the CloudFormation stack. This is an Amazon Linux 2 c5.large EC2 instance named ‘xgboost-for-serverless-inference-cfn-ec2’. In the instance details, you see that the security group is configured to allow inbound SSH access (for connecting to the instance).

Security Groups on instances page

Mount the EFS file system on the EC2

Connect to the instance using SSH and mount the EFS file system previously created by using the Access Point:

  1. Install amazon-efs-utils tools:
    sudo yum -y install amazon-efs-utils
  2. Create a directory to mount EFS into:
    mkdir efs
  3. Mount the EFS file system using the Access Point:
    sudo mount -t efs -o tls,accesspoint=<Access point ID> <File system ID>:/ efs

Console output

Install Python, pip and required packages

  1. Install Python and pip:
    sudo yum -y install python37
    curl -O https://bootstrap.pypa.io/get-pip.py
    python3 get-pip.py --user
  2. Verify the installation:
    python3 --version
    pip3 --version
  3. Create a requirements.txt file containing the dependencies:
    xgboost==1.1.1
    pandas
    sklearn
    joblib
  4. Install the Python packages using the requirements file:
    pip3 install -t efs/lib/ -r requirements.txt

    Note: when using bursting throughput mode with an EFS file system, this action can take up to 10 minutes.
  5. Set the Python path to refer to the installed packages directory of EFS file system:
    export PYTHONPATH=/home/ec2-user/efs/lib/

Train the breast cancer model

The breast cancer model predicts whether the breast mass is a malignant tumor or benign by looking at features computed from a digitized image of a fine needle aspirate of a breast mass.

The data used to train the model consists of the diagnosis in addition to the 10 real-valued features that are computed for each cell nucleus. Such features include radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension. The prediction returned by the model is either “B” for benign or “M” for malignant. This sample project uses the public Breast Cancer Wisconsin (Diagnostic) dataset.

After installing the required Python packages, train an XGBoost model on the breast cancer dataset:

  1. Create a bc_xgboost_train.py file containing the Python code needed to train a breast cancer XGBoost model. Download the code here (a hedged sketch of what such a script might look like appears after this list).
  2. Start the model training:
    python3 bc_xgboost_train.py

    The model file bc-xgboost-model is created in the root directory.
  3. Create a new directory on the EFS file system and copy the XGBoost breast cancer model:
    mkdir efs/model
    cp bc-xgboost-model efs/model/
  4. Check you have the required Python packages and the model on the EFS file system:
    ls efs/model/ efs/lib/

    You see all the Python packages installed previously in the lib directory, and the model file in the model directory.
  5. Review the total size of lib Python packages directory:
    du -sh efs/lib/

You can see that the total size of the lib directory is 534 MB. This is a larger package size than was allowed before EFS for Lambda.
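
The downloadable bc_xgboost_train.py is not reproduced in this post. As a rough orientation only, a minimal training script could look like the sketch below; it assumes scikit-learn's bundled copy of the Breast Cancer Wisconsin (Diagnostic) dataset and standard XGBoost APIs, and the real script may differ in features and output format.

# Hedged sketch of a breast cancer XGBoost training script; the actual
# bc_xgboost_train.py provided with the post may differ.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import xgboost as xgb

# Load the Breast Cancer Wisconsin (Diagnostic) dataset bundled with scikit-learn.
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Train a binary classifier on the dataset's numeric features.
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
params = {"objective": "binary:logistic", "eval_metric": "logloss"}
model = xgb.train(params, dtrain, num_boost_round=50, evals=[(dtest, "test")])

# Persist the trained booster so it can be copied to the EFS file system.
model.save_model("bc-xgboost-model")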

Building a serverless machine learning inference using Lambda

In this section, you use the EFS file system previously configured for the Lambda function to import the required libraries and load the model.

Using EFS with Lambda

The AWS SAM template creates the Lambda function, mounts the EFS Access Point created earlier, and creates both of the required IAM roles.
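
For reference, the same EFS attachment can also be made through the Lambda API outside of AWS SAM. The sketch below uses boto3; the function name, Access Point ARN, and mount path are placeholders, not values taken from the template.

# Hedged sketch: attach an EFS Access Point to an existing Lambda function via
# the API. All identifiers below are placeholders.
import boto3

lambda_client = boto3.client("lambda")
lambda_client.update_function_configuration(
    FunctionName="xgboost-inference",  # hypothetical function name
    FileSystemConfigs=[
        {
            "Arn": "arn:aws:elasticfilesystem:us-east-1:123456789012:access-point/fsap-0123456789abcdef0",
            "LocalMountPath": "/mnt/ml",  # mount path must start with /mnt/
        }
    ],
)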

It takes several minutes for the AWS SAM CLI to create the Lambda function. Afterwards, navigate to the Lambda console to see the created Lambda function.

Lambda console

In the Lambda function configuration, you see the environment variables, and basic settings, such as runtime, memory, and timeout.

Lambda function configuration

Further down, you see that the Lambda function has the VPC access configured, and the file system is mounted.

Lambda VPC configuration
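
The function code deployed by the template is not shown in this post. As a hedged illustration, a handler along the following lines loads the packages and the model from the EFS mount; the /mnt/ml mount path and the event shape are assumptions and may not match the repository's code.

# Hypothetical Lambda handler: import dependencies and load the XGBoost model
# from the EFS mount, then predict on a feature vector from the test event.
import sys

# Make the Python packages installed on EFS importable before importing them.
sys.path.append("/mnt/ml/lib")

import numpy as np     # noqa: E402 (imported after the path tweak on purpose)
import xgboost as xgb  # noqa: E402

booster = xgb.Booster()
booster.load_model("/mnt/ml/model/bc-xgboost-model")  # loaded once per container

def lambda_handler(event, context):
    features = np.array(event["features"], dtype=float).reshape(1, -1)
    prediction = booster.predict(xgb.DMatrix(features))
    return {"prediction": float(prediction[0])}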

Test your Lambda function

  1. In the Lambda console, select Configure test events from the Test events dropdown.
  2. For Event Name, enter InferenceTestEvent.
  3. Copy the event JSON from here and paste it in the dialog box.
  4. Choose Create. After saving, you see InferenceTestEvent in the Test list. Now choose Test.

You see the Lambda function inference result, log output, and duration:

Lambda function result

Conclusion

In this blog post, you train an XGBoost breast cancer model using Python packages installed on an Amazon EFS file system. You then create an AWS Lambda function that loads the Python packages and the model from the EFS file system, and performs the predictions.

Now you know how to call a machine learning model for inference using a Lambda function. To learn more, see other real-world examples on the AWS Compute Blog.

AWS Architecture Monthly Magazine: Robotics

Post Syndicated from Annik Stahl original https://aws.amazon.com/blogs/architecture/architecture-monthly-magazine-robotics/

Architecture Monthly: Robotics

September’s issue of AWS Architecture Monthly is all about robotics. Discover why iRobot, the creator of your favorite (though maybe not your pet’s favorite) little robot vacuum, decided to move its mission-critical platform to the serverless architecture of AWS. Learn how and why you sometimes need to test in a virtual environment instead of a physical one. You’ll also have the opportunity to hear from technical experts from across the robotics industry who came together for the AWS Cloud Robotics Summit in August.

Our expert this month, Matt Hansen (who has dreamed of building robots since he was a teen), gives us his outlook for the industry and explains why cloud will be an essential part of that.

In September’s Robotics issue

  • Ask an Expert: Matt Hansen, Principal Solutions Architect
  • Blog: Testing a PR2 Robot in a Simulated Hospital
  • Case Study: iRobot
  • Blog: Introduction to Automatic Testing of Robotics Applications
  • Case Study: Multiply Labs Uses AWS RoboMaker to Manufacture Individualized Medicines
  • Demos & Videos: AWS Cloud Robotics Summit (August 18-19, 2020)
  • Related Videos: iRobot and ZS Associates

Survey opportunity

This month, we’re also asking you to take a 10-question survey about your experiences with this magazine. The survey is hosted by an external company (Qualtrics), so the below survey button doesn’t lead to our website. Please note that AWS will own the data gathered from this survey, and we will not share the results we collect with survey respondents. Your responses to this survey will be subject to Amazon’s Privacy Notice. Please take a few moments to give us your opinions.

How to access the magazine

We hope you’re enjoying Architecture Monthly, and we’d like to hear from you—leave us a star rating and comment on the Amazon Kindle Newsstand page or contact us anytime at [email protected].

AWS Architecture Monthly Magazine: Agriculture

Post Syndicated from Annik Stahl original https://aws.amazon.com/blogs/architecture/aws-architecture-monthly-magazine-agriculture/

Architecture Monthly Magazine cover - Agriculture

In this month’s issue of AWS Architecture Monthly, Worldwide Tech Lead for Agriculture, Karen Hildebrand (who’s also a fourth generation farmer) refers to agriculture as “the connective tissue our world needs to survive.” As our expert for August’s Agriculture issue, she also talks about what role cloud will play in future development efforts in this industry and why developing personal connections with our AWS agriculture customers is one of the most important aspects of our jobs.

You’ll also buzz through the world of high tech beehives, milk the information about data analytics-savvy cows, and see what the reference architecture of a Smart Farm looks like.

In August’s Agriculture issue

  • Ask an Expert: Karen Hildebrand, AWS WW Agriculture Tech Leader
  • Customer Success Story: Tine & Crayon: Revolutionizing the Norwegian Dairy Industry Using Machine Learning on AWS
  • Blog Post: Beewise Combines IoT and AI to Offer an Automated Beehive
  • Reference Architecture: Smart Farm: Enabling Sensor, Computer Vision, and Edge Inference in Agriculture
  • Customer Success Story: Farmobile: Empowering the Agriculture Industry Through Data
  • Blog Post: The Cow Collar Wearable: How Halter benefits from FreeRTOS
  • Related Videos: DuPont, mPrest & Netafim, and Veolia

Survey opportunity

This month, we’re also asking you to take a 10-question survey about your experiences with this magazine. The survey is hosted by an external company (Qualtrics), so the below survey button doesn’t lead to our website. Please note that AWS will own the data gathered from this survey, and we will not share the results we collect with survey respondents. Your responses to this survey will be subject to Amazon’s Privacy Notice. Please take a few moments to give us your opinions.

How to access the magazine

We hope you’re enjoying Architecture Monthly, and we’d like to hear from you—leave us a star rating and comment on the Amazon Kindle Newsstand page or contact us anytime at [email protected].

Computational Causal Inference at Netflix

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/computational-causal-inference-at-netflix-293591691c62

Jeffrey Wong, Colin McFarland

Every Netflix data scientist, whether their background is from biology, psychology, physics, economics, math, statistics, or biostatistics, has made meaningful contributions to the way Netflix analyzes causal effects. Scientists from these fields have made many advancements in causal effects research in the past few decades, spanning instrumental variables, forest methods, heterogeneous effects, time-dynamic effects, quantile effects, and much more. These methods can provide rich information for decision making, such as in experimentation platforms (“XP”) or in algorithmic policy engines.

We want to amplify the effectiveness of our researchers by providing them software that can estimate causal effects models efficiently, and can integrate causal effects into large engineering systems. This can be challenging when algorithms for causal effects need to fit a model, condition on context and possible actions to take, score the response variable, and compute differences between counterfactuals. Computation can explode and become overwhelming when this is done with large datasets, with high dimensional features, with many possible actions to choose from, and with many responses. In order to gain broad software integration of causal effects models, a significant investment in software engineering, especially in computation, is needed. To address the challenges, Netflix has been building an interdisciplinary field across causal inference, algorithm design, and numerical computing, which we now want to share with the rest of the industry as computational causal inference (CompCI). A whitepaper detailing the field can be found here.

Computational causal inference brings a software implementation focus to causal inference, especially in regards to high performance numerical computing. We are implementing several algorithms to be highly performant, with a low memory footprint. As an example, our XP is pivoting away from two sample t-tests to models that estimate average effects, heterogeneous effects, and time-dynamic treatment effects. These effects help the business understand the user base, different segments in the user base, and whether there are trends in segments over time. We also take advantage of user covariates throughout these models in order to increase statistical power. While this rich analysis helps to inform business strategy and increase member joy, the volume of the data demands large amounts of memory, and the estimation of the causal effects on such volume of data is computationally heavy.

In the past, the computations for covariate adjusted heterogeneous effects and time-dynamic effects were slow, memory heavy, hard to debug, a large source of engineering risk, and ultimately could not scale to many large experiments. Using optimizations from CompCI, we can estimate hundreds of conditional average effects and their variances on a dataset with 10 million observations in 10 seconds, on a single machine. In the extreme, we can also analyze conditional time dynamic treatment effects for hundreds of millions of observations on a single machine in less than one hour. To achieve this, we leverage a software stack that is completely optimized for sparse linear algebra, a lossless data compression strategy that can reduce data volume, and mathematical formulas that are optimized specifically for estimating causal effects. We also optimize for memory and data alignment.
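
As a toy illustration of the kind of computation involved (this is not Netflix's implementation), a covariate-adjusted average treatment effect can be estimated with a sparse design matrix and a sparse least-squares solver:

# Toy sketch: covariate-adjusted average treatment effect with sparse linear
# algebra. Data, model, and scale are illustrative, not Netflix's.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lsqr

rng = np.random.default_rng(0)
n = 100_000
treatment = rng.integers(0, 2, n)        # randomized treatment assignment
covariate = rng.normal(size=n)           # a user covariate for variance reduction
y = 1.0 + 0.3 * treatment + 0.5 * covariate + rng.normal(size=n)

# Sparse design matrix: intercept, treatment indicator, covariate.
X = sp.hstack([
    sp.csr_matrix(np.ones((n, 1))),
    sp.csr_matrix(treatment.reshape(-1, 1).astype(float)),
    sp.csr_matrix(covariate.reshape(-1, 1)),
]).tocsr()

beta = lsqr(X, y)[0]                     # OLS via sparse least squares
print(f"estimated average treatment effect: {beta[1]:.3f}")  # close to 0.3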

This level of computing affords us a lot of luxury. First, the ability to scale complex models means we can deliver rich insights for the business. Second, being able to analyze large datasets for causal effects in seconds increases research agility. Third, analyzing data on a single machine makes debugging easy. Finally, the scalability makes computation for large engineering systems tractable, reducing engineering risk.

Computational causal inference is a new, interdisciplinary field we are announcing because we want to build it collectively with the broader community of experimenters, researchers, and software engineers. The integration of causal inference into engineering systems can lead to large amounts of new innovation. Being an interdisciplinary field, it truly requires the community of local, domain experts to unite. We have released a whitepaper to begin the discussion. There, we describe the rising demand for scalable causal inference in research and in software engineering systems. Then, we describe the state of common causal effects models. Afterwards, we describe what we believe can be a good software framework for estimating and optimizing for causal effects.

Finally, we close the CompCI whitepaper with a series of open challenges that we believe require interdisciplinary collaboration and can unite the community. For example:

  1. Time dynamic treatment effects are notoriously hard to scale. They require a panel of repeated observations, which generate large datasets. They also contain autocorrelation, creating complications for estimating the variance of the causal effect. How can we make the computation for the time-dynamic treatment effect, and its distribution, more scalable?
  2. In machine learning, specifying a loss function and optimizing it using numerical methods allows a developer to interact with a single, umbrella framework that can span several models. Can such an umbrella framework exist to specify different causal effects models in a unified way? For example, could it be done through the generalized method of moments? Can it be computationally tractable?
  3. How should we develop software that understands if a causal parameter is identified? A solution to this helps to create software that is safe to use, and can provide safe, programmatic access to the analysis of causal effects. We believe there are many edge cases in identification that require an interdisciplinary group to solve.

We hope this begins the discussion, and over the coming months we will be sharing more on the research we have done to make estimation of causal effects performant. There are still many more challenges in the field that are not listed here. We want to form a community spanning experimenters, researchers, and software engineers to learn about problems and solutions together. If you are interested in being part of this community, please reach out to us at [email protected]


Computational Causal Inference at Netflix was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

TensorFlow Serving on Kubernetes with Amazon EC2 Spot Instances

Post Syndicated from Ben Peven original https://aws.amazon.com/blogs/compute/tensorflow-serving-on-kubernetes-spot-instances/

This post is contributed by Kinnar Sen – Sr. Specialist Solutions Architect, EC2 Spot

TensorFlow (TF) is a popular choice for machine learning research and application development. It’s a machine learning (ML) platform, which is used to build (train) and deploy (serve) machine learning models. TF Serving is a part of the TF framework and is used for deploying ML models in production environments. TF Serving can be containerized using Docker and deployed in a cluster with Kubernetes. It is easy to run production grade workloads on Kubernetes using Amazon Elastic Kubernetes Service (Amazon EKS), a managed service for creating and managing Kubernetes clusters. To cost optimize the TF Serving workloads, you can use Amazon EC2 Spot Instances. Spot Instances are spare EC2 capacity available at up to a 90% discount compared to On-Demand Instance prices.

In this post I will illustrate deployment of TensorFlow Serving using Kubernetes via Amazon EKS and Spot Instances to build a scalable, resilient, and cost optimized machine learning inference service.

Overview

About TensorFlow Serving (TF Serving)

TensorFlow Serving is the recommended way to serve TensorFlow models. A flexible, high-performance system for serving models, TF Serving enables users to quickly deploy models to production environments. It provides out-of-the-box integration with TF models and can be extended to serve other kinds of models and data. TF Serving deploys a model server with gRPC/REST endpoints and can be used to serve multiple models (or versions). Requests can be served in two ways: batched or one by one. Batching is often used to unlock the high throughput of hardware accelerators (if used for inference) for offline high volume inference jobs.

Amazon EC2 Spot Instances

Spot Instances are spare Amazon EC2 capacity that enables customers to save up to 90% over On-Demand Instance prices. The price of Spot Instances is determined by long-term trends in supply and demand of spare capacity pools. Capacity pools can be defined as a group of EC2 instances belonging to particular instance family, size, and Availability Zone (AZ). If EC2 needs capacity back for On-Demand usage, Spot Instances can be interrupted by EC2 with a two-minute notification. There are many graceful ways to handle the interruption to ensure that the application is well architected for resilience and fault tolerance. This can be automated via the application and/or infrastructure deployments. Spot Instances are ideal for stateless, fault tolerant, loosely coupled and flexible workloads that can handle interruptions.

TensorFlow Serving (TF Serving) and Kubernetes

Each pod in a Kubernetes cluster runs a TF Docker image with TF Serving-based server and a model. The model contains the architecture of TensorFlow Graph, model weights and assets. This is a deployment setup with configurable number of replicas. The replicas are exposed externally by a service and an External Load Balancer that helps distribute the requests to the service endpoints. To keep up with the demands of service, Kubernetes can help scale the number of replicated pods using Kubernetes Replication Controller.

Architecture

There are a couple of goals that we want to achieve through this solution.

  • Cost optimization – By using EC2 Spot Instances
  • High throughput – By using Application Load Balancer (ALB) created by Ingress Controller
  • Resilience – Ensuring high availability by replenishing nodes and gracefully handling the Spot interruptions
  • Elasticity – By using Horizontal Pod Autoscaler, Cluster Autoscaler, and EC2 Auto Scaling

This can be achieved by using the following components.

Component | Role | Details | Deployment Method
Cluster Autoscaler | Scales EC2 instances automatically according to pods running in the cluster | Open source | A deployment on On-Demand Instances
EC2 Auto Scaling group | Provisions and maintains EC2 instance capacity | AWS | AWS CloudFormation via eksctl
AWS Node Termination Handler | Detects EC2 Spot interruptions and automatically drains nodes | Open source | A DaemonSet on Spot and On-Demand Instances
AWS ALB Ingress Controller | Provisions and maintains Application Load Balancer | Open source | A deployment on On-Demand Instances

You can find more details about each component in this AWS blog. Let’s go through the steps that allow the deployment to be elastic.

  1. HTTP requests flow in through the ALB and Ingress object.
  2. The Horizontal Pod Autoscaler (HPA) monitors the metrics (CPU / RAM), and once the threshold is breached a replica (pod) is launched.
  3. If there are sufficient cluster resources, the pod starts running; otherwise it goes into a pending state.
  4. If one or more pods are in a pending state, the Cluster Autoscaler (CA) triggers a scale-up request to the Auto Scaling group.
    1. If HPA tries to schedule more pods than the cluster can currently support, CA adds capacity to support that.
  5. The Auto Scaling group provisions a new node and the application scales up.
  6. A scale down happens in the reverse fashion when requests start tapering off.

AWS ALB Ingress controller and ALB

We will be using an ALB along with an Ingress resource instead of the default External Load Balancer created by the TF Serving deployment. The open source AWS ALB Ingress controller triggers the creation of an ALB and the necessary supporting AWS resources whenever a Kubernetes user declares an Ingress resource in the cluster. The Ingress resource uses the ALB to route HTTP(S) traffic to different endpoints within the cluster. ALB is ideal for advanced load balancing of HTTP and HTTPS traffic. ALB provides advanced request routing targeted at delivery of modern application architectures, including microservices and container-based applications. This allows the deployment to maintain a high throughput and improve load balancing.

Spot Instance interruptions

To gracefully handle interruptions, we will use the AWS node termination handler. This handler runs a DaemonSet (one pod per node) on each host to perform monitoring and react accordingly. When it receives the Spot Instance 2-minute interruption notification, it uses the Kubernetes API to cordon the node. This is done by tainting it to ensure that no new pods are scheduled there, then it drains it, removing any existing pods from the ALB.

One of the best practices for using Spot is diversification where instances are chosen from across different instance types, sizes, and Availability Zone. The capacity-optimized allocation strategy for EC2 Auto Scaling provisions Spot Instances from the most-available Spot Instance pools by analyzing capacity metrics, thus lowering the chance of interruptions.

Tutorial

Set up the cluster

We are using eksctl to create an Amazon EKS cluster with the name TensorFlowServingCluster in combination with a managed node group. The managed node group has three On-Demand t3.medium nodes and it will bootstrap with the labels lifecycle=OnDemand and intent=control-apps. Be sure to replace <YOUR REGION> with the Region you are launching your cluster into.

eksctl create cluster --name=TensorFlowServingCluster --node-private-networking --managed --nodes=3 --alb-ingress-access --region=<YOUR REGION> --node-type t3.medium --node-labels="lifecycle=OnDemand,intent=control-apps" --asg-access

Check the nodes provisioned by using kubectl get nodes.

Create the NodeGroups now. You create the eksctl configuration file first. Copy the nodegroup configuration below and create a file named spot_nodegroups.yml. Then run the command using eksctl below to add the new Spot nodes to the cluster.

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
    name: TensorFlowServingCluster
    region: <YOUR REGION>
nodeGroups:
    - name: prod-4vcpu-16gb-spot
      minSize: 0
      maxSize: 15
      desiredCapacity: 10
      instancesDistribution:
        instanceTypes: ["m5.xlarge", "m5d.xlarge", "m4.xlarge","t3.xlarge","t3a.xlarge","m5a.xlarge","t2.xlarge"] 
        onDemandBaseCapacity: 0
        onDemandPercentageAboveBaseCapacity: 0
        spotAllocationStrategy: capacity-optimized
      labels:
        lifecycle: Ec2Spot
        intent: apps
        aws.amazon.com/spot: "true"
      tags:
        k8s.io/cluster-autoscaler/node-template/label/lifecycle: Ec2Spot
        k8s.io/cluster-autoscaler/node-template/label/intent: apps
      iam:
        withAddonPolicies:
          autoScaler: true
          albIngress: true
    - name: prod-8vcpu-32gb-spot
      minSize: 0
      maxSize: 15
      desiredCapacity: 10
      instancesDistribution:
        instanceTypes: ["m5.2xlarge", "m5n.2xlarge", "m5d.2xlarge", "m5dn.2xlarge","m5a.2xlarge", "m4.2xlarge"] 
        onDemandBaseCapacity: 0
        onDemandPercentageAboveBaseCapacity: 0
        spotAllocationStrategy: capacity-optimized
      labels:
        lifecycle: Ec2Spot
        intent: apps
        aws.amazon.com/spot: "true"
      tags:
        k8s.io/cluster-autoscaler/node-template/label/lifecycle: Ec2Spot
        k8s.io/cluster-autoscaler/node-template/label/intent: apps
      iam:
        withAddonPolicies:
          autoScaler: true
          albIngress: true

eksctl create nodegroup -f spot_nodegroups.yml

A few points to note here (for more technical details, refer to the EC2 Spot workshop):

  • There are two diversified node groups created with a fixed vCPU:Memory ratio. This adheres to the Spot best practice of diversifying instances, and helps the Cluster Autoscaler function properly.
  • Capacity-optimized Spot allocation strategy is used in both the node groups.

Once the nodes are created, you can check the number of instances provisioned using the command below. It should display 20 as we configured each of our two node groups with a desired capacity of 10 instances.

kubectl get nodes --selector=lifecycle=Ec2Spot | expr $(wc -l) - 1

The cluster setup is complete.

Install the AWS Node Termination Handler

kubectl apply -f https://github.com/aws/aws-node-termination-handler/releases/download/v1.3.1/all-resources.yaml

This installs the Node Termination Handler on both Spot Instance and On-Demand Instance nodes. This helps the handler respond to both EC2 maintenance events and Spot Instance interruptions.

Deploy Cluster Autoscaler

For additional detail, see the Amazon EKS page here. Next, download the Cluster Autoscaler configuration file:

curl -o cluster_autoscaler.yml https://raw.githubusercontent.com/awslabs/ec2-spot-workshops/master/content/using_ec2_spot_instances_with_eks/scaling/deploy_ca.files/cluster_autoscaler.yml

Open the downloaded file and edit it: add the AWS Region and the cluster name.

Run the commands below to deploy Cluster Autoscaler.

kubectl apply -f cluster_autoscaler.yml

kubectl -n kube-system annotate deployment.apps/cluster-autoscaler cluster-autoscaler.kubernetes.io/safe-to-evict="false"

Use this command to see into the Cluster Autoscaler (CA) logs to find NodeGroups auto-discovered. Use Ctrl + C to abort the log view.

kubectl logs -f deployment/cluster-autoscaler -n kube-system --tail=10

Deploy TensorFlow Serving

TensorFlow Model Server is deployed in pods, and the model is loaded from Amazon S3.

Amazon S3 access

We are using Kubernetes Secrets to store and manage the AWS Credentials for S3 Access.

Copy the following and create a file called kustomization.yml. Add the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY details in the file.

namespace: default
secretGenerator:
- name: s3-credentials
  literals:
  - AWS_ACCESS_KEY_ID=<<AWS_ACCESS_KEY_ID>>
  - AWS_SECRET_ACCESS_KEY=<<AWS_SECRET_ACCESS_KEY>>
generatorOptions:
  disableNameSuffixHash: true

Create the secret file and deploy.

kubectl kustomize . > secret.yaml
kubectl apply -f secret.yaml

We recommend using Sealed Secrets for production workloads. Sealed Secrets provide a mechanism to encrypt a Secret object, making it more secure. For further details, take a look at the AWS workshop here.

ALB Ingress Controller

Deploy RBAC Roles and RoleBindings needed by the AWS ALB Ingress controller.

kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/aws-alb-ingress-controller/v1.1.4/docs/examples/rbac-role.yaml

Download the AWS ALB Ingress controller YAML into a local file.

curl -sS "https://raw.githubusercontent.com/kubernetes-sigs/aws-alb-ingress-controller/v1.1.4/docs/examples/alb-ingress-controller.yaml" > alb-ingress-controller.yaml

Change the --cluster-name flag to ‘TensorFlowServingCluster’ and add the Region details under --aws-region. Also add the lines below just before the ‘serviceAccountName’.

nodeSelector:
    lifecycle: OnDemand

Deploy the AWS ALB Ingress controller and verify that it is running.

kubectl apply -f alb-ingress-controller.yaml
kubectl logs -n kube-system $(kubectl get po -n kube-system | grep alb-ingress | awk '{print $1}')

Deploy the application

Next, download a model as explained in the TF official documentation, then upload it to Amazon S3.

mkdir /tmp/resnet

curl -s http://download.tensorflow.org/models/official/20181001_resnet/savedmodels/resnet_v2_fp32_savedmodel_NHWC_jpg.tar.gz | \
tar --strip-components=2 -C /tmp/resnet -xvz

RANDOM_SUFFIX=$(cat /dev/urandom | tr -dc 'a-z0-9' | fold -w 10 | head -n 1)

S3_BUCKET="resnet-model-k8serving-${RANDOM_SUFFIX}"
aws s3 mb s3://${S3_BUCKET}
aws s3 sync /tmp/resnet/1538687457/ s3://${S3_BUCKET}/resnet/1/

Copy the following code and create a file named tf_deployment.yml. Don’t forget to replace <AWS_REGION> with the AWS Region you plan to use.

A few things to note here:

  • NodeSelector is used to route the TF Serving replica pods to Spot Instance nodes.
  • ServiceType LoadBalancer is used.
  • model_base_path is pointed at Amazon S3. Replace <S3_BUCKET> with the S3 bucket name you created in the previous step.
apiVersion: v1
kind: Service
metadata:
  labels:
    app: resnet-service
  name: resnet-service
spec:
  ports:
  - name: grpc
    port: 9000
    targetPort: 9000
  - name: http
    port: 8500
    targetPort: 8500
  selector:
    app: resnet-service
  type: LoadBalancer
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: resnet-service
  name: resnet-v1
spec:
  replicas: 25
  selector:
    matchLabels:
      app: resnet-service
  template:
    metadata:
      labels:
        app: resnet-service
        version: v1
    spec:
      nodeSelector:
        lifecycle: Ec2Spot
      containers:
      - args:
        - --port=9000
        - --rest_api_port=8500
        - --model_name=resnet
        - --model_base_path=s3://<S3_BUCKET>/resnet/
        command:
        - /usr/bin/tensorflow_model_server
        env:
        - name: AWS_REGION
          value: <AWS_REGION>
        - name: S3_ENDPOINT
          value: s3.<AWS_REGION>.amazonaws.com
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              name: s3-credentials
              key: AWS_ACCESS_KEY_ID
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: s3-credentials
              key: AWS_SECRET_ACCESS_KEY        
        image: tensorflow/serving
        imagePullPolicy: IfNotPresent
        name: resnet-service
        ports:
        - containerPort: 9000
        - containerPort: 8500
        resources:
          limits:
            cpu: "4"
            memory: 4Gi
          requests:
            cpu: "2"
            memory: 2Gi

Deploy the application.

kubectl apply -f tf_deployment.yml

Copy the code below and create a file named ingress.yml.

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: "resnet-service"
  namespace: "default"
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
  labels:
    app: resnet-service
spec:
  rules:
    - http:
        paths:
          - path: "/v1/models/resnet:predict"
            backend:
              serviceName: "resnet-service"
              servicePort: 8500

Deploy the ingress.

kubectl apply -f ingress.yml

Deploy the Metrics Server and Horizontal Pod Autoscaler, which scales up when CPU/Memory exceeds 50% of the allocated container resource.

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/download/v0.3.6/components.yaml
kubectl autoscale deployment resnet-v1 --cpu-percent=50 --min=20 --max=100

Load testing

Download the Python helper file written for testing the deployed application.

curl -o submit_mc_tf_k8s_requests.py https://raw.githubusercontent.com/awslabs/ec2-spot-labs/master/tensorflow-serving-load-testing-sample/python/submit_mc_tf_k8s_requests.py

Get the address of the Ingress using the command below.

kubectl get ingress resnet-service

Create a Python virtual environment and install the library requirements.

pip3 install virtualenv
virtualenv venv
source venv/bin/activate
pip3 install tqdm
pip3 install requests

Run the following command to warm up the cluster after replacing the Ingress address. You will be running a Python application for predicting the class of a downloaded image against the ResNet model, which is being served by the TF Serving REST API. You are running multiple parallel processes for that purpose. Here “p” is the number of processes and “r” the number of requests for each process.

python submit_mc_tf_k8s_requests.py -p 100 -r 100 -u 'http://<INGRESS ADDRESS>:80/v1/models/resnet:predict'
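
For reference, a single request to the same endpoint looks like the sketch below. It assumes the JPEG-input serving signature of the ResNet SavedModel used above (base64-encoded image bytes under a "b64" key); the ingress address and image URL are placeholders.

# Hedged single-request example against the TF Serving REST API behind the ALB.
# INGRESS_ADDRESS and the image URL are placeholders/assumptions.
import base64
import requests

INGRESS_ADDRESS = "<INGRESS ADDRESS>"  # replace with your ALB hostname
IMAGE_URL = "https://tensorflow.org/images/blogs/serving/cat.jpg"  # any JPEG works

jpeg_bytes = requests.get(IMAGE_URL).content
payload = {"instances": [{"b64": base64.b64encode(jpeg_bytes).decode("utf-8")}]}

response = requests.post(
    f"http://{INGRESS_ADDRESS}/v1/models/resnet:predict", json=payload, timeout=30
)
print(response.json())  # predicted class and probabilities for the image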


You can use the command below to observe the scaling of the cluster.

kubectl get hpa -w

We ran the above again with 10,000 requests per process, sending a total of 1 million requests to the application. The results are below:

The deployment was able to serve ~300 requests per second with an average latency of ~320 ms per request.

Cleanup

Now that you’ve successfully deployed and run TensorFlow Serving using EC2 Spot Instances, it’s time to clean up your environment. Remove the ingress, deployment, and ingress controller.

kubectl delete -f ingress.yml
kubectl delete -f tf_deployment.yml
kubectl delete -f alb-ingress-controller.yaml

Remove the model files from Amazon S3.

aws s3 rb s3://${S3_BUCKET}/ --force 

Delete the node groups and the cluster.

eksctl delete nodegroup -f spot_nodegroups.yml --approve
eksctl delete cluster --name TensorFlowServingCluster

Conclusion

In this blog, we demonstrated how TensorFlow Serving can be deployed onto a Kubernetes cluster running on Spot Instances, achieving both resilience and cost optimization. There are multiple optimizations that can be applied to TensorFlow Serving to further improve performance. This deployment can be extended and used for serving multiple models with different versions. We hope you consider running TensorFlow Serving using EC2 Spot Instances to cost optimize the solution.

Building a Self-Service, Secure, & Continually Compliant Environment on AWS

Post Syndicated from Japjot Walia original https://aws.amazon.com/blogs/architecture/building-a-self-service-secure-continually-compliant-environment-on-aws/

Introduction

If you’re an enterprise organization, especially in a highly regulated sector, you understand the struggle to innovate and drive change while maintaining your security and compliance posture. In particular, your banking customers’ expectations and needs are changing, and there is a broad move away from traditional branch and ATM-based services towards digital engagement.

With this shift, customers now expect personalized product offerings and services tailored to their needs. To achieve this, a broad spectrum of analytics and machine learning (ML) capabilities are required. With security and compliance at the top of financial service customers’ agendas, being able to rapidly innovate and stay secure is essential. To achieve exactly that, AWS Professional Services engaged with a major Global systemically important bank (G-SIB) customer to help develop ML capabilities and implement a Defense in Depth (DiD) security strategy. This blog post provides an overview of this solution.

The machine learning solution

The following architecture diagram shows the ML solution we developed for a customer. This architecture is designed to achieve innovation, operational performance, and security performance in line with customer-defined control objectives, as well as meet the regulatory and compliance requirements of supervisory authorities.

Machine learning solution developed for customer

This solution is built and automated using AWS CloudFormation templates with pre-configured security guardrails and abstracted through the service catalog. AWS Service Catalog allows you to quickly let your users deploy approved IT services ensuring governance, compliance, and security best practices are enforced during the provisioning of resources.

Further, it leverages Amazon SageMaker, Amazon Simple Storage Service (S3), and Amazon Relational Database Service (RDS) to facilitate the development of advanced ML models. As security is paramount for this workload, data in S3 is encrypted using client-side encryption and column-level encryption on columns in RDS. Our customer also codified their security controls via AWS Config rules to achieve continual compliance.

Compute and network isolation

To enable our customer to rapidly explore new ML models while achieving the highest standards of security, separate VPCs were used to isolate infrastructure, with access controlled by security groups. Core to this solution is Amazon SageMaker, a fully managed service that provides the ability to rapidly build, train, and deploy ML models. Amazon SageMaker notebooks are managed Jupyter notebooks that:

  1. Prepare and process data
  2. Write code to train models
  3. Deploy models to SageMaker hosting
  4. Test or validate models

In our solution, notebooks run in an isolated VPC with no egress connectivity other than VPC endpoints, which enable private communication with AWS services. When used in conjunction with VPC endpoint policies, these endpoints control access to those services. In our solution, this is used to allow the SageMaker notebook to communicate only with resources owned by AWS Organizations through the use of the aws:PrincipalOrgID condition key. AWS Organizations helps provide governance to meet strict compliance regulation, and you can use the aws:PrincipalOrgID condition key in your resource-based policies to easily restrict access to Identity and Access Management (IAM) principals from accounts in your organization.
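
As a hedged sketch (the identifiers are placeholders, and this is not the customer's actual policy), a VPC endpoint policy restricted to your own AWS Organization with aws:PrincipalOrgID can be applied like this:

# Sketch: restrict a VPC endpoint to principals from your AWS Organization
# using the aws:PrincipalOrgID condition key. IDs below are placeholders.
import json
import boto3

endpoint_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": "*",
            "Action": "*",
            "Resource": "*",
            "Condition": {"StringEquals": {"aws:PrincipalOrgID": "o-exampleorgid"}},
        }
    ],
}

ec2 = boto3.client("ec2")
ec2.modify_vpc_endpoint(
    VpcEndpointId="vpce-0123456789abcdef0",      # hypothetical endpoint ID
    PolicyDocument=json.dumps(endpoint_policy),
)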

Data protection

Amazon S3 is used to store training data, model artifacts, and other data sets. Our solution uses server-side encryption with customer master keys (CMKs) stored in AWS Key Management Service (SSE-KMS) to protect data at rest. SSE-KMS leverages KMS and uses an envelope encryption strategy with CMKs. Envelope encryption is the practice of encrypting data with a data key and then encrypting that data key using another key – the CMK. CMKs are created in KMS and never leave KMS unencrypted. This approach allows fine-grained control over access to the CMK and logs all access, and attempted access, to the key in Amazon CloudTrail. In our solution, the age of the CMK is tracked by AWS Config and the key is regularly rotated. AWS Config enables you to assess, audit, and evaluate the configurations of deployed AWS resources by continuously monitoring and recording AWS resource configurations. This allows you to automate the evaluation of recorded configurations against desired configurations.
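
For illustration only (the bucket, key, and CMK ID are placeholders), requesting SSE-KMS on an individual S3 upload looks like this:

# Sketch: upload an object to S3 with SSE-KMS, referencing a CMK by ID.
import boto3

s3 = boto3.client("s3")
s3.put_object(
    Bucket="example-training-data-bucket",               # placeholder bucket
    Key="datasets/training.csv",
    Body=b"feature_1,feature_2,label\n0.1,0.2,1\n",
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="1234abcd-12ab-34cd-56ef-1234567890ab",   # placeholder CMK ID
)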

Amazon S3 Block Public Access is also used at an account level to ensure that bucket policies or access-control lists (ACLs) on existing and newly created resources don’t allow public access. Service control policies (SCPs) are used to prevent users from modifying this setting. AWS Config continually monitors S3 and remediates any attempt to make a bucket public.

Data in the solution are classified according to their sensitivity, which corresponds to the customer’s data classification hierarchy. Classification in the solution is achieved through resource tagging, and tags are used in conjunction with AWS Config to ensure adherence to encryption, data retention, and archival requirements.

Continuous compliance

Our solution adopts a continuous compliance approach, whereby the compliance status of the architecture is continuously evaluated and auto-remediated if a configuration change attempts to violate the compliance posture. To achieve this, AWS Config and config rules are used to confirm that resources are configured in compliance with defined policies. AWS Lambda is used to implement a custom rule set that extends the rules included in AWS Config.
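
A custom rule of this kind is a Lambda function that receives the recorded configuration item and reports an evaluation back to AWS Config. The skeleton below shows the general shape of such a function; the compliance check itself is a placeholder, not one of the customer's rules.

# Skeleton of a Lambda-backed custom AWS Config rule. The compliance check is a
# placeholder; a real rule would inspect the configuration item's details.
import json
import boto3

config = boto3.client("config")

def lambda_handler(event, context):
    invoking_event = json.loads(event["invokingEvent"])
    item = invoking_event["configurationItem"]

    # Placeholder check: a real rule would evaluate encryption, tags, and so on.
    compliance = "COMPLIANT"

    config.put_evaluations(
        Evaluations=[{
            "ComplianceResourceType": item["resourceType"],
            "ComplianceResourceId": item["resourceId"],
            "ComplianceType": compliance,
            "OrderingTimestamp": item["configurationItemCaptureTime"],
        }],
        ResultToken=event["resultToken"],
    )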

Data exfiltration prevention

In our solution, VPC Flow Logs are enabled on all accounts to record information about the IP traffic going to and from network interfaces in each VPC. This allows us to watch for abnormal and unexpected outbound connection requests, which could be an indication of attempts to exfiltrate data. Amazon GuardDuty analyzes VPC Flow Logs, AWS CloudTrail event logs, and DNS logs to identify unexpected and potentially malicious activity within the AWS environment. For example, GuardDuty can detect compromised Amazon Elastic Compute Cloud (EC2) instances communicating with known command-and-control servers.

Conclusion

Financial services customers are using AWS to develop machine learning and analytics solutions to solve key business challenges while ensuring security and compliance needs. This post outlined how Amazon SageMaker, along with multiple security services (AWS Config, GuardDuty, KMS), enables building a self-service, secure, and continually compliant data science environment on AWS for a financial service use case.

 

Machine Learning for a Better Developer Experience

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/machine-learning-for-a-better-developer-experience-1e600c69f36c

Stanislav Kirdey, William High

Imagine having to go through 2.5GB of log entries from a failed software build — 3 million lines — to search for a bug or a regression that happened on line 1M. It’s probably not even doable manually! However, one smart approach to make it tractable might be to diff the lines against a recent successful build, with the hope that the bug produces unusual lines in the logs.

Standard md5 diff would run quickly but still produce at least hundreds of thousands of candidate lines to look through because it surfaces character-level differences between lines. Fuzzy diffing using k-nearest neighbors clustering from machine learning (the kind of thing logreduce does) produces around 40,000 candidate lines but takes an hour to complete. Our solution produces 20,000 candidate lines in 20 min of computing — and thanks to the magic of open source, it’s only about a hundred lines of Python code.

The application is a combination of neural embeddings, which encode the semantic information in words and sentences, and locality sensitive hashing, which efficiently assigns approximately nearby items to the same buckets and faraway items to different buckets. Combining embeddings with LSH is a great idea that appears to be less than a decade old.

Note — we used Tensorflow 2.2 on CPU with eager execution for transfer learning and scikit-learn NearestNeighbor for k-nearest-neighbors. There are sophisticated approximate nearest neighbors implementations that would be better for a model-based nearest neighbors solution.

What embeddings are and why we needed them

Assembling a k-hot bag-of-words is a typical (useful!) starting place for deduplication, search, and similarity problems around un- or semi-structured text. This type of bag-of-words encoding looks like a dictionary with individual words and their counts. Here’s what it would look like for the sentence “log in error, check log”.

{“log”: 2, “in”: 1, “error”: 1, “check”: 1}

This encoding can also be represented using a vector where the index corresponds to a word and the value is the count. Here is “log in error, check log” as a vector, where the first entry is reserved for “log” word counts, the second for “in” word counts, and so forth.

[2, 1, 1, 1, 0, 0, 0, 0, 0, …]

Notice that the vector consists of many zeros. Zero-valued entries represent all the other words in the dictionary that were not present in that sentence. The total number of vector entries possible, or dimensionality of the vector, is the size of your language’s dictionary, which is often in the millions or more, though it can be brought down to hundreds of thousands with some clever tricks.

Now let’s look at the dictionary and vector representations of “problem authenticating”. The words corresponding to the first five vector entries do not appear at all in the new sentence.

{“problem”: 1, “authenticating”: 1}
[0, 0, 0, 0, 1, 1, 0, 0, 0, …]

These two sentences are semantically similar, which means they mean essentially the same thing, but lexically are as different as they can be, which is to say they have no words in common. In a fuzzy diff setting, we might want to say that these sentences are too similar to highlight, but md5 and k-hot document encoding with kNN do not support that.
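
To make that concrete, here is a small sketch (using scikit-learn as one reasonable choice, not necessarily the tooling used internally) showing that the two example sentences share no bag-of-words dimensions, so their cosine similarity is zero:

# Sketch: the two example sentences have disjoint bag-of-words vectors, so a
# lexical similarity measure cannot see that they are semantically related.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = ["log in error, check log", "problem authenticating"]
bags = CountVectorizer().fit_transform(sentences)

print(bags.toarray())                 # [[0 1 1 1 2 0], [1 0 0 0 0 1]]
print(cosine_similarity(bags)[0, 1])  # 0.0 -- no words in common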

Erica Sinclair is helping us get LSH probabilities right

Dimensionality reduction uses linear algebra or artificial neural networks to place semantically similar words, sentences, and log lines near to each other in a new vector space, using representations known as embeddings. In our example, “log in error, check log” might have a five-dimensional embedding vector

[0.1, 0.3, -0.5, -0.7, 0.2]

and “problem authenticating” might be

[0.1, 0.35, -0.5, -0.7, 0.2]

These embedding vectors are near to each other by distance measures like cosine similarity, unlike their k-hot bag-of-word vectors. Dense, low dimensional representations are really useful for short documents, like lines of a build or a system log.

In reality, you’d be replacing the thousands or more dictionary dimensions with just 100 information-rich embedding dimensions (not five). State-of-the-art approaches to dimensionality reduction include singular value decomposition of a word co-occurrence matrix (GloVe) and specialized neural networks (word2vec, BERT, ELMo).

Erica Sinclair and locality-preserving hashing

What about clustering? Back to the build log application

We joke internally that Netflix is a log-producing service that sometimes streams videos. We deal with hundreds of thousands of requests per second in the fields of exception monitoring, log processing, and stream processing. Being able to scale our NLP solutions is just a must-have if we want to use applied machine learning in telemetry and logging spaces. This is why we cared about scaling our text deduplication, semantic similarity search, and textual outlier detection — there is no other way if the business problems need to be solved in real-time.

Our diff solution involves embedding each line into a low dimensional vector (and optionally “fine-tuning” or updating the embedding model at the same time), assigning it to a cluster, and identifying lines in different clusters as “different”. Locality sensitive hashing is a probabilistic algorithm that permits constant time cluster assignment and near-constant time nearest neighbors search.

LSH works by mapping a vector representation to a scalar number, or more precisely a collection of scalars. While standard hashing algorithms aim to avoid collisions between any two inputs that are not the same, LSH aims to avoid collisions if the inputs are far apart and promote them if they are different but near to each other in the vector space.

The embedding vector for “log in error, check log” might be mapped to binary number 01 — and 01 then represents the cluster. The embedding vector for “problem authenticating” would with high probability be mapped to the same binary number, 01. This is how LSH enables fuzzy matching, and the inverse problem, fuzzy diffing. Early applications of LSH were over high dimensional bag-of-words vector spaces — we couldn’t think of any reason it wouldn’t work on embedding spaces just as well, and there are signs that others have had the same thought.
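
As a toy illustration of that idea (random-hyperplane LSH over the five-dimensional example embeddings above, not the production implementation), nearby embeddings hash to the same bits with high probability:

# Sketch: random-hyperplane LSH. Each hyperplane contributes one bit; vectors
# on the same side of every hyperplane land in the same bucket.
import numpy as np

rng = np.random.default_rng(42)
hyperplanes = rng.normal(size=(2, 5))    # 2 bits over 5-dimensional embeddings

def lsh_bucket(vec):
    return tuple((hyperplanes @ np.asarray(vec) > 0).astype(int))

log_error = [0.1, 0.3, -0.5, -0.7, 0.2]       # "log in error, check log"
auth_problem = [0.1, 0.35, -0.5, -0.7, 0.2]   # "problem authenticating"

# Near-identical embeddings fall into the same bucket with high probability.
print(lsh_bucket(log_error), lsh_bucket(auth_problem))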

Using LSH to place characters in the same bucket but in the Upside Down.

The work we did on applying LSH and neural embeddings to text outlier detection on build logs now allows an engineer to look through a small fraction of the log’s lines to identify and fix errors in potentially business-critical software, and it also allows us to achieve semantic clustering of almost any log line in real-time.

We now bring this benefit from semantic LSH to every build at Netflix. The semantic part lets us group seemingly dissimilar items based on their meanings and surface them in outlier reports.

Conclusion

The mature state of open source transfer learning data products and SDKs has allowed us to solve semantic nearest neighbor search via LSH in remarkably few lines of code. We became especially interested in investigating the special benefits that transfer learning and fine-tuning might bring to the application. We’re excited to have an opportunity to solve such problems and to help people do what they do better and faster than before.

We hope you’ll consider joining Netflix and becoming one of the stunning colleagues whose life we make easier with machine learning. Inclusion is a core Netflix value and we are particularly interested in fostering a diversity of perspectives on our technical teams. So if you are in analytics, engineering, data science, or any other field and have a background that is atypical for the industry we’d especially like to hear from you!

If you have any questions about opportunities at Netflix, please reach out to the authors on LinkedIn.


Machine Learning for a Better Developer Experience was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Deep learning cat prey detector

Post Syndicated from Ashley Whittaker original https://www.raspberrypi.org/blog/deep-learning-cat-prey-detector/

We’ve all been able to check on our kitties’ outdoor activities for a while now, thanks to motion-activated cameras. And the internet’s favourite cat flap even live-tweets when it senses paws through the door.

A nightvision image of a cat approaching a cat flap with a mouse in its mouth

“Did you already make dinner? I stopped on the way home to pick this up for you.”

But what’s eluded us “owners” of felines up until now is the ability to stop our furry companions from bringing home mauled presents we neither want nor asked for.

A cat flap bouncer powered by deep learning

Now this Raspberry Pi–powered machine learning build, shared by reddit user u/eee_bume, can help us out: at its heart, there's a convolutional neural network cascade that detects whether a cat is trying to enter a cat flap with something in its maw. (No word on how many half-consumed rodents the makers had to dispose of while training the machine learning model.)

The neural network first detects the whole cat in an image; then it homes in on the cat's maw. Image classification is performed to detect whether there is anything in or around the maw. If the network thinks the cat is trying to smuggle caught contraband into the house, it's a "no" from this virtual door bouncer.

The system runs on a Raspberry Pi 4 with an infrared camera at an average detection rate of around 1 FPS. The PC-Val value, the certainty threshold used to decide between the prey and no_prey classes, is 0.5.
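Purely as an illustration of how such a cascade fits together, here's a rough sketch; the stub functions are hypothetical stand-ins for the trained CNN stages, not the makers' actual code:

PREY_THRESHOLD = 0.5     # the prey/no_prey certainty threshold mentioned above

def detect_cat(frame):
    # Stub for stage 1: return a bounding box for the cat, or None if no cat is visible.
    return (0, 0, 100, 100)

def crop_maw(frame, box):
    # Stub for stage 2: crop the region around the cat's mouth.
    return frame

def classify_prey(maw_image):
    # Stub for stage 3: probability (0..1) that there is prey in or around the maw.
    return 0.1

def should_open_flap(frame):
    box = detect_cat(frame)
    if box is None:
        return True                              # no cat at the flap, nothing to block
    prey_prob = classify_prey(crop_maw(frame, box))
    return prey_prob < PREY_THRESHOLD            # deny entry if prey is likely

print(should_open_flap(frame="night-vision frame"))   # True for the stub values above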

The homemade setup, including a small camera, lights and sensors

The infrared camera setup, powered by Raspberry Pi

How to get enough training data

This project formed Nicolas Baumann's and Michael Ganz's spring semester thesis at the Swiss Federal Institute of Technology. One of the problems they ran into while trying to train their device is that cats are only expected to enter the cat flap carrying prey 3% of the time, which leads to a heavily imbalanced classification problem. It would have taken a loooong time if they had just waited for Nicolas and Michael's pets to bring home enough decomposing gifts.

Lots of different cats' faces close up, some with prey in their mouths, some without

The cutest mugshots you ever did see

To get around this, they custom-built a scalable image data gathering network to simplify and maximise the collection of training data. It features multiple distributed Camera Nodes (CN), a centralised main archive, and a custom labeling tool. As a result of the data gathering network, 40GB of training data have been amassed.

What is my cat eating?!

The makers also took the time to train their neural network to classify different types of prey. So far, it recognises mice, lizards, slow-worms, and birds.

Infrared shots of one cat while the camera decides if it has prey in its mouth or not

“Come ooooon, it’s not even a *whole* mouse, let me in!”

It’s still being tweaked, but at the moment the machine learning model correctly detects when a cat has prey in its mouth 93% of the time. But it still falsely accuses kitties 28% of the time. We’ll leave it to you to decide whether your feline companion will stand for that kind of false positive rate, or whether it’s more than your job’s worth.

The post Deep learning cat prey detector appeared first on Raspberry Pi.

Using data science and machine learning for improved customer support

Post Syndicated from Junade Ali original https://blog.cloudflare.com/using-data-science-and-machine-learning-for-improved-customer-support/


In this blog post we'll explore three data science tricks that helped us solve real problems for our customer support group and our customers: two for natural language processing in a customer support context, and one for identifying Internet attack traffic.

Through these examples, we hope to demonstrate how invaluable data processing tricks, visualisations and tools can be before putting data into a machine learning algorithm. By refining data prior to processing, we are able to achieve dramatically improved results without needing to change the underlying machine learning strategies which are used.

Know the Limits (Language Classification)

When browsing a social media site, you may find the site prompts you to translate a post even though it is in your language.

We recently came across a similar problem at Cloudflare when we were looking into language classification for chat support messages. Using an off-the-shelf classification algorithm, users with short messages often had their chats classified incorrectly, and our analysis found a correlation between the length of a message and the accuracy of the classification (measured against the browser Accept-Language header and the languages of the country where the request was submitted).


On a subset of tickets, comparing the classified language against the web browser Accept-Language header, we found there was broad agreement between these two properties. When we considered the languages associated with the user’s country, we found another signal.


In 67% of our sample, we found agreement between these three signals. In 15% of instances the classified language agreed with only the Accept-Language header and in 5% of cases there was only agreement with the languages associated with the user’s country.


We decided the ideal approach was to train a machine learning model that would take all three signals (plus the confidence rate from the language classification algorithm) and use them to make a prediction. By knowing the limits of a given classification algorithm, we were able to develop an approach that helped complement it.

A naive approach may not even need a trained model: simply requiring agreement between two of the three properties (classified language, Accept-Language header and the languages of the user's country) is enough to make a decision about the right language to use.
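A hedged sketch of that naive two-of-three agreement check (the function and inputs are illustrative, not Cloudflare's implementation):

def pick_language(classified, accept_language, country_languages):
    # Return a language when at least two of the three signals agree, else None.
    if classified == accept_language or classified in country_languages:
        return classified
    if accept_language in country_languages:
        return accept_language
    return None    # no agreement: fall back to the classifier's guess or ask the user

# A short "hola" mislabelled as Italian, but the browser and country both suggest Spanish.
print(pick_language("it", "es", {"es", "ca"}))   # -> es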

Hold Your Fire (Fuzzy String Matching)

Fuzzy String Matching is often used in natural language processing when trying to extract information from human text. For example, this can be used for extracting error messages from customer support tickets to do automatic classification. At Cloudflare, we use this as one signal in our natural language processing pipeline for support tickets.

Engineers often use the Levenshtein distance algorithm for string matching; for example, this algorithm is implemented in the Python fuzzywuzzy library. This approach has a high computational overhead (for two strings of length k and l, the algorithm runs in O(k * l) time).

To understand the performance of different string matching algorithms in a customer support context, we compared multiple algorithms (Cosine, Dice, Damerau, LCS and Levenshtein) and measured the true positive rate (TP), false positive rate (FP) and the ratio of false positives to true positives (FP/TP).


We opted for the Cosine algorithm, not just because it outperformed the Levenshtein algorithm, but also because its computational cost is lower, at O(k + l) time. The Cosine similarity algorithm is very simple; it works by representing words or phrases as vectors in a multidimensional vector space, where each unique letter of an alphabet is a separate dimension. The smaller the angle between two vectors, the closer one word or phrase is to the other.
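As a rough illustration of the idea (this is not our production implementation, which is described in the paper referenced below), a letter-frequency cosine similarity can be computed in time linear in the combined string length:

import math
from collections import Counter

def cosine_similarity(a, b):
    # Letter-frequency cosine similarity; runs in O(len(a) + len(b)).
    va, vb = Counter(a.lower()), Counter(b.lower())
    dot = sum(va[ch] * vb[ch] for ch in va)      # only characters present in both contribute
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

print(cosine_similarity("522 connection timed out", "Error 522: connection timed out"))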

The mathematical definitions of each string similarity algorithm and a scientific comparison can be found in our paper: M. Pikies and J. Ali, “String similarity algorithms for a ticket classification system,” 2019 6th International Conference on Control, Decision and Information Technologies (CoDIT), Paris, France, 2019, pp. 36-41. https://doi.org/10.1109/CoDIT.2019.8820497

There were other optimisations we introduced to the fuzzy string matching approach: the similarity threshold is determined by evaluating the true positive and false positive rates on various sample data. We also devised a new tokenization approach for handling phrases and numeric strings, and used the FastText natural language processing library to determine candidate values for fuzzy string matching and to improve overall accuracy. We will share more about these optimisations in a future blog post.


“Beyond it is Another Dimension” (Threat Identification)

Attack alerting is particularly important at Cloudflare – this is useful for both monitoring the overall status of our network and providing proactive support to particular at-risk customers.

DDoS attacks can be characterised at a granular level by a few different features, including differences in request or error rates over a temporal baseline, the relationship between errors and request volumes, and other metrics that indicate attack behaviour. One example of a metric we use to differentiate between a customer under a low-volume attack and a customer experiencing another issue is the relationship between 499 and 5xx HTTP status codes. Cloudflare's network edge returns a 499 status code when the client disconnects before the origin web server has an opportunity to respond, whilst 5xx status codes indicate an error handling the request. Plotting the differential increase in 5xx errors over a baseline on the x-axis against the rate of 499 responses on the y-axis (each point representing a 15-minute interval), we notice a linear correlation between these criteria during a DDoS attack, whilst origin issues typically show an increase in one metric but not the other.


The next question is how this data can be used in more complicated situations – take the following example of identifying a credential stuffing attack in aggregate. We looked at a small number of anonymised data fields for the most prolific attackers of WordPress login portals. The data is based purely on HTTP headers; in total we saw 820 unique IPs targeting 16,248 distinct zones (the IPs were hashed and requests were placed into "buckets" as they were collected). As WordPress returns a HTTP 200 when a login fails and a HTTP 302 on a successful login (redirecting to the login panel), we're able to analyse this just from the status code returned.

On the left-hand chart, the x-axis represents a normalised number of unique zones that are under attack (0 means the attacker is hitting the same site whilst 1 means the attacker is hitting all different sites) and the y-axis represents the success rate (using HTTP status codes to identify the chance of a successful login). The right-hand chart switches the x-axis out for something called the "variety ratio" – this measures the rate of abnormal 4xx/5xx HTTP status codes (i.e. firewall blocks, rate limiting HTTP headers or 5xx status codes). We see clusters beginning to form on both charts.


By plotting the data in three dimensions, with all three fields represented, the clusters become much clearer. These clusters are then grouped using an unsupervised clustering algorithm (agglomerative hierarchical clustering).
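As a hedged illustration of that grouping step (the sample values and cluster count below are invented; the feature columns simply mirror the three fields described above), scikit-learn's agglomerative clustering could be applied like this:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# One row per attacking IP bucket: [zone spread, login success rate, variety ratio].
# The numbers are invented purely to show the shape of the data.
features = np.array([
    [0.95, 0.02, 0.10],
    [0.93, 0.03, 0.12],
    [0.10, 0.40, 0.80],
    [0.12, 0.38, 0.75],
])

labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(features)
print(labels)    # e.g. [0 0 1 1] – the buckets grouped into two attack campaigns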


Cluster 1 has 99.45% of requests from the same country and 99.45% from the same User-Agent. The clustering approach shows its real value on the other clusters, however – for example, Cluster 0 had 89% of requests coming from three User-Agents (75%, 12.3% and 1.7%, respectively). By using this approach we are able to correlate such attacks together even when they would be hard to identify on a request-to-request basis (as they are being made from different IPs and with different request headers). Such strategies allow us to fingerprint attacks regardless of whether attackers continuously change how they make these requests to us.

By aggregating data together and then representing it in multiple dimensions, we are able to gain visibility into the data that would ordinarily not be possible on a request-to-request basis. In product-level functionality it is often important to make decisions on a per-request basis ("should this request be challenged whilst this one is allowed?"), but by looking at the data in aggregate we are able to focus on the interesting clusters and provide alerting systems which identify anomalies. Performing this in multiple dimensions provides the tools to reduce false positives dramatically.

Conclusion

From natural language processing to intelligent threat fingerprinting, using data science techniques has improved our ability to build new functionality. Recently, new machine learning approaches and strategies have been designed to process this data more efficiently and effectively; however, preprocessing of data remains a vital tool for doing this. When seeking to optimise data processing pipelines, it often helps to look not just at the tools being used, but also the input and structure of the data you seek to process.

If you’re interested in using data science techniques to identify threats on a large scale network, we’re hiring for Support Engineers (including Security Operations, Technical Support and Support Operations Engineering) in San Francisco, Austin, Champaign, London, Lisbon, Munich and Singapore.

Learning AI at school — a peek into the black box

Post Syndicated from Sue Sentance original https://www.raspberrypi.org/blog/research-seminar-learning-ai-at-school/

“In the near future, perhaps sooner than we think, virtually everyone will need a basic understanding of the technologies that underpin machine learning and artificial intelligence.” — from the 2018 Informatics Europe & EUACM report about machine learning

As the quote above highlights, AI and machine learning (ML) are increasingly affecting society and will continue to change the landscape of work and leisure — with a huge impact on young people in the early stages of their education.

But how are we preparing our young people for this future? What skills do they need, and how do we teach them these skills? This was the topic of last week’s online research seminar at the Raspberry Pi Foundation, with our guest speaker Juan David Rodríguez Garcia. Juan’s doctoral studies around AI in school complement his work at the Ministry of Education and Vocational Training in Spain.

Juan David Rodríguez Garcia

Juan’s LearningML tool for young people

Juan started his presentation by sharing numerous current examples of AI and machine learning, which young people can easily relate to and be excited to engage with, and which will bring up ethical questions that we need to be discussing with them.

Of course, it’s not enough for learners to be aware of AI applications. While machine learning is a complex field of study, we need to consider what aspects of it we can make accessible to young people to enable them to learn about the concepts, practices, and skills underlying it. During his talk Juan demonstrated a tool called LearningML, which he has developed as a practical introduction to AI for young people.

Screenshot of a demo of Juan David Rodríguez Garcia's LearningML tool

Juan demonstrates image recognition with his LearningML tool

LearningML takes inspiration from some of the other in-development tools around machine learning for children, such as Machine Learning for Kids, and it is available in one integrated platform. Juan gave an enticing demo of the tool, showing how to use visual image data (lots of pictures of Juan with hats, glasses on, etc.) to train and test a model. He then demonstrated how to use Scratch programming to also test the model and apply it to new data. The seminar audience was very positive about LearningML, and of course we'd like it translated into English!

Juan’s talk generated many questions from the audience, from technical questions to the key question of the way we use the tool to introduce children to bias in AI. Seminar participants also highlighted opportunities to bring machine learning to other school subjects such as science.

AI in schools — what and how to teach

Machine learning demonstrates that computers can learn from data. This is just one of the five big ideas in AI that the AI4K12 group has identified for teaching AI in school in order to frame this broad domain:

  1. Perception: Computers perceive the world using sensors
  2. Representation & reasoning: Agents maintain models/representations of the world and use them for reasoning
  3. Learning: Computers can learn from data
  4. Natural interaction: Making agents interact comfortably with humans is a substantial challenge for AI developers
  5. Societal impact: AI applications can impact society in both positive and negative ways

One general concern I have is that in our teaching of computing in school (if we touch on AI at all), we may only focus on the fifth of the ‘big AI ideas’: the implications of AI for society. Being able to understand the ethical, economic, and societal implications of AI as this technology advances is indeed crucial. However, the principles and skills underpinning AI are also important, and how we introduce these at an age-appropriate level remains a significant question.

Illustration of AI, Image by Seanbatty from Pixabay

There are some great resources for developing a general understanding of AI principles, including unplugged activities from Computer Science For Fun. Yet there’s a large gap between understanding what AI is and has the potential to do, and actually developing the highly mathematical skills to program models. It’s not an easy issue to solve, but Juan’s tool goes a little way towards this. At the Raspberry Pi Foundation, we’re also developing resources to bridge this educational gap, including new online projects building on our existing machine learning projects, and an online course. Watch this space!

AI in the school curriculum and workforce

All in all, we seem to be a long way off introducing AI into the school curriculum. Looking around the world, in the USA, Hong Kong, and Australia there have been moves to introduce AI into K-12 education through pilot initiatives, and hopefully more will follow. In England, with a computing curriculum that was written in 2013, there is no requirement to teach any AI or machine learning, or even to focus much on data.

Let’s hope England doesn’t get left too far behind, as there is a massive AI skills shortage, with millions of workers needing to be retrained in the next few years. Moreover, a recent House of Lords report outlines that introducing all young people to this area of computing also has the potential to improve diversity in the workforce — something we should all be striving towards.

We look forward to hearing more from Juan and his colleagues as this important work continues.

Next up in our seminar series

If you missed the seminar, you can find Juan’s presentation slides and a recording of his talk on our seminars page.

In our next seminar on Tuesday 2 June at 17:00–18:00 BST / 12:00–13:00 EDT / 9:00–10:00 PDT / 18:00–19:00 CEST, we’ll welcome Dame Celia Hoyles, Professor of Mathematics Education at University College London. Celia will be sharing insights from her research into programming and mathematics. To join the seminar, simply sign up with your name and email address and we’ll email the link and instructions. If you attended Juan’s seminar, the link remains the same.

The post Learning AI at school — a peek into the black box appeared first on Raspberry Pi.

Introducing the Well-Architected Framework for Machine Learning

Post Syndicated from Shelbee Eigenbrode original https://aws.amazon.com/blogs/architecture/introducing-the-well-architected-framework-for-machine-learning/

We have published a new whitepaper, Machine Learning Lens, to help you design your machine learning (ML) workloads following cloud best practices. This whitepaper gives you an overview of the iterative phases of ML and introduces you to the ML and artificial intelligence (AI) services available on AWS using scenarios and reference architectures.

How often are you asking yourself “Am I doing this right?” when building and running applications in the cloud? You are not alone. To help you answer this question, we released the AWS Well-Architected Framework in 2015. The Framework is a formal approach for comparing your workload against AWS best practices and getting guidance on how to improve it.

We added “lenses” in 2017 that extended the Framework beyond its general perspective to provide guidance for specific technology domains, such as the Serverless Applications Lens, High Performance Computing (HPC) Lens, and IoT (Internet of Things) Lens. The new Machine Learning Lens further extends that guidance to include best practices specific to machine learning workloads.

A typical question we often hear is: “Should I use both the Lens whitepaper and the AWS Well-Architected Framework whitepaper when reviewing a workload?” Yes! We purposely exclude topics covered by the Framework in the Lens whitepapers. To fully evaluate your workload, use the Framework along with any applicable Lens whitepapers.

Applying the Machine Learning Lens

In the Machine Learning Lens, we focus on how to design, deploy, and architect your machine learning workloads in the AWS Cloud. The whitepaper starts by describing the general design principles for ML workloads. We then discuss the design principles for each of the five pillars of the Framework—operational excellence, security, reliability, performance efficiency, and cost optimization—as they relate to ML workloads.

Although the Lens is written to provide guidance across all pillars, each pillar is designed to be consumable by itself. This design results in some intended redundancy across the pillars. The following figure shows the structure of the whitepaper.


The primary components of the AWS Well-Architected Framework are Pillars, Design Principles, Questions, and Best Practices. These components are illustrated in the following figure and are outlined in the AWS Well-Architected Framework whitepaper.


The Machine Learning Lens follows this pattern, with Design Principles, Questions, and Best Practices tailored for machine learning workloads. Each pillar has a set of questions, mapped to the design principles, which drives best practices for ML workloads.


To review your ML workloads, start by answering the questions in each pillar. Identify opportunities for improvement as well as critical items for remediation. Then make a prioritized plan to address the improvement opportunities and remediation items.

We recommend that you regularly evaluate your workloads. Use the Lens throughout the lifecycle of your ML workload—not just at the end when you are about to go into production. When used during the design and implementation phase, the Machine Learning Lens can help you identify potential issues early, so that you have more time to address them.

Available now

The Machine Learning Lens is available now. Start using it today to review your existing workloads or to guide you in building new ones. Use the Lens to ensure that your ML workloads are architected with operational excellence, security, reliability, performance efficiency, and cost optimization in mind.

Cloudflare Bot Management: machine learning and more

Post Syndicated from Alex Bocharov original https://blog.cloudflare.com/cloudflare-bot-management-machine-learning-and-more/


Introduction


Building the Cloudflare Bot Management platform is an exhilarating experience. It blends Distributed Systems, Web Development, Machine Learning, Security and Research (and every discipline in between) while fighting ever-adaptive and motivated adversaries at the same time.

This is the ongoing story of Bot Management at Cloudflare and also an introduction to a series of blog posts about the detection mechanisms powering it. I’ll start with several definitions from the Bot Management world, then introduce the product and technical requirements, leading to an overview of the platform we’ve built. Finally, I’ll share details about the detection mechanisms powering our platform.

Let’s start with Bot Management’s nomenclature.

Some Definitions

Bot – an autonomous program on a network that can interact with computer systems or users, imitating or replacing a human user’s behavior, performing repetitive tasks much faster than human users could.

Good bots – bots which are useful to businesses they interact with, e.g. search engine bots like Googlebot, Bingbot or bots that operate on social media platforms like Facebook Bot.

Bad bots – bots which are designed to perform malicious actions, ultimately hurting businesses, e.g. credential stuffing bots, third-party scraping bots, spam bots and sneakerbots.


Bot Management – blocking undesired or malicious Internet bot traffic while still allowing useful bots to access web properties by detecting bot activity, discerning between desirable and undesirable bot behavior, and identifying the sources of the undesirable activity.

WAF – a security system that monitors and controls network traffic based on a set of security rules.

Gathering requirements

Cloudflare has been stopping malicious bots from accessing websites or misusing APIs from the very beginning, at the same time helping the climate by offsetting the carbon costs from the bots. Over time it became clear that we needed a dedicated platform which would unite different bot fighting techniques and streamline the customer experience. In designing this new platform, we tried to fulfill the following key requirements.

  • Complete, not complex – customers can turn on/off Bot Management with a single click of a button, to protect their websites, mobile applications, or APIs.
  • Trustworthy – customers want to know whether they can trust the website visitor is who they say they are and provide a certainty indicator for that trust level.
  • Flexible – customers should be able to define what subset of the traffic Bot Management mitigations should be applied to, e.g. only login URLs, pricing pages or sitewide.
  • Accurate – Bot Management detections should have a very small error, e.g. none or very few human visitors ever should be mistakenly identified as bots.
  • Recoverable – in case a wrong prediction was made, human visitors still should be able to access websites as well as good bots being let through.

Moreover, the goal for the new Bot Management product was to make it work well across a broad range of customer use cases.


Technical requirements

In addition to the product requirements above, we engineers had a list of must-haves for the new Bot Management platform. The most critical were:

  • Scalability – the platform should be able to calculate a score on every request, even at over 10 million requests per second.
  • Low latency – detections must be performed extremely quickly, not slowing down request processing by more than 100 microseconds, and not requiring additional hardware.
  • Configurability – it should be possible to configure what detections are applied on what traffic, including on per domain/data center/server level.
  • Modifiability – the platform should be easily extensible with more detection mechanisms, different mitigation actions, richer analytics and logs.
  • Security – no sensitive information from one customer should be used to build models that protect another customer.
  • Explainability & debuggability – we should be able to explain and tune predictions in an intuitive way.

Equipped with these requirements, back in 2018, our small team of engineers got to work to design and build the next generation of Cloudflare Bot Management.

Meet the Score

“Simplicity is the ultimate sophistication.”
– Leonardo Da Vinci

Cloudflare operates on a vast scale. At the time of this writing, this means covering 26M+ Internet properties, processing on average 11M requests per second (with peaks over 14M), and examining more than 250 request attributes from different protocol levels. The key question is how to harness the power of such “gargantuan” data to protect all of our customers from modern day cyberthreats in a simple, reliable and explainable way?

Bot management is hard. Some bots are much harder to detect and require looking at multiple dimensions of request attributes over a long time, and sometimes a single request attribute could give them away. More signals may help, but are they generalizable?

When we classify traffic, should customers decide what to do with it or are there decisions we can make on behalf of the customer? What concept could possibly address all these uncertainty problems and also help us to deliver on the requirements from above?

As you might’ve guessed from the section title, we came up with the concept of Trusted Score or simply The Scoreone thing to rule them all – indicating the likelihood between 0 and 100 whether a request originated from a human (high score) vs. an automated program (low score).

“One Ring to rule them all” by idreamlikecrazy, used under CC BY / Desaturated from original

Okay, let’s imagine that we are able to assign such a score on every incoming HTTP/HTTPS request, what are we or the customer supposed to do with it? Maybe it’s enough to provide such a score in the logs. Customers could then analyze them on their end, find the most frequent IPs with the lowest scores, and then use the Cloudflare Firewall to block those IPs. Although useful, such a process would be manual, prone to error and most importantly cannot be done in real time to protect the customer’s Internet property.

Fortunately, around the same time we started working on this system, our colleagues from the Firewall team had just announced Firewall Rules. This new capability provided customers the ability to control requests in a flexible and intuitive way, inspired by the widely known Wireshark® language. Firewall rules supported a variety of request fields, and we thought – why not have the score be one of these fields? Customers could then write granular rules to block very specific attack types. That's how the cf.bot_management.score field was born.

Having a score at the heart of Cloudflare Bot Management addressed multiple product and technical requirements in one stroke – it's simple, flexible, configurable, and it provides customers with telemetry about bots on a per-request basis. Customers can adjust the score threshold in firewall rules, depending on their sensitivity to false positives/negatives. Additionally, this intuitive score allows us to extend our detection capabilities under the hood without customers needing to adjust any configuration.
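For example, a customer could pair a Challenge or Block action with an expression like the following (an illustrative rule; the threshold of 30 is arbitrary and would be tuned to each site's tolerance for false positives):

(cf.bot_management.score lt 30 and not cf.bot_management.verified_bot)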

So how can we produce this score and how hard is it? Let’s explore it in the following section.

Architecture overview

What is powering the Bot Management score? The short answer is a set of microservices. Building this platform, we tried to re-use as many pipelines, databases and components as we could; however, many services had to be built from scratch. Let's have a look at the overall architecture (this overly simplified version contains only the Bot Management related services).


Core Bot Management services

In a nutshell our systems process data received from the edge data centers, produce and store data required for bot detection mechanisms using the following technologies:

  • Databases & data stores – Kafka, ClickHouse, Postgres, Redis, Ceph.
  • Programming languages – Go, Rust, Python, Java, Bash.
  • Configuration & schema management – Salt, Quicksilver, Cap’n Proto.
  • Containerization – Docker, Kubernetes, Helm, Mesos/Marathon.

Each of these services is built with resilience, performance, observability and security in mind.

Edge Bot Management module

All bot detection mechanisms are applied on every request in real-time during the request processing stage in the Bot Management module running on every machine at Cloudflare’s edge locations. When a request comes in we extract and transform the required request attributes and feed them to our detection mechanisms. The Bot Management module produces the following output:

Firewall fields:

  • cf.bot_management.score – an integer between 0 and 100 indicating the likelihood that a request originated from an automated program (low score) rather than a human (high score).
  • cf.bot_management.verified_bot – a boolean indicating whether the request comes from a Cloudflare whitelisted bot.
  • cf.bot_management.static_resource – a boolean indicating whether the request matches file extensions for many types of static resources.

Cookies – most notably cf_bm, which helps manage incoming traffic that matches criteria associated with bots.

JS challenges – for some of our detections and customers we inject invisible JavaScript challenges, providing us with more signals for bot detection.

Detection logs – we log details about each applied detection, the features used and flags, through our data pipelines to ClickHouse; some of these are used for analytics and customer logs, while others are used to debug and improve our models.

Once the Bot Management module has produced the required fields, the Firewall takes over the actual bot mitigation.

Firewall integration

The Cloudflare Firewall’s intuitive dashboard enables users to build powerful rules through easy clicks and also provides Terraform integration. Every request to the firewall is inspected against the rule engine. Suspicious requests can be blocked, challenged or logged as per the needs of the user while legitimate requests are routed to the destination, based on the score produced by the Bot Management module and the configured threshold.


Firewall rules provide the following bot mitigation actions:

  • Log – records matching requests in the Cloudflare Logs provided to customers.
  • Bypass – allows customers to dynamically disable Cloudflare security features for a request.
  • Allow – matching requests are exempt from challenge and block actions triggered by other Firewall Rules content.
  • Challenge (Captcha) – useful for ensuring that the visitor accessing the site is human, and not automated.
  • JS Challenge – useful for ensuring that bots and spam cannot access the requested resource; browsers, however, are free to satisfy the challenge automatically.
  • Block – matching requests are denied access to the site.

Our Firewall Analytics tool, powered by ClickHouse and the GraphQL API, enables customers to quickly identify and investigate security threats using an intuitive interface. In addition to analytics, we provide detailed logs on all bot-related activity using either the Logpull API and/or LogPush, which provides an easy way to get your logs to your cloud storage.

Cloudflare Workers integration

If a customer wants more flexibility in what to do with requests based on the score – for example, injecting new or changing existing HTML page content, serving incorrect data to the bots, or stalling certain requests – Cloudflare Workers provide an option to do that. For example, using this small code snippet, we can pass the score back to the origin server for more advanced real-time analysis or mitigation:

addEventListener('fetch', event => {
  event.respondWith(handleRequest(event.request))
})

// Forward the request to the origin with the bot score attached as a header
async function handleRequest(request) {
  // Create a mutable copy of the incoming request so its headers can be modified
  request = new Request(request);

  request.headers.set("Cf-Bot-Score", request.cf.bot_management.score)

  return fetch(request);
}

Now let’s have a look into how a single score is produced using multiple detection mechanisms.

Detection mechanisms


The Cloudflare Bot Management platform currently uses five complementary detection mechanisms, producing their own scores, which we combine to form the single score going to the Firewall. Most of the detection mechanisms are applied on every request, while some are enabled on a per customer basis to better fit their needs.


Having a score on every request for every customer has the following benefits:

  • Ease of onboarding – even before we enable Bot Management in active mode, we’re able to tell how well it’s going to work for the specific customer, including providing historical trends about bot activity.
  • Feedback loop – availability of the score on every request along with all features has tremendous value for continuous improvement of our detection mechanisms.
  • Ensures scaling – if we can compute the score for every request and customer, it means that every Internet property behind Cloudflare is a potential Bot Management customer.
  • Global bot insights – Cloudflare is sitting in front of more than 26M+ Internet properties, which allows us to understand and react to the tectonic shifts happening in security and threat intelligence over time.

Overall, more than a third of the Internet traffic visible to Cloudflare comes from bad bots, while Bot Management customers see an even higher ratio of bad bots, at ~43%!


Let’s dive into specific detection mechanisms in chronological order of their integration with Cloudflare Bot Management.

Machine learning

The majority of decisions about the score are made using our machine learning models. These were also the first detection mechanisms to produce a score and to on-board customers back in 2018. The successful application of machine learning requires data high in Quantity, Diversity, and Quality, and thanks to both free and paid customers, Cloudflare has all three, enabling continuous learning and improvement of our models for all of our customers.

At the core of the machine learning detection mechanism is CatBoost  – a high-performance open source library for gradient boosting on decision trees. The choice of CatBoost was driven by the library’s outstanding capabilities:

  • Categorical features support – allowing us to train on even very high cardinality features.
  • Superior accuracy – allowing us to reduce overfitting by using a novel gradient-boosting scheme.
  • Inference speed – in our case it takes less than 50 microseconds to apply any of our models, making sure request processing stays extremely fast.
  • C and Rust API – most of our business logic on the edge is written using Lua, more specifically LuaJIT, so having a compatible FFI interface to be able to apply models is fantastic.

There are multiple CatBoost models running on Cloudflare's edge in shadow mode on every request on every machine. One of the models runs in active mode, which influences the final score going to the Firewall. All ML detection results and features are logged and recorded in ClickHouse for further analysis, model improvement, analytics and customer-facing logs. We feed both categorical and numerical features into our models; they are extracted from request attributes and from inter-request features built on top of those attributes, calculated and delivered by the Gagarin inter-request features platform.
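A minimal sketch of training and scoring with CatBoost is shown below; the features, data, labels and score mapping are invented for illustration and are not our production models:

from catboost import CatBoostClassifier, Pool

# Toy request features: [user_agent, country, requests_per_minute]; labels 1 = bot, 0 = human.
rows = [
    ["curl/7.64.1", "US", 600],
    ["Mozilla/5.0 (Windows NT 10.0)", "DE", 3],
    ["python-requests/2.23", "US", 900],
    ["Mozilla/5.0 (Macintosh)", "FR", 5],
]
labels = [1, 0, 1, 0]
cat_features = [0, 1]                            # columns holding categorical values

model = CatBoostClassifier(iterations=100, depth=4, verbose=False)
model.fit(Pool(rows, labels, cat_features=cat_features))

# Map the bot probability to a 1-99 "how human" score (a simplified stand-in for the real mapping).
bot_prob = model.predict_proba(Pool([["curl/7.64.1", "GB", 450]], cat_features=cat_features))[0][1]
print(max(1, round((1 - bot_prob) * 99)))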

We’re able to deploy new ML models in a matter of seconds using an extremely reliable and performant Quicksilver configuration database. The same mechanism can be used to configure which version of an ML model should be run in active mode for a specific customer.

A deep dive into our machine learning detection mechanism deserves a blog post of its own. It will cover how we train and validate our models on trillions of requests using GPUs, how model feature delivery and extraction works, and how we explain and debug model predictions both internally and externally.

Heuristics engine

Not all problems in the world are best solved with machine learning. We can tweak the ML models in various ways, but in certain cases they will likely underperform basic heuristics. Often the problems machine learning is trying to solve are not entirely new. When building the Bot Management solution it became apparent that sometimes a single attribute of the request could give a bot away. This means that we can create a bunch of simple rules capturing bots in a straightforward way, while also ensuring the lowest false positive rate.

The heuristics engine was the second detection mechanism integrated into the Cloudflare Bot Management platform in 2019, and it's also applied on every request. We have multiple heuristic types and hundreds of specific rules based on certain attributes of the request, some of which are very hard to spoof. When a request matches any of the heuristics, we assign the lowest possible score of 1.
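To make the idea concrete, here is a sketch of what such an engine looks like in spirit; the example rules below are invented for illustration and are not our actual heuristics:

LOWEST_SCORE = 1    # assigned whenever any heuristic matches

def heuristic_score(request):
    # Return 1 on a hard rule match, or None to defer to the other detection mechanisms.
    ua = request.get("user_agent", "")
    if ua == "":                                  # empty User-Agent
        return LOWEST_SCORE
    if "HeadlessChrome" in ua:                    # headless browser announcing itself
        return LOWEST_SCORE
    if request.get("http_version") == "HTTP/1.0" and "Chrome" in ua:
        return LOWEST_SCORE                       # modern-browser UA over an ancient protocol
    return None

print(heuristic_score({"user_agent": "Mozilla/5.0 HeadlessChrome/83.0", "http_version": "HTTP/2"}))   # 1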

The engine has the following properties:

  • Speed – if ML model inference takes less than 50 microseconds per model, hundreds of heuristics can be applied in just under 20 microseconds!
  • Deployability – the heuristics engine allows us to add a new heuristic in a matter of seconds using Quicksilver, and it will be applied on every request.
  • Vast coverage – using a set of simple heuristics allows us to classify ~15% of global traffic and ~30% of Bot Management customers’ traffic as bots. Not too bad for a few if conditions, right?
  • Lowest false positives – because we are very sure of, and conservative about, the heuristics we add, this detection mechanism has the lowest FP rate among all detection mechanisms.
  • Labels for ML – because of this high certainty, we use requests classified with heuristics to train our ML models, which can then generalize the behavior learnt from heuristics and improve detection accuracy.

So heuristics, combined with machine learning, gave us a real lift; they captured a lot of the intuition about bots, which helped to advance the Cloudflare Bot Management platform and allowed us to onboard more customers.

Behavioral analysis

Machine learning and heuristics detections provide tremendous value, but both of them require human input on the labels, or basically a teacher to distinguish between right and wrong. While our supervised ML models can generalize well enough even on novel threats similar to what we taught them on, we decided to go further. What if there was an approach which doesn’t require a teacher, but rather can learn to distinguish bad behavior from the normal behavior?

Enter the behavioral analysis detection mechanism, initially developed in 2018 and integrated with the Bot Management platform in 2019. This is an unsupervised machine learning approach, which has the following properties:

  • Fitting specific customer needs – it’s automatically enabled for all Bot Management customers, calculating and analyzing normal visitor behavior over an extended period of time.
  • Detects bots never seen before – as it doesn’t use known bot labels, it can detect bots and anomalies from the normal behavior on specific customer’s website.
  • Harder to evade – anomalous behavior is often a direct result of the bot’s specific goal.
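We won't go into the specific algorithms here; purely as an illustration of flagging anomalies without labels, the sketch below uses an isolation forest over invented per-visitor features:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# 500 "normal" sessions for one site: pages per session, seconds between requests,
# fraction of requests carrying a Referer header. All values are invented.
normal = np.column_stack([
    rng.poisson(5, 500),
    rng.normal(9.0, 2.0, 500),
    rng.uniform(0.7, 1.0, 500),
])

model = IsolationForest(contamination=0.01, random_state=0).fit(normal)

suspect = np.array([[400, 0.2, 0.0]])    # rapid-fire crawl of hundreds of pages, no referers
print(model.predict(suspect))            # -1 = anomalous, 1 = looks like normal behaviour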

Please stay tuned for a more detailed blog about behavioral analysis models and the platform powering this incredible detection mechanism, protecting many of our customers from unseen attacks.

Verified bots

So far we’ve discussed how to detect bad bots and humans. What about good bots, some of which are extremely useful for the customer website? Is there a need for a dedicated detection mechanism or is there something we could use from previously described detection mechanisms? While the majority of good bot requests (e.g. Googlebot, Bingbot, LinkedInbot) already have low score produced by other detection mechanisms, we also need a way to avoid accidental blocks of useful bots. That’s how the Firewall field cf.bot_management.verified_bot came into existence in 2019, allowing customers to decide for themselves whether they want to let all of the good bots through or restrict access to certain parts of the website.

The actual platform calculating the Verified Bot flag deserves a detailed blog of its own, but in a nutshell it has the following properties:

  • Validator based approach – we support multiple validation mechanisms, each of them allowing us to reliably confirm good bot identity by clustering a set of IPs.
  • Reverse DNS validator – performs a reverse DNS check to determine whether or not a bot’s IP address matches its alleged hostname.
  • ASN Block validator – similar to rDNS check, but performed on ASN block.
  • Downloader validator – collects good bot IPs from either text files or HTML pages hosted on bot owner sites.
  • Machine learning validator – uses an unsupervised learning algorithm, clustering good bot IPs which are not possible to validate through other means.
  • Bots Directory – a database with UI that stores and manages bots that pass through the Cloudflare network.

Bots directory UI sample
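As one concrete illustration of the validator approach, a forward-confirmed reverse DNS check in the spirit of the Reverse DNS validator might look like the sketch below; the allowed hostname suffixes are examples published by the bot operators, not our validator's list:

import socket

ALLOWED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")   # examples only

def verified_by_rdns(ip):
    # Forward-confirmed reverse DNS: reverse-resolve the IP, check the hostname suffix,
    # then resolve the hostname forwards and make sure the original IP is in the answer.
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
        if not hostname.endswith(ALLOWED_SUFFIXES):
            return False
        _, _, forward_ips = socket.gethostbyname_ex(hostname)
        return ip in forward_ips
    except OSError:
        return False

print(verified_by_rdns("66.249.66.1"))   # expected True for a genuine Googlebot address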

Using multiple validation methods listed above, the Verified Bots detection mechanism identifies hundreds of unique good bot identities, belonging to different companies and categories.

JS fingerprinting

When it comes to Bot Management detection quality it’s all about the signal quality and quantity. All previously described detections use request attributes sent over the network and analyzed on the server side using different techniques. Are there more signals available, which can be extracted from the client to improve our detections?

As a matter of fact there are plenty, as every browser has unique implementation quirks. Every web browser’s graphics output, such as canvas rendering, depends on multiple layers, including hardware (GPU) and software (drivers, operating system rendering). This highly unique output allows precise differentiation between different browser/device types. Moreover, this is achievable without sacrificing website visitor privacy: it’s not a supercookie, and it cannot be used to track and identify individual users, but only to confirm that a request’s user agent matches other telemetry gathered through the browser canvas API.

This detection mechanism is implemented as a challenge-response system, with the challenge injected into the webpage at Cloudflare’s edge. The challenge is then rendered in the background using the provided graphic instructions, and the result is sent back to Cloudflare for validation and further action, such as producing the score. There is a lot going on behind the scenes to make sure we get reliable results without sacrificing users’ privacy while staying tamper resistant to replay attacks. The system is currently in private beta and being evaluated for its effectiveness, and we already see very promising results. Stay tuned for this new detection mechanism becoming widely available and the blog on how we’ve built it.

This concludes an overview of the five detection mechanisms we’ve built so far. It’s time to sum it all up!

Summary

Cloudflare has the unique ability to collect data from trillions of requests flowing through its network every week. With this data, Cloudflare is able to identify likely bot activity with Machine Learning, Heuristics, Behavioral Analysis, and other detection mechanisms. Cloudflare Bot Management integrates seamlessly with other Cloudflare products, such as WAF  and Workers.


All this could not be possible without hard work across multiple teams! First of all thanks to everybody on the Bots Team for their tremendous efforts to make this platform come to life. Other Cloudflare teams, most notably: Firewall, Data, Solutions Engineering, Performance, SRE, helped us a lot to design, build and support this incredible platform.

Bots team during Austin team summit 2019 hunting bots with axes 🙂

Lastly, there are more blogs from the Bots series coming soon, diving into internals of our detection mechanisms, so stay tuned for more exciting stories about Cloudflare Bot Management!

Building a Scalable Document Pre-Processing Pipeline

Post Syndicated from Joel Knight original https://aws.amazon.com/blogs/architecture/building-a-scalable-document-pre-processing-pipeline/

In a recent customer engagement, Quantiphi, Inc., a member of the Amazon Web Services Partner Network, built a solution capable of pre-processing tens of millions of PDF documents before sending them for inference by a machine learning (ML) model. While the customer’s use case—and hence the ML model—was very specific to their needs, the pipeline that does the pre-processing of documents is reusable for a wide array of document processing workloads. This post will walk you through the pre-processing pipeline architecture.


Architectural goals

Quantiphi established the following goals prior to starting:

  • Loose coupling to enable independent scaling of compute components, flexible selection of compute services, and agility as the customer’s requirements evolved.
  • Work backwards from business requirements when making decisions affecting scale and throughput and not simply because “fastest is best.” Scale components only where it makes sense and for maximum impact.
  •  Log everything at every stage to enable troubleshooting when something goes wrong, provide a detailed audit trail, and facilitate cost optimization exercises by identifying usage and load of every compute component in the architecture.

Document ingestion

The documents are initially stored in a staging bucket in Amazon Simple Storage Service (Amazon S3). The processing pipeline is kicked off when the “trigger” AWS Lambda function is called. This Lambda function passes parameters such as the name of the staging S3 bucket and the path(s) within the bucket which are to be processed to the “ingestion app.”

The ingestion app is a simple application that runs a web service to enable triggering a batch and lists documents from the S3 bucket path(s) received via the web service. As the app processes the list of documents, it feeds the document path, S3 bucket name, and some additional metadata to the “ingest” Amazon Simple Queue Service (Amazon SQS) queue. The ingestion app also starts the audit trail for the document by writing a record to the Amazon Aurora database. As the document moves downstream, additional records are added to the database. Records are joined together by a unique ID that is assigned to each document by the ingestion app and passed along throughout the pipeline.

Chunking the documents

In order to maximize grip and control, the architecture is built to submit single-page files to the ML model. This enables correlating an inference failure to a specific page instead of a whole document (which may be many pages long). It also makes identifying the location of features within the inference results an easier task. Since the documents being processed can have varied sizes, resolutions, and page count, a big part of the pre-processing pipeline is to chunk a document up into its component pages prior to sending it for inference.

The “chunking orchestrator” app repeatedly pulls a message from the ingest queue and retrieves the document named therein from the S3 bucket. The PDF document is then classified along two metrics:

  • File size
  • Number of pages

We use these metrics to determine which chunking queue the document is sent to:

  • Large: Greater than 10MB in size or greater than 10 pages
  • Small: Less than or equal to 10MB and less than or equal to 10 pages
  • Single page: Less than or equal to 10MB and exactly one page
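The routing logic itself is simple; the sketch below illustrates it (the queue names are placeholders, not the ones used in the actual pipeline):

TEN_MB = 10 * 1024 * 1024

def choose_queue(size_bytes, page_count):
    # Route a document to a chunking queue using the size and page-count thresholds above.
    if size_bytes > TEN_MB or page_count > 10:
        return "large-documents-queue"
    if page_count == 1:                 # single page is checked before the general "small" case
        return "single-page-queue"
    return "small-documents-queue"

print(choose_queue(2 * 1024 * 1024, 1))     # single-page-queue
print(choose_queue(50 * 1024 * 1024, 3))    # large-documents-queue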

Each of these queues is serviced by an appropriately sized compute service that breaks the document down into smaller pieces, and ultimately, into individual pages.

  • Amazon Elastic Compute Cloud (EC2) processes large documents, primarily because of the high memory footprint needed to read large, multi-gigabyte PDF files into memory. The output from these workers are smaller PDF documents that are stored in Amazon S3. The name and location of these smaller documents is submitted to the “small documents” queue.
  • Small documents are processed by a Lambda function that decomposes the document into single pages that are stored in Amazon S3. The name and location of these single page files is sent to the “single page” queue.

The Dead Letter Queues (DLQs) are used to hold messages from their respective size queue which are not successfully processed. If messages start landing in the DLQs, it’s an indication that there is a problem in the pipeline. For example, if messages start landing in the “small” or “single page” DLQ, it could indicate that the Lambda function processing those respective queues has reached its maximum run time.

An Amazon CloudWatch Alarm monitors the depth of each DLQ. Upon seeing DLQ activity, a notification is sent via Amazon Simple Notification Service (Amazon SNS) so an administrator can then investigate and make adjustments such as tuning the sizing thresholds to ensure the Lambda functions can finish before reaching their maximum run time.

In order to ensure no documents are left behind in the active run, there is a failsafe in the form of an Amazon EC2 worker that retrieves and processes messages from the DLQs. This failsafe app breaks a PDF all the way down into individual pages and then does image conversion.

For documents that don’t fall into a DLQ, they make it to the “single page” queue. This queue drives each page through the “image conversion” Lambda function which converts the single page file from PDF to PNG format. These PNG files are stored in Amazon S3.

Sending for inference

At this point, the documents have been chunked up and are ready for inference.

When the single-page image files land in Amazon S3, an S3 Event Notification is fired which places a message in a “converted image” SQS queue which in turn triggers the “model endpoint” Lambda function. This function calls an API endpoint on an Amazon API Gateway that is fronting the Amazon SageMaker inference endpoint. Using API Gateway with SageMaker endpoints avoided throttling during Lambda function execution due to high volumes of concurrent calls to the Amazon SageMaker API. This pattern also resulted in a 2x inference throughput speedup. The Lambda function passes the document’s S3 bucket name and path to the API which in turn passes it to the auto scaling SageMaker endpoint. The function reads the inference results that are passed back from API Gateway and stores them in Amazon Aurora.
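A hedged sketch of what such a “model endpoint” Lambda handler could look like is shown below; the environment variable, payload shape and persistence stub are hypothetical, not the code from this engagement:

import json
import os
import urllib.request

API_URL = os.environ["INFERENCE_API_URL"]    # hypothetical API Gateway endpoint

def handler(event, context):
    # Triggered by the "converted image" SQS queue: one record per single-page PNG in S3.
    for record in event["Records"]:
        body = json.loads(record["body"])
        payload = json.dumps({"bucket": body["bucket"], "key": body["key"]}).encode()
        req = urllib.request.Request(API_URL, data=payload,
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            inference = json.loads(resp.read())
        save_result(body["key"], inference)

def save_result(page_key, inference):
    # Stub: persist the inference output and audit record to the Aurora database.
    print(page_key, inference)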

The inference results as well as all the telemetry collected as the document was processed can be queried from the Amazon Aurora database to build reports showing number of documents processed, number of documents with failures, and number of documents with or without whatever feature(s) the ML model is trained to look for.

Summary

This architecture is able to take PDF documents that range in size from single page up to thousands of pages or gigabytes in size, pre-process them into single page image files, and then send them for inference by a machine learning model. Once triggered, the pipeline is completely automated and is able to scale to tens of millions of pages per batch.

In keeping with the architectural goals of the project, Amazon SQS is used throughout in order to build a loosely coupled system which promotes agility, scalability, and resiliency. Loose coupling also enables a high degree of grip and control over the system making it easier to respond to changes in business needs as well as focusing tuning efforts for maximum impact. And with every compute component logging everything it does, the system provides a high degree of auditability and introspection which facilitates performance monitoring, and detailed cost optimization.

Formula 1: Using Amazon SageMaker to Deliver Real-Time Insights to Fans

Post Syndicated from Annik Stahl original https://aws.amazon.com/blogs/architecture/formula-1-using-amazon-sagemaker-to-deliver-real-time-insights-to-fans-live/

The Formula One Group (F1) is responsible for the promotion of the FIA Formula One World Championship, a series of auto racing events in 21 countries where professional drivers race single-seat cars on custom tracks or through city courses in pursuit of the World Championship title.

Formula 1 works with AWS to enhance its race strategies, data tracking systems, and digital broadcasts through a wide variety of AWS services—including Amazon SageMaker, AWS Lambda, and AWS analytics services—to deliver new race metrics that change the way fans and teams experience racing.

In this special live segment of This is My Architecture, you’ll get a look at what’s under the hood of Formula 1’s F1 Insights. Hear about the machine learning algorithms the company trains on Amazon SageMaker and how inferences are made during races to deliver insights to fans.

For more content like this, subscribe to our YouTube channels This is My Architecture, This is My Code, and This is My Model, or visit the This is My Architecture AWS website, which has search functionality and the ability to filter by industry, language, and service.

Delve into the Formula 1 case study to learn more about how AWS fuels analytics through machine learning.

Essential Suite — Artwork Producer Assistant

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/essential-suite-artwork-producer-assistant-8f2a760bc150

Essential Suite — Artwork Producer Assistant

By: Hamid Shahid & Syed Haq

Introduction

Netflix continues to invest in content for a global audience with a diverse range of unique tastes and interests. Correspondingly, the member experience must also evolve to connect this global audience to the content that most appeals to each of them. Images that represent titles on Netflix (what we at Netflix call “artwork”) have proven to be one of the most effective ways to help our members discover the content they love to watch. We thus need to have a rich and diverse set of artwork that is tailored for different parts of the Netflix experience (what we call product canvases). We also need to source multiple images for each title representing different themes so we can present an image that is relevant to each member’s taste.

Manual curation and review of these high quality images from scratch for a growing catalog of titles can be particularly challenging for our Product Creative Strategy Producers (referred to as producers in the rest of the article). Below, we discuss how we’ve built upon our previous work of harvesting static images directly from video source files and our computer vision algorithms to produce a set of artwork candidates that covers the major product canvases for the entire content catalog. The artwork generated by this pipeline is used to augment the artwork typically sourced from design agencies. We call this suite of assisted artwork “The Essential Suite”.

Supplement, not replace

Producers from our Creative Production team are the ultimate decision makers when it comes to the selection of artwork that gets published for each title. Our use of computer vision to generate artwork candidates from video sources is thus focused on alleviating the workload for our Creative Production team. The team would rather spend its time on creative and strategic tasks than sifting through thousands of frames of a show looking for the most compelling ones. With the “Essential Suite”, we are providing an additional tool in the producers’ toolkit. Through testing we have learned that, with proper checks and human curation in place, assisted artwork candidates can perform on par with agency-designed artwork.

Design Agencies

Netflix uses best-in-class design agencies to provide artwork that can be used to promote titles on and off the Netflix service. Netflix producers work closely with design agencies to request, review and approve artwork. All artwork is delivered through a web application made available to the design agencies.

The computer generated artwork can be considered as artwork provided by an “Internal agency”. The idea is to generate artwork candidates using video source files and “bubble it up” to the producers on the same artwork portal where they review all other artwork, ideally without knowing if it is an agency produced or internally curated artwork, thereby selecting what goes on product purely based on creative quality of the image.

Assisted Artwork Generation Workflow

The artwork generation process involves several steps, starting with the arrival of the video source files and culminating in generated artwork being made available to producers. We use Netflix Conductor, an open source workflow engine, to run the orchestration. The whole process can be divided into two parts:

  1. Generation
  2. Review

1. Generation

This article on AVA provides a good explanation of our technology for extracting interesting images from video source files. The artwork generation workflow takes it a step further. For a given product canvas, it selects, from the hundreds of video stills, the handful of images most suitable for that particular canvas. The workflow then crops and color-corrects the selected image, picks the best spot to place the movie’s title based on negative space, and then selects, resizes, and places the title onto the image.

Here is an illustration of what it would mean if we had to do it manually:

a. Image selection
b. Identify areas of interest
c. Cropped, color-corrected & title placed in the negative space

Image Selection / Analyze Image

Selection of the right still image is essential to generating good quality artwork. A lot of work has already been done in AVA to extract a few hundred frames from the hundreds of thousands of frames present in a typical video source. Broadly speaking, we use two methods to extract movie stills from a video source.

  1. AVA — AVA is primarily a character-based algorithm. It picks frames with a clear facial shot, taking into account the actors, facial expressions, and shot detection.
  2. Cinematics — Cinematics picks aesthetically pleasing cinematic shots.

The combination of these two approaches produces a few hundred movie stills from a typical video source; for a season, that is a few hundred stills for each episode. Our work here is to pick the stills that work best for the desired canvas.

Both of the above algorithms use a few heatmaps that define what kinds of images have proven to work best in different canvases. The heatmaps are designed by internal artists who are experienced in designing promotional artwork and posters.

Heatmap for a Billboard

We make use of meta-information such as the size of desired canvas, the “unsafe regions” and the “regions of interest” to identify what image would serve best. “Unsafe regions” are areas in the image where badges such as Netflix logo, new episodes, etc are placed. “Regions of interest” are areas that are always displayed in multi-purpose canvases. These details are stored as metadata for each canvas type and passed to the algorithm by the workflow. Some of our canvases are cropped dynamically for different user interfaces. For such images, the “Regions of interest” will be the area that is always displayed in each crop.
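
One hypothetical way to represent this per-canvas metadata; the field names and coordinates below are purely illustrative, not Netflix’s actual schema:

    from dataclasses import dataclass, field
    from typing import List, Tuple

    Box = Tuple[int, int, int, int]  # (x, y, width, height) in canvas pixels

    @dataclass
    class CanvasSpec:
        name: str
        width: int
        height: int
        unsafe_regions: List[Box] = field(default_factory=list)       # badges, logos, "new episodes" overlays
        regions_of_interest: List[Box] = field(default_factory=list)  # area visible in every dynamic crop

    billboard = CanvasSpec(
        name="billboard",
        width=3840,
        height=2160,
        unsafe_regions=[(0, 1900, 600, 260)],         # e.g. a bottom-left badge area
        regions_of_interest=[(1920, 0, 1920, 2160)],  # e.g. the right half is always shown
    )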

Unsafe regions

This data-driven approach allows for fast turnaround for additional canvases. While selecting images, the algorithm also returns suggested coordinates within each image for cropping and title placement. Finally, it associates a “score” with each selected image. This score is the algorithm’s confidence, based on previously collected stats, in how well the candidate image could perform on the service.

Image Creation

The artwork generation workflow collates image selection results from each video source and picks the top “n” images based on confidence score.
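
A minimal sketch of that collation step, assuming a simple dictionary structure for each candidate (the fields shown are illustrative):

    from typing import Dict, List

    def top_candidates(candidates: List[Dict], n: int = 5) -> List[Dict]:
        """Keep the n stills the algorithm is most confident about for a given canvas."""
        return sorted(candidates, key=lambda c: c["score"], reverse=True)[:n]

    stills = [
        {"frame": "ep01_f001234.png", "score": 0.91, "crop": (120, 0, 1700, 956)},
        {"frame": "ep02_f004821.png", "score": 0.84, "crop": (0, 40, 1820, 1024)},
        {"frame": "ep01_f009902.png", "score": 0.62, "crop": (60, 10, 1750, 984)},
    ]
    print(top_candidates(stills, n=2))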

The selected image is then cropped and color-corrected based on coordinates passed by the algorithm. Some canvases also need the movie title to be placed on the image. The process makes use of the heatmap provided by our designers to perform cropping and title placement. As an example, the “Billboard” canvas shown on a movie’s landing page is right aligned, with the title and synopsis shown on the left.

Billboard Canvas

The workers that crop and color-correct images are made available as separate Titus jobs. The workflow invokes the jobs, stores each output in the artwork asset management system, and passes it on for review.

2. Review

For each artwork candidate generated by the workflow, we want to get as much feedback as possible from the Creative Production team because they have the most context about the title. However, getting producers to provide feedback on hundreds of generated images is not scalable. For this reason, we have split the review process in two rounds.

Technical Quality Control (QC)

This round of review enables filtering out images that look obviously wrong to a human eye. Images with issues such as actors with an open mouth, inappropriate facial expressions, or an incorrect body position are filtered out in this round.

For the purpose of reviewing these images, we use a video/image annotation application that provides a simple interface to add tags for a given list of videos or images. For our purposes, for each image, we ask the very basic question “Should this image be used for artwork?”

The team reviewing these assets treats each image individually and looks only at the technical aspects of the image, regardless of the theme or genre of the title, or the quantity of images presented for a given title.

When an image is rejected, a few follow up questions are asked to ascertain why the image is not suitable to be used as artwork.

All of this review data is fed back to the image selection, cropping, and color correction algorithms to train and improve them.

Editorial QC

Unlike technical QC, which is title agnostic, editorial QC is done by producers who are deeply familiar with the themes, storylines and characters in the title, to select artwork that will represent the title best on the Netflix service.

The application used to review generated artwork is the same application that producers use to place and review artwork requests fulfilled by design agencies. A screenshot of how generated artwork is presented to producers is shown below.

As with technical QC, the option here for each artwork is to approve or reject it. The producers are encouraged to provide reasons when they reject an artwork.

Approved artwork makes its way to the artwork’s asset management system, where it resides alongside other agency-fulfilled artwork. From here, producers have the ability to publish it to the Netflix service.

Conclusion

We have learned a lot from our work on generating artwork. Artwork that looks good might not be the best depiction of the title’s story, and a very clear character image might be a content spoiler. Decisions like these are best made by humans, and we intend to keep it that way.

However, assisted artwork generation has a place in supporting our creative team by giving them another avenue from which to source assets, and with careful supervision it will help them meet the challenge of sourcing artwork at scale.


Essential Suite — Artwork Producer Assistant was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Build machine learning-powered business intelligence analyses using Amazon QuickSight

Post Syndicated from Osemeke Isibor original https://aws.amazon.com/blogs/big-data/build-machine-learning-powered-business-intelligence-analyses-using-amazon-quicksight/

Imagine you can see the future—to know how many customers will order your product months ahead of time so you can make adequate provisions, or to know how many of your employees will leave your organization several months in advance so you can take preemptive actions to encourage staff retention. For an organization that sees the future, the possibilities are limitless. Machine learning (ML) makes it possible to predict the future with a higher degree of accuracy.

Amazon SageMaker provides every developer and data scientist the ability to build, train, and deploy ML models quickly, but for business users who usually work on creating business intelligence dashboards and reports rather than ML models, Amazon QuickSight is the service of choice. With Amazon QuickSight, you can still use ML to forecast the future. This post goes through how to create business intelligence analyses that use ML to forecast future data points and detect anomalies in data, with no technical expertise or ML experience needed.

Overview of solution

Amazon QuickSight ML Insights uses AWS-proven ML and natural language capabilities to help you gain deeper insights from your data. These powerful, out-of-the-box features make it easy to discover hidden trends and outliers, identify key business drivers, and perform powerful what-if analysis and forecasting with no technical or ML experience. You can use ML insights in sales reporting, web analytics, financial planning, and more. You can detect insights buried in aggregates, perform interactive what-if analysis, and discover what activities you need to meet business goals.

This post imports data from Amazon S3 into Amazon QuickSight and creates ML-powered analyses with the imported data. The following diagram illustrates this architecture.

Walkthrough

In this walkthrough, you create an Amazon QuickSight analysis that contains ML-powered visuals that forecast the future demand for taxis in New York City. You also generate ML-powered insights to detect anomalies in your data. This post uses the New York City Taxi and Limousine Commission (TLC) Trip Record Data on the Registry of Open Data on AWS.

The walkthrough includes the following steps:

  1. Set up and import data into Amazon QuickSight
  2. Create an ML-powered visual to forecast the future demand for taxis
  3. Generate an ML-powered insight to detect anomalies in the data set

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account
  • Amazon QuickSight Enterprise edition
  • Basic knowledge of AWS

Setting up and importing data into Amazon QuickSight

Set up Amazon QuickSight as an individual user. Complete the following steps:

  1. On the AWS Management Console, in the Region list, select US East (N. Virginia) or any other Region of your choice that Amazon QuickSight supports.
  2. Under Analytics, for Services, choose Amazon QuickSight. If you already have an existing Amazon QuickSight account, make sure it is the Enterprise edition; if it is not, upgrade to Enterprise edition. For more information, see Upgrading your Amazon QuickSight Subscription from Standard Edition to Enterprise Edition. If you do not have an existing Amazon QuickSight account, proceed with the setup and make sure you choose Enterprise Edition when setting up the account. For more information, see Setup a Free Standalone User Account in Amazon QuickSight. After you complete the setup, a Welcome Wizard screen appears.
  3. Choose Next on each of the Welcome Wizard screens.
  4. Choose Get Started.

Before you import the data set, make sure that you have at least 3GB of SPICE capacity. For more information, see Managing SPICE Capacity.

Importing the NYC Taxi data set into Amazon QuickSight

The NYC Taxi data set is in an S3 bucket. To import S3 data into Amazon QuickSight, use a manifest file. For more information, see Supported Formats for Amazon S3 Manifest Files. To import your data, complete the following steps:

  1. In a new text file, copy and paste the following code:
    "fileLocations": [
            {
                "URIs": [
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-01.csv",
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-02.csv",
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-03.csv",
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-04.csv",
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-05.csv",
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-06.csv",
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-07.csv",
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-08.csv", 
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-09.csv",
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-10.csv",
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-11.csv", 
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-12.csv"
    
                ]
            }
        ],
        "globalUploadSettings": {
            "textqualifier": "\""
        }
    }

  2. Save the text file as nyc-taxi.json.
  3. On the Amazon QuickSight console, choose New analysis.
  4. Choose New data set.
  5. For data source, choose S3.
  6. Under New S3 data source, for Data source name, enter a name of your choice.
  7. For Upload a manifest file, select Upload.
  8. Choose the nyc-taxi.json file you created earlier.
  9. Choose Connect.
    The S3 bucket this post uses is a public bucket that contains a public data set. When using S3 buckets in your own account with Amazon QuickSight, it is highly recommended that the buckets are not open to the public; you need to configure authentication to access your S3 bucket from Amazon QuickSight. For more information about troubleshooting, see I Can’t Connect to Amazon S3. After you choose Connect, the Finish data set creation screen appears.
  10. Choose Visualize.
  11. Wait for the import to complete.

You can see the progress on the top right corner of the screen. When the import is complete, the result shows the number of rows imported successfully and the number of rows skipped.

Creating an ML-powered visual

After you import the data set into Amazon QuickSight SPICE, you can start creating analyses and visuals. Your goal is to create an ML-powered visual to forecast the future demand for taxis. For more information, see Forecasting and Creating What-If Scenarios with Amazon QuickSight.

To create your visual, complete the following steps:

  1. From the Data source details pop-up screen, choose Visualize.
  2. From the field list, select Lpep_pickup_datetime.
  3. Under Visual types, select the first visual. Amazon QuickSight automatically uses the best visual based on the number and data types of the fields you selected; from your selection, Amazon QuickSight displays a line chart visual. From the preceding graph, you can see that the bulk of your data clusters between December 31, 2017, and January 1, 2019, for the Lpep_pickup_datetime field. There are a few data points with dates up to June 2080. These values are incorrect and can impact your ML forecasts. To clean up your data set, filter out the incorrect data using the Lpep_pickup_datetime field. This post only uses data in which Lpep_pickup_datetime falls between January 1, 2018, and December 18, 2018, because there is a more consistent amount of data within this date range.
  4. Use the filter menu to create a filter using the Lpep_pickup_datetime field.
  5. Under Filter type, choose Time range and Between.
  6. For Start date, enter 2018-01-01 00:00.
  7. Select Include start date.
  8. For End date, enter 2018-12-18 00:00.
  9. Select Include end date.
  10. Choose Apply.

The line chart should now contain only data with Lpep_pickup_datetime between January 1, 2018, and December 18, 2018. You can now add the forecast for the next 31 days to the line chart visual.

Adding the forecast to your visual

To add your forecast, complete the following steps:

  1. On the visual, choose the arrow.
  2. From the drop-down menu, choose Add forecast.
  3. Under Forecast properties, for Forecast length, for Periods forward, enter 31.
  4. For Periods backwards, enter 0.
  5. For Prediction interval, leave at the default value 90.
  6. For Seasonality, leave at the default selection Automatic.
  7. Choose Apply.

You now see an orange line on your graph, which is the forecasted pickup quantity per day for the next 31 days after December 18, 2018. You can explore the different dates by hovering your cursor over different points on the forecasted pickup line. For example, hovering your cursor over January 10, 2019, shows that the expected forecasted number of pickups for that day is approximately 22,000. The forecast also provides an upper bound (maximum number of pickups forecasted) of about 26,000 and a lower bound (minimum number of pickups forecasted) of about 18,000.

You can create multiple visuals with forecasts and combine them into a sharable Amazon QuickSight dashboard. For more information, see Working with Dashboards.

Generating an ML-powered insight to detect anomalies

In Amazon QuickSight, you can add insights, autonarratives, and ML-powered anomaly detection to your analyses without ML expertise or knowledge. Amazon QuickSight generates suggested insights and autonarratives automatically, but for ML-powered anomaly detection, you need to perform additional steps. For more information, see Using ML-Powered Anomaly Detection.

This post checks if there are any anomalies in the total fare amount over time from select locations. For example, if the total fare charged for taxi rides is about $1,000 and above from the first pickup location (for example, the airport) for most dates in your data set, an anomaly is when the total fare charged deviates from the standard pattern. Anomalies are not necessarily negative, but rather abnormalities that you can choose to investigate further.

To create an anomaly insight, complete the following steps:

  1. From the top right corner of the analysis creation screen, click the Add drop-down menu and choose Add insight.
  2. On the Computation screen, for Computation type, select Anomaly detection.
  3. Under Fields list, choose the following fields:
  • fare_amount
  • lpep_pickup_datetime
  • PULocationID
  4. Choose Get started.
  5. The Configure anomaly detection screen appears.
  6. Choose Analyze all combinations of these categories.
  7. Leave the other settings at their defaults. You can now perform a contribution analysis and discover how the drop-off location contributed to the anomalies. For more information, see Viewing Top Contributors.
  8. Under Contribution analysis, choose DOLocationID.
  9. Choose Save.
  10. Choose Run Now. The anomaly detection can take up to 10 minutes to complete. If it is still running after about 10 minutes, your browser may have timed out. Refresh your browser and you should see the anomalies displayed in the visual.
  11. Choose Explore anomalies.

By default, the anomalies you see are for the last date in your data set. You can explore the anomalies across the entire date range of your data set by choosing SHOW ANOMALIES BY DATE and dragging the slider at the bottom of the visual to display the entire date range from January 1, 2018, to December 30, 2018.

This graph shows that March 21, 2018, has the highest number of fare-amount anomalies in the entire data set. For example, the total fare amount charged by taxis that picked up passengers from location 74 on March 21, 2018, was 7,181, a drop of about 64% from the roughly 19,728.5 total fare charged for the same pickup location on March 20, 2018. When you explore the anomalies of other pickup locations for that same date, you can see that they all have similar drops in the total fare charged. You can also see the top DOLocationID contributors to these anomalies.

What happened in New York City on March 21, 2018, to cause this drop? A quick online search reveals that New York City experienced a severe weather condition on March 21, 2018.

Publishing your analyses to a dashboard, sharing the dashboard, and setting up email alerts

You can create additional visuals to your analyses, publish the analyses as a dashboard, and share the dashboard with other users. QuickSight anomaly detection allows you to uncover hidden insights in your data by continuously analyzing billions of data points. You can subscribe to receive alerts to your inbox if an anomaly occurs in your business metrics. The email alert also indicates the factors that contribute to these anomalies. This allows you to act immediately on the business metrics that need attention.

From the QuickSight dashboard, you can configure an anomaly alert to be sent to your email with Severity set to High and above and Direction set to Lower than expected. Also make sure to schedule a data refresh so that the anomaly detection runs on your most recent data. For more information, see Refreshing Data.

Cleaning up

To avoid incurring future charges, you need to cancel your Amazon QuickSight subscription.

Conclusion

This post walked you through how to use ML-powered insights with Amazon QuickSight to forecast future data points, detect anomalies, and derive valuable insights from your data without needing prior experience or knowledge of ML. If you want to do more forecasting without ML experience, check out Amazon Forecast.

If you have questions or suggestions, please leave a comment.


About the Author

Osemeke Isibor is a partner solutions architect at AWS. He works with AWS Partner Network (APN) partners to design secure, highly available, scalable and cost optimized solutions on AWS. He is a science-fiction enthusiast and a fan of anime.


Open-Sourcing Metaflow, a Human-Centric Framework for Data Science

Post Syndicated from Netflix Technology Blog original https://medium.com/netflix-techblog/open-sourcing-metaflow-a-human-centric-framework-for-data-science-fa72e04a5d9?source=rss----2615bd06b42e---4

by David Berg, Ravi Kiran Chirravuri, Romain Cledat, Savin Goyal, Ferras Hamad, Ville Tuulos

tl;dr Metaflow is now open-source! Get started at metaflow.org.

Netflix applies data science to hundreds of use cases across the company, including optimizing content delivery and video encoding. Data scientists at Netflix relish our culture that empowers them to work autonomously and use their judgment to solve problems independently. We want our data scientists to be curious and take smart risks that have the potential for high business impact.

About two years ago, we at our newly formed Machine Learning Infrastructure team started asking our data scientists a question: “What is the hardest thing for you as a data scientist at Netflix?” We were expecting to hear answers related to large-scale data and models, and maybe issues related to modern GPUs. Instead, we heard stories about projects where getting the first version to production took surprisingly long — mainly because of mundane reasons related to software engineering. We heard many stories about difficulties related to data access and basic data processing. We sat in meetings where data scientists discussed with their stakeholders how best to version their models without impacting production. We saw how excited data scientists were about modern off-the-shelf machine learning libraries, but we also witnessed various issues caused by these libraries when they were casually included as dependencies in production workflows.

We realized that nearly everything that data scientists wanted to do was already doable technically, but nothing was easy enough. Our job as a Machine Learning Infrastructure team would therefore not be mainly about enabling new technical feats. Instead, we should make common operations so easy that data scientists would not even realize that they were difficult before. We would focus our energy solely on improving data scientist productivity by being fanatically human-centric.

How could we improve the quality of life for data scientists? The following picture started emerging:

Our data scientists love the freedom of being able to choose the best modeling approach for their project. They know that feature engineering is critical for many models, so they want to stay in control of model inputs and feature engineering logic. In many cases, data scientists are quite eager to own their own models in production, since it allows them to troubleshoot and iterate the models faster.

On the other hand, very few data scientists feel strongly about the nature of the data warehouse, the compute platform that trains and scores their models, or the workflow scheduler. Preferably, from their point of view, these foundational components should “just work”. If, and when they fail, the error messages should be clear and understandable in the context of their work.

A key observation was that most of our data scientists had nothing against writing Python code. In fact, plain-and-simple Python is quickly becoming the lingua franca of data science, so using Python is preferable to domain specific languages. Data scientists want to retain their freedom to use arbitrary, idiomatic Python code to express their business logic — like they would do in a Jupyter notebook. However, they don’t want to spend too much time thinking about object hierarchies, packaging issues, or dealing with obscure APIs unrelated to their work. The infrastructure should allow them to exercise their freedom as data scientists but it should provide enough guardrails and scaffolding, so they don’t have to worry about software architecture too much.

Introducing Metaflow

These observations motivated Metaflow, our human-centric framework for data science. Over the past two years, Metaflow has been used internally at Netflix to build and manage hundreds of data-science projects from natural language processing to operations research.

By design, Metaflow is a deceptively simple Python library:
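
The original post shows the library as a short code screenshot; the following is a minimal sketch in the same spirit (the flow name, placeholder data, and scores are illustrative assumptions, not the code from the post):

    from metaflow import FlowSpec, step


    class TrainTwoModelsFlow(FlowSpec):

        @step
        def start(self):
            self.train_data = list(range(100))      # placeholder dataset
            self.next(self.train_a, self.train_b)   # branch: run both trainings in parallel

        @step
        def train_a(self):
            self.model = "model-A"
            self.score = 0.83                       # placeholder metric
            self.next(self.choose)

        @step
        def train_b(self):
            self.model = "model-B"
            self.score = 0.91                       # placeholder metric
            self.next(self.choose)

        @step
        def choose(self, inputs):
            # Join step: keep the branch with the highest score. Artifacts such as
            # self.model are ordinary instance variables, persisted by Metaflow.
            best = max(inputs, key=lambda inp: inp.score)
            self.model, self.score = best.model, best.score
            self.next(self.end)

        @step
        def end(self):
            print(f"best model: {self.model} (score {self.score})")


    if __name__ == "__main__":
        TrainTwoModelsFlow()

Assuming the file is saved as train_two_models.py, running python train_two_models.py run executes the flow locally; the same code can later be sent to the cloud without modification.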

Data scientists can structure their workflow as a Directed Acyclic Graph of steps, as depicted above. The steps can be arbitrary Python code. In this hypothetical example, the flow trains two versions of a model in parallel and chooses the one with the highest score.

On the surface, this doesn’t seem like much. There are many existing frameworks, such as Apache Airflow or Luigi, which allow execution of DAGs consisting of arbitrary Python code. The devil is in the many carefully designed details of Metaflow: for instance, note how in the above example data and models are stored as normal Python instance variables. They work even if the code is executed on a distributed compute platform, which Metaflow supports by default, thanks to Metaflow’s built-in content-addressed artifact store. In many other frameworks, loading and storing of artifacts is left as an exercise for the user, which forces them to decide what should and should not be persisted. Metaflow removes this cognitive overhead.

Metaflow is packed with human-centric details like this, all of which aim at boosting data scientist productivity. For a comprehensive overview of all features of Metaflow, take a look at our documentation at docs.metaflow.org.

Metaflow on Amazon Web Services

Netflix’s data warehouse contains hundreds of petabytes of data. While a typical machine learning workflow running on Metaflow touches only a small shard of this warehouse, it can still process terabytes of data.

Metaflow is a cloud-native framework. It leverages elasticity of the cloud by design — both for compute and storage. Netflix has been one of the largest users of Amazon Web Services (AWS) for many years and we have accumulated plenty of operational experience and expertise in dealing with the cloud, AWS in particular. For the open-source release, we partnered with AWS to provide a seamless integration between Metaflow and various AWS services.

Metaflow comes with built-in capability to snapshot all code and data in Amazon S3 automatically, which is a key value proposition of our internal Metaflow setup. This provides us with a comprehensive solution for versioning and experiment tracking without any user intervention, which is core to any production-grade machine learning infrastructure.

In addition, Metaflow comes bundled with a high-performance S3 client, which can load data up to 10Gbps. This client has been massively popular amongst our users, who can now load data into their workflows an order of magnitude faster than before, enabling faster iteration cycles.

For general purpose data processing, Metaflow integrates with AWS Batch, which is a managed, container-based compute platform provided by AWS. The user can benefit from infinitely scalable compute clusters by adding a single line in their code: @batch. For training machine learning models, besides writing their own functions, the user has the choice to use Amazon SageMaker, which provides high-performance implementations of various models, many of which support distributed training.

Metaflow supports all common off-the-shelf machine learning frameworks through our @conda decorator, which allows the user to specify external dependencies for their steps safely. The @conda decorator freezes the execution environment, providing good guarantees of reproducibility, both when executed locally as well as in the cloud.
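
As a rough illustration of how these decorators compose (the resource sizes and library version here are assumptions, and the training step is a stub):

    from metaflow import FlowSpec, step, batch, conda


    class RemoteTrainFlow(FlowSpec):

        @step
        def start(self):
            self.next(self.train)

        @conda(libraries={"scikit-learn": "0.23.1"})   # freeze the step's external dependencies
        @batch(cpu=4, memory=16000)                    # run this step on AWS Batch
        @step
        def train(self):
            from sklearn.linear_model import LogisticRegression
            self.model = LogisticRegression()          # placeholder; a real step would fit the model
            self.next(self.end)

        @step
        def end(self):
            print("done")


    if __name__ == "__main__":
        RemoteTrainFlow()

With the @conda decorator in play, the flow is typically launched with an extra flag, for example python remote_train.py --environment=conda run.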

For more details, read this page about Metaflow’s integration with AWS.

From Prototype To Production

Out of the box, Metaflow provides a first-class local development experience. It allows data scientists to develop and test code quickly on their laptops, just like any Python script. If a workflow supports parallelism, Metaflow takes advantage of all the CPU cores available on the development machine.

We encourage our users to deploy their workflows to production as soon as possible. In our case, “production” means a highly available, centralized DAG scheduler, Meson, where users can export their Metaflow runs for execution with a single command. This allows them to start testing their workflow with regularly updating data quickly, which is a highly effective way to surface bugs and issues in the model. Since Meson is not available in open-source, we are working on providing a similar integration to AWS Step Functions, which is a highly available workflow scheduler.

In a complex business environment like Netflix’s, there are many ways to consume the results of a data science workflow. Often, the final results are written to a table, to be consumed by a dashboard. Sometimes, the resulting model is deployed as a microservice to support real-time inferencing. It is also common to chain workflows so that the results of a workflow are consumed by another. Metaflow supports all these modalities, although some of these features are not yet available in the open-source version.

When it comes to inspecting the results, Metaflow comes with a notebook-friendly client API. Most of our data scientists are heavy users of Jupyter notebooks, so we decided to focus our UI efforts on a seamless integration with notebooks, instead of providing a one-size-fits-all Metaflow UI. Our data scientists can build custom model UIs in notebooks, fetching artifacts from Metaflow, which provide just the right information about each model. A similar experience is available with Amazon SageMaker notebooks and open-source Metaflow.

Get Started With Metaflow

Metaflow has been eagerly adopted inside of Netflix, and today, we are making Metaflow available as an open-source project.

We hope that our vision of data scientist autonomy and productivity resonates outside Netflix as well. We welcome you to try Metaflow, start using it in your organization, and participate in its development.

You can find the project home page at metaflow.org and the code at github.com/Netflix/metaflow. Metaflow is comprehensively documented at docs.metaflow.org. The quickest way to get started is to follow our tutorial. If you want to learn more before getting your hands dirty, you can watch presentations about Metaflow at the high level or dig deeper into the internals of Metaflow.

If you have any questions, thoughts, or comments about Metaflow, you can find us at Metaflow chat room or you can reach us by email at [email protected]. We are eager to hear from you!


Open-Sourcing Metaflow, a Human-Centric Framework for Data Science was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Open-sourcing Polynote: an IDE-inspired polyglot notebook

Post Syndicated from Netflix Technology Blog original https://medium.com/netflix-techblog/open-sourcing-polynote-an-ide-inspired-polyglot-notebook-7f929d3f447?source=rss----2615bd06b42e---4

Jeremy Smith, Jonathan Indig, Faisal Siddiqi

We are pleased to announce the open-source launch of Polynote: a new, polyglot notebook with first-class Scala support, Apache Spark integration, multi-language interoperability including Scala, Python, and SQL, as-you-type autocomplete, and more.

Polynote provides data scientists and machine learning researchers with a notebook environment that allows them the freedom to seamlessly integrate our JVM-based ML platform — which makes heavy use of Scala — with the Python ecosystem’s popular machine learning and visualization libraries. It has seen substantial adoption among Netflix’s personalization and recommendation teams, and it is now being integrated with the rest of our research platform.

At Netflix, we have always felt strongly about sharing with the open source community, and believe that Polynote has a great potential to address similar needs outside of Netflix.

Feature Overview

Reproducibility

Polynote promotes notebook reproducibility by design. By taking a cell’s position in the notebook into account when executing it, Polynote helps prevent bad practices that make notebooks difficult to re-run from the top.

Editing Improvements

Polynote provides IDE-like features such as interactive autocomplete and parameter hints, in-line error highlighting, and a rich text editor with LaTeX support.

Visibility

The Polynote UI provides at-a-glance insights into the state of the kernel by showing kernel status, highlighting currently-running cell code, and showing currently executing tasks.

Polyglot

Each cell in a notebook can be written in a different language with variables shared between them. Currently Scala, Python, and SQL cell types are supported.

Dependency and Configuration Management

Polynote provides configuration and dependency setup saved within the notebook itself, and helps solve some of the dependency problems commonly experienced by Spark developers.

Data Visualization

Native data exploration and visualization help users learn more about their data without cluttering their notebooks. Integration with matplotlib and Vega allows power users to communicate with others through beautiful visualizations.

Reimagining the Scala notebook experience

On the Netflix Personalization Infrastructure team, our job is to accelerate machine learning innovation by building tools that can remove pain points and allow researchers to focus on research. Polynote originated from a frustration with the shortcomings of existing notebook tools, especially with respect to their support of Scala.

For example, while Python developers are used to working inside an environment constructed using a package manager with a relatively small number of dependencies, Scala developers typically work in a project-based environment with a build tool managing hundreds of (often) conflicting dependencies. With Spark, developers are working in a cluster computing environment where it is imperative that their distributed code runs in a consistent environment no matter which node is being used. Finally, we found that our users were also frustrated with the code editing experience within notebooks, especially those accustomed to using IntelliJ IDEA or Eclipse.

Some problems are unique to the notebook experience. A notebook execution is a record of a particular piece of code, run at a particular point in time, in a particular environment. This combination of code, data, and execution results in a single document makes notebooks powerful, but also difficult to reproduce. Indeed, the scientific computing community has documented some notebook reproducibility concerns as well as some best practices for reproducible notebooks.

Finally, another problem that might be unique to the ML space is the need for polyglot support. Machine learning researchers often work in multiple programming languages — for example, researchers might use Scala and Spark to generate training data (cleaning, subsampling, etc), while actual training might be done with popular Python ML libraries like tensorflow or scikit-learn.

Next, we’ll go through a deeper dive of Polynote’s features.

Reproducible by Design

Two of Polynote’s guiding principles are reproducibility and visibility. To further these goals, one of our earliest design decisions was to build Polynote’s code interpretation from scratch, rather than relying on a REPL like a traditional notebook.

We feel that while REPLs are great in general, they are fundamentally unfit for the notebook model. In order to understand the problems with REPLs and notebooks, let’s take a look at the design of a typical notebook environment.

A notebook is an ordered collection of cells, each of which can hold code or text. The contents of each cell can be modified and executed independently. Cells can be rearranged, inserted, and deleted. They can also depend on the output of other cells in the notebook.

Contrast this with a REPL environment. In a REPL session, a user inputs expressions into the prompt one at a time. Once evaluated, expressions and the results of their evaluation are immutable. Evaluation results are appended to the global state available to the next expression.

Unfortunately, the disconnect between these two models means that a typical notebook environment, which uses a REPL session to evaluate cell code, causes hidden state to accrue as users interact with the notebook. Cells can be executed in any order, mutating this global hidden state that in turn affects the execution of other cells. More often than not, notebooks are unable to be reliably rerun from the top, which makes them very difficult to reproduce and share with others. The hidden state also makes it difficult for users to reason about what’s going on in the notebook.

In other notebooks, hidden state means that a variable is still available after its cell is deleted.
In a Polynote notebook, there is no hidden state. A deleted cell’s variables are no longer available.

Writing Polynote’s code interpretation from scratch allowed us to do away with this global, mutable state. By keeping track of the variables defined in each cell, Polynote constructs the input state for a given cell based on the cells that have run above it. Making the position of a cell important in its execution semantics enforces the principle of least surprise, allowing users to read the notebook from top to bottom. It ensures reproducibility by making it far more likely that running the notebook sequentially will work.
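
To make the idea concrete, here is a toy, single-process illustration of position-based state construction. This is not Polynote’s implementation (Polynote interprets Scala, Python, and SQL cells inside a kernel); it only sketches the principle that a cell’s input state is derived from the cells above it:

    def run_notebook(cells):
        """cells: list of source strings, in notebook order."""
        results = []  # per-cell namespaces, in notebook order
        for position, source in enumerate(cells):
            input_state = {}
            for upstream in results[:position]:   # only cells *above* contribute
                input_state.update(upstream)
            namespace = dict(input_state)
            exec(source, namespace)               # run the cell against its derived input state
            # keep only what this cell defined or changed
            produced = {k: v for k, v in namespace.items()
                        if k not in input_state or input_state[k] is not v}
            produced.pop("__builtins__", None)
            results.append(produced)
        return results

    states = run_notebook(["x = 1", "y = x + 1", "print(x, y)"])

Because each cell’s input state is rebuilt from the cells above it on every run, deleting a cell really does remove its variables, matching the behavior described above.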

Better editing

Let’s face it — for someone used to IDEs, writing a nontrivial amount of code in a notebook can feel like going back in time a few decades. We’ve seen users who prefer to write code in an IDE instead, and paste it into the notebook to run. While it’s not our goal to provide all the features of a full-fledged modern IDE, there are a few quality-of-life code editing enhancements that go a long way toward improving usability.

Code editing in Polynote integrates with the Monaco editor for interactive auto-complete.
Polynote highlights errors inside the code to help users quickly figure out what’s gone wrong.
Polynote provides a rich text editor for text cells.
The rich text editor allows users to easily insert LaTeX equations.

Visibility

As we mentioned earlier, visibility is one of Polynote’s guiding principles. We want it to be easy to see what the kernel is doing at any given time, without needing to dive into logs. To that end, Polynote provides a variety of UI treatments that let users know what’s going on.

Here’s a snapshot of Polynote in the midst of some code execution.

There’s quite a bit of information available to the user from a single glance at this UI. First, it is clear from both the notebook view and task list that Cell 1 is currently running. We can also see that Cells 2 through 4 are queued to be run, in that order.

We can also see that the exact statement currently being run is highlighted in blue — the line defining the value `sumOfRandomNumbers`. Finally, since evaluating that statement launches a Spark job, we can also see job- and stage-level Spark progress information in the task list.

Here’s an animation of that execution so we can see how Polynote makes it easy to follow along with the state of the kernel.

Executing a Polynote notebook

The symbol table provides insight into the notebook’s internal state. When a cell is selected, the symbol table shows any values that resulted from the current cell’s execution above a black line, and any values available to the cell (from previous cells) below the line. At the end of the animation, we show the symbol table updating as we click on each cell in turn.

Finally, the kernel status area provides information about the execution status of the kernel. Below, we show a closeup view of how the kernel status changes from idle and connected, in green, to busy, in yellow. Other states include disconnected, in gray, and dead or not started, in red.

Kernel status changing from green (idle and connected) to yellow (busy)

Polyglot

You may have noticed in the screenshots shown earlier that each cell has a language dropdown in its toolbar. That’s because Polynote supports truly polyglot notebooks, where each cell can be written in a different language!

When a cell is run, the kernel provides the available typed input values to the cell’s language interpreter. In turn, the interpreter provides the resulting typed output values back to the kernel. This allows cells in Polynote notebooks to operate within the same context, and use the same shared state, regardless of which language they are defined in — so users can pick the best tool for the job at hand.

Here’s an example using scikit-learn, a Python library, to compute an isotonic regression of a dataset generated with Scala. This code is adapted from the Isotonic Regression example on the scikit-learn website.
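
For readers who want to try the analysis half outside Polynote, here is a small single-language sketch with scikit-learn; in the Polynote notebook the data-generation cell would be written in Scala, whereas here both the (synthetic) data and the fit are in Python:

    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    rng = np.random.RandomState(0)
    n = 100
    x = np.arange(n)
    y = rng.randint(-50, 50, size=n) + 50.0 * np.log1p(np.arange(n))  # noisy, increasing trend

    ir = IsotonicRegression()
    y_fit = ir.fit_transform(x, y)   # non-decreasing fit to the noisy data

    print(y_fit[:5])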

A polyglot example showing data generation in Scala and data analysis in Python

As this example shows, Polynote enables users to fluently move from one language to another within the same notebook.

Dependency and Configuration Management

In order to better facilitate reproducibility, Polynote stores configuration and dependency information directly in the notebook itself, rather than relying on external files or a cluster/server level configuration. We found that managing dependencies directly in the notebook code was clunky and could be confusing to users. Instead, Polynote provides a user-friendly Configuration section where users can set dependencies for each notebook.

Polynote’s Configuration UI, providing user-friendly, notebook-level configuration and dependency management

With this configuration, Polynote constructs an environment for the notebook. It fetches the dependencies locally (using Coursier or pip to fetch them from a repository) and loads the Scala dependencies into an isolated ClassLoader to reduce the chances of a class conflict with Spark libraries. Python dependencies are loaded into an isolated virtualenv. When Polynote is used in Spark mode, it creates a Spark Session for the notebook which uses the provided configuration. The Python and Scala dependencies are automatically added to the Spark Session.

Data Visualization

One of the most important use cases of notebooks is the ability to explore and visualize data. Polynote integrates with two of the most popular open source visualization libraries, Vega and Matplotlib.

While matplotlib integration is quite standard among notebooks, Polynote also has native support for data exploration — including a data schema view, table inspector, plot constructor and Vega support.

We’ll walk through a quick example of some data analysis and exploration using the tools mentioned above, based on the Wine Reviews dataset from Kaggle. First, here’s an example of simply loading the data in Spark, viewing the schema, plotting it, and saving that plot in the notebook.

Example of data exploration using the plot constructor

Let’s focus on some of what we’re seeing here.

View of the quick inspector, showing the DataFrame’s schema. The blue arrow points to the quick access buttons to the table view (left) and plot view (right)

If the last statement of a cell is an expression, it gets assigned to the cell’s Out variable. Polynote will display a representation of the result in a fashion determined by its data type. If it’s a table-like data type, such as a DataFrame or collection of case classes, Polynote shows the quick inspector, allowing users to see schema and type information at a glance.

The quick inspector also provides two buttons that bring up the full data inspector — the button on the left brings up the table view, while the button on the right brings up the plot constructor. The animation also shows the plot constructor and how users can drag and drop measures and dimensions to create different plots.

We also show how to save a plot to the notebook as its own cell. Because Polynote natively supports Vega specs, saving the plot simply inserts a new Vega cell with a generated spec. As with any other language, Vega specs can leverage polyglot support to refer to values from previous cells. In this case, we’re using the Out value (a DataFrame) and performing additional aggregations on it. This enables efficient plotting without having to bring millions of data points to the client. Polynote’s Vega spec language provides an API for aggregating and otherwise modifying table-like data streams.

A Vega cell generated by the plot constructor, showing its spec

Vega cells don’t need to be authored using the plot constructor — any Vega spec can be put into a Vega cell and plotted directly, as seen below.

Vega’s Stacked Area Chart Example displayed in Polynote

In addition to the cell result value, any variable in the symbol table can be inspected with a click.

Inspecting a variable in the symbol table

The road ahead

We have described some of the key features of Polynote here. We’re proud to share Polynote widely by open sourcing it, and we’d love to hear your feedback. Take it for a spin today by heading over to our website or directly to the code and let us know what you think! Take a look at our currently open issues to see what we’re planning, and, of course, PRs are always welcome! Polynote is still very much in its infancy, so you may encounter some rough edges. It is also a powerful tool that enables arbitrary code execution (“with great power, comes great responsibility”), so please be cognizant of this when you use it in your environment.

Plenty of exciting work lies ahead. We are very optimistic about the potential of Polynote and we hope to learn from the community just as much as we hope they will find value from Polynote. If you are interested in working on Polynote or other Machine Learning research, engineering and infrastructure problems, check out the Netflix Research site as well as some of the current openings.

Acknowledgements

Many colleagues at Netflix helped us in the early stages of Polynote’s development. We would like to express our tremendous gratitude to Aish Fenton, Hua Jiang, Kedar Sadekar, Devesh Parekh, Christopher Alvino, and many others who provided thoughtful feedback along their journey as early adopters of Polynote.


Open-sourcing Polynote: an IDE-inspired polyglot notebook was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.